ESR Microservice Best Practices
Goals / Things to consider.
Reliability
REST API Calls
If there is a network problem, REST API calls should not hang indefinitely but should time out after a configurable amount of time.
If a REST API call fails because of a transient error, the call should be retried.
See ticket https://hee-tis.atlassian.net/browse/TISNEW-3785
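As a minimal sketch (assuming the services use Spring's RestTemplate; the bean names and the timeout/retry values are illustrative, not taken from the codebase), timeouts and limited retries could be configured like this:

```java
import java.time.Duration;

import org.springframework.boot.web.client.RestTemplateBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.retry.support.RetryTemplate;
import org.springframework.web.client.ResourceAccessException;
import org.springframework.web.client.RestTemplate;

@Configuration
public class RestClientConfig {

    // Timeouts stop a call from hanging indefinitely when the network misbehaves.
    @Bean
    public RestTemplate restTemplate(RestTemplateBuilder builder) {
        return builder
                .setConnectTimeout(Duration.ofSeconds(5))  // illustrative values,
                .setReadTimeout(Duration.ofSeconds(30))    // should come from configuration
                .build();
    }

    // Retry transient failures (e.g. connection resets) a limited number of times with back-off.
    @Bean
    public RetryTemplate restRetryTemplate() {
        return RetryTemplate.builder()
                .maxAttempts(3)
                .exponentialBackoff(1000, 2, 10000)  // initial interval, multiplier, max interval (ms)
                .retryOn(ResourceAccessException.class)
                .build();
    }
}
```

A caller would then wrap the request, e.g. restRetryTemplate.execute(ctx -> restTemplate.getForObject(url, String.class)), so transient failures are retried a bounded number of times rather than failing immediately.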
Rabbit Message Processing
Rejected Message Processing
If we get an optimistic lock failure during a MongoDB update while processing an event, an exception should be thrown so that the message is placed back on the same Rabbit queue for reprocessing.
Currently, if we throw an AmqpRejectAndDontRequeueException, the message is rejected straight away and handled by the 'queue.dlx.policy', which applies to every queue and sends the message to the 'dlx.main' exchange, which routes it to the 'esr.dlq.all' queue.
We need to extend our Rabbit error handling capabilities to include configurable, limited retries, so that when a message is finally rejected the cause of the failure is stored in a header along with the message.
See Ticket https://hee-tis.atlassian.net/browse/TISNEW-3782
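One possible shape for this, using Spring AMQP's retry interceptor and RepublishMessageRecoverer (the attempt count, back-off values and the 'esr.dlq.all' routing key are assumptions; the recoverer already records the failure cause in x-exception-message and x-exception-stacktrace headers on the republished message):

```java
import org.springframework.amqp.rabbit.config.RetryInterceptorBuilder;
import org.springframework.amqp.rabbit.core.RabbitTemplate;
import org.springframework.amqp.rabbit.retry.RepublishMessageRecoverer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.retry.interceptor.RetryOperationsInterceptor;

@Configuration
public class RabbitRetryConfig {

    // Limited, backed-off retries; once exhausted, the message is republished to the
    // dead-letter exchange with the exception message and stack trace stored in headers.
    @Bean
    public RetryOperationsInterceptor retryInterceptor(RabbitTemplate rabbitTemplate) {
        return RetryInterceptorBuilder.stateless()
                .maxAttempts(3)                    // illustrative, should be configurable
                .backOffOptions(1000, 2.0, 10000)  // initial interval, multiplier, max interval (ms)
                .recoverer(new RepublishMessageRecoverer(rabbitTemplate, "dlx.main", "esr.dlq.all"))
                .build();
    }
}
```

The interceptor would then be added to the listener container factory's advice chain (SimpleRabbitListenerContainerFactory.setAdviceChain(...)) so it applies to all consumers.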
Message Processing Control
It might be desirable to be able to control the processing of each message type separately. You can imagine a situation where, for some unexpected reason, we need the Export Service (say) to temporarily suspend the processing of one of its several message types. At the moment we can't do this. A flag for each message type per microservice, which can be inspected and updated at runtime, might do the job here (one possible approach is sketched below). This would also help us perform clean shutdowns: switch all the processing off, wait a minute for current processing to finish, then shut down the microservice.
NO TICKET RAISED
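A minimal sketch of one way to do this, assuming each @RabbitListener is registered with an id per message type (the endpoint path and listener ids are illustrative, not taken from the codebase):

```java
import org.springframework.amqp.rabbit.listener.MessageListenerContainer;
import org.springframework.amqp.rabbit.listener.RabbitListenerEndpointRegistry;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class MessageProcessingController {

    private final RabbitListenerEndpointRegistry registry;

    public MessageProcessingController(RabbitListenerEndpointRegistry registry) {
        this.registry = registry;
    }

    // Pause or resume the listener for one message type at runtime.
    // The listener id matches the id given on the corresponding @RabbitListener annotation.
    @PostMapping("/processing/{listenerId}")
    public String toggle(@PathVariable String listenerId, @RequestParam boolean enabled) {
        MessageListenerContainer container = registry.getListenerContainer(listenerId);
        if (container == null) {
            return "No listener registered with id " + listenerId;
        }
        if (enabled) {
            container.start();
        } else {
            container.stop();  // no new messages are consumed; in-flight work is given time to finish
        }
        return listenerId + (enabled ? " started" : " stopped");
    }
}
```

Stopping a container halts new deliveries while letting in-flight messages complete, which is also the building block for a clean shutdown of the whole service.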
REST Based Trigger Retry
At the moment, AWS Lambda invokes a REST API call on the EsrInboundDataReaderService service to download and process an inbound ESR file. If that processing fails, the failure will not result in a message on the dead letter queue, nor will it result in a retry. By having the trigger endpoint send a RabbitMQ message back into the same service, we can benefit from the existing RabbitMQ error handling and retry support.
See Ticket https://hee-tis.atlassian.net/browse/TISNEW-3780
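A sketch of what the trigger endpoint could look like (the path, exchange and routing key names are placeholders, not the real ones):

```java
import org.springframework.amqp.rabbit.core.RabbitTemplate;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class InboundFileTriggerController {

    private final RabbitTemplate rabbitTemplate;

    public InboundFileTriggerController(RabbitTemplate rabbitTemplate) {
        this.rabbitTemplate = rabbitTemplate;
    }

    // Instead of downloading and processing the file inline, publish a message back to
    // ourselves so the normal Rabbit retry/DLQ machinery applies if processing fails.
    @PostMapping("/api/inbound-file")
    public ResponseEntity<Void> trigger(@RequestParam String fileName) {
        rabbitTemplate.convertAndSend("esr.inbound.exchange", "esr.inbound.file", fileName);
        return ResponseEntity.accepted().build();
    }
}
```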
Cron Based Task Retry
In the Export Service, there are two types of events: Message Events and Cron (scheduled) Events.
When a Message Event is received from Rabbit, if there is an optimistic lock failure, that message can be returned to the Queue and there will be further attempts to re-process that message.
Currently, when a Cron (scheduled) Event occurs, if there is an optimistic lock failure there is no retry mechanism. We should look at having the Cron event generate and send Rabbit events. By doing this, the Rabbit events can be re-processed in the event of an optimistic lock failure.
See Ticket https://hee-tis.atlassian.net/browse/TISNEW-3786
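A minimal sketch of the idea, where the scheduled job only publishes an event and the existing Rabbit listener does the actual work (the cron expression, exchange and routing key are illustrative):

```java
import org.springframework.amqp.rabbit.core.RabbitTemplate;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class ExportCronPublisher {

    private final RabbitTemplate rabbitTemplate;

    public ExportCronPublisher(RabbitTemplate rabbitTemplate) {
        this.rabbitTemplate = rabbitTemplate;
    }

    // The cron job only publishes an event; the work is done by a Rabbit listener,
    // so an optimistic lock failure there can be retried like any other message.
    // Requires @EnableScheduling on a configuration class.
    @Scheduled(cron = "0 0 2 * * *")  // illustrative schedule
    public void triggerNightlyExport() {
        rabbitTemplate.convertAndSend("esr.export.exchange", "esr.export.nightly", "nightly-export");
    }
}
```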
MongoDB Transactions
We should investigate the use of MongoDB transactions. It is common in the Export Service to update more than one Mongo document in response to an incoming Message Event. When an optimistic lock exception occurs, we could already have updated other Mongo documents (in the same or other collections). Transactions provide a clean, easy to reason about mechanism to ensure that either all updates happen or none do.
See Ticket https://hee-tis.atlassian.net/browse/TISNEW-3764
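A sketch of how this could look with Spring Data MongoDB (note that MongoDB transactions require a replica set; the collection names below are illustrative, not the real ones):

```java
import org.bson.Document;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.mongodb.MongoDatabaseFactory;
import org.springframework.data.mongodb.MongoTransactionManager;
import org.springframework.data.mongodb.core.MongoTemplate;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Configuration
class MongoTxConfig {

    // Registering a MongoTransactionManager enables @Transactional for MongoDB.
    @Bean
    MongoTransactionManager transactionManager(MongoDatabaseFactory dbFactory) {
        return new MongoTransactionManager(dbFactory);
    }
}

@Service
class ExportUpdateService {

    private final MongoTemplate mongoTemplate;

    ExportUpdateService(MongoTemplate mongoTemplate) {
        this.mongoTemplate = mongoTemplate;
    }

    // Either both documents are saved or neither is; if the second save throws
    // (e.g. an optimistic lock failure), the first one is rolled back too.
    @Transactional
    public void applyEvent(Document position, Document placement) {
        mongoTemplate.save(position, "positions");    // collection names are illustrative
        mongoTemplate.save(placement, "placements");
    }
}
```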
Deployment
The deployment should use Docker resource memory and processor settings rather than relying on JVM-level settings.
Monitoring
Message Auditing for Analytics
Currently we copy every message sent to the main exchange. The intention is that this will allow us to do some analytics. Work is ongoing to move to a Neo4j graph database which will provide better analytics support.
When a message is rejected by a Rabbit consumer, some details of the rejection are stored in the "x-death" header. See https://www.rabbitmq.com/dlx.html. We need to investigate whether we can add the Java exception class and message to the x-death header, or another header, to support root cause analysis (why was the message rejected?).
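The x-death header itself is managed by RabbitMQ, but exception details can be attached when we republish a failed message. A sketch of one way to do this with Spring AMQP, by extending RepublishMessageRecoverer (which already adds x-exception-message and x-exception-stacktrace); the x-exception-class header name is our own convention, not a RabbitMQ standard:

```java
import java.util.HashMap;
import java.util.Map;

import org.springframework.amqp.core.Message;
import org.springframework.amqp.rabbit.core.RabbitTemplate;
import org.springframework.amqp.rabbit.retry.RepublishMessageRecoverer;

public class AuditingMessageRecoverer extends RepublishMessageRecoverer {

    public AuditingMessageRecoverer(RabbitTemplate template, String errorExchange, String errorRoutingKey) {
        super(template, errorExchange, errorRoutingKey);
    }

    // Extra headers added to the message when it is republished to the error exchange.
    @Override
    protected Map<String, Object> additionalHeaders(Message message, Throwable cause) {
        Map<String, Object> headers = new HashMap<>();
        headers.put("x-exception-class", cause.getClass().getName());
        return headers;
    }
}
```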
Performance Profiling
It would be great if we could see in a dashboard or graph how long it takes to process each message type, or more generally identify performance bottlenecks in the system.
The audit service records each message on the system but it might not be straightforward to derive processing times. We can either programmatically send extra ‘processed’ events or look at something like AWS X-Ray or Spring Cloud Sleuth.
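If we go the programmatic route, a Micrometer timer around message handling would give us per-message-type durations. A minimal sketch (the queue name and metric name are illustrative):

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.amqp.rabbit.annotation.RabbitListener;
import org.springframework.stereotype.Component;

@Component
public class TimedMessageListener {

    private final MeterRegistry meterRegistry;

    public TimedMessageListener(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
    }

    // Record how long each message type takes to process; the metric can then be
    // graphed in CloudWatch (or any other Micrometer-backed dashboard).
    @RabbitListener(queues = "esr.queue.position")  // hypothetical queue name
    public void onPositionMessage(String payload) {
        Timer.Sample sample = Timer.start(meterRegistry);
        try {
            process(payload);
        } finally {
            sample.stop(meterRegistry.timer("esr.message.processing", "type", "position"));
        }
    }

    private void process(String payload) {
        // actual processing elided
    }
}
```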
Application Log Messages
We should be able to control the logging level of a running microservice, ideally via JMX and/or a simple REST call. We should be able to change the logging level of the root logger and of individual package-level loggers too.
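Spring Boot Actuator's loggers endpoint provides this out of the box; the sketch below shows the underlying idea with a simple REST endpoint and Logback (the path and parameter names are illustrative):

```java
import ch.qos.logback.classic.Level;
import ch.qos.logback.classic.LoggerContext;
import org.slf4j.LoggerFactory;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class LoggingLevelController {

    // Change the level of the root logger (loggerName = "ROOT") or any
    // package-level logger at runtime, without restarting the service.
    @PostMapping("/logging")
    public String setLevel(@RequestParam String loggerName, @RequestParam String level) {
        LoggerContext context = (LoggerContext) LoggerFactory.getILoggerFactory();
        context.getLogger(loggerName).setLevel(Level.toLevel(level));
        return loggerName + " set to " + level;
    }
}
```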
A microservice's log messages should not be stored within the container. For example, we could use a third-party log management service such as https://www.papertrail.com/.
If using EC2 to run Docker containers:
mount an EC2 directory into the Docker container for the logs
use the CloudWatch agent to send log messages to CloudWatch Logs.
If using AWS Fargate to run containers, the container logs can be sent straight to CloudWatch.
If we send the log messages to something like CloudWatch, we can manage the logs from there and not have to worry about the traditional log file management tasks of rotating logs daily or by size. CloudWatch also gives us a central place to view and search log messages.
There are probably too many log statements logged at INFO level. The code should be changed so that more log statements use DEBUG level.
Healthchecks
Ideally, each microservice can expose a healthcheck, which we should be able to feed into a monitoring dashboard. These healthchecks will also help Docker etc. automatically maintain the desired number of healthy instances.
Spring Boot Actuator endpoints should be enabled to allow JMX access, toggling of logging levels, and info on each microservice.
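Spring Boot already ships health indicators for Rabbit and Mongo, so much of this comes for free; a custom check would only be needed for anything the built-ins don't cover. As an illustration, a minimal sketch of a custom HealthIndicator that contributes to /actuator/health:

```java
import org.springframework.amqp.rabbit.core.RabbitTemplate;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

@Component
public class RabbitConnectivityHealthIndicator implements HealthIndicator {

    private final RabbitTemplate rabbitTemplate;

    public RabbitConnectivityHealthIndicator(RabbitTemplate rabbitTemplate) {
        this.rabbitTemplate = rabbitTemplate;
    }

    // Contributes to /actuator/health, which Docker/ECS health checks and
    // monitoring dashboards can poll.
    @Override
    public Health health() {
        try {
            rabbitTemplate.execute(channel -> channel.isOpen());
            return Health.up().build();
        } catch (Exception e) {
            return Health.down(e).build();
        }
    }
}
```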
Application Metrics
Spring Boot Actuator, together with Micrometer (micrometer.io), can be used to expose metrics which can be viewed through AWS CloudWatch Metrics.
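As a sketch (metric names are illustrative; publishing to CloudWatch assumes the micrometer-registry-cloudwatch2 registry is on the classpath), custom counters could look like this:

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.stereotype.Component;

@Component
public class MessageMetrics {

    private final Counter processed;
    private final Counter failed;

    public MessageMetrics(MeterRegistry registry) {
        // These counters surface as custom metrics in whichever backend Micrometer
        // is configured with, e.g. CloudWatch Metrics.
        this.processed = Counter.builder("esr.messages.processed").register(registry);
        this.failed = Counter.builder("esr.messages.failed").register(registry);
    }

    public void markProcessed() {
        processed.increment();
    }

    public void markFailed() {
        failed.increment();
    }
}
```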
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213