2023-10-23 GMC sync job stalled for several hours
Date | Oct 23, 2023 |
Authors | @Cai Willis @Steven Howard |
Status | Done |
Summary | |
Impact | Doctors were not updated and reval was not available for some regions for the whole day. The updates happened sequentially, as below. |
Non-technical Description
The overnight job to synchronize connected doctors from the GMC into our system normally runs from midnight until the early morning. Today it did not start until about 10am, which means the various doctor lists in the revalidation app were not updated.
This occurred because our messaging system failed, so the relevant part of our system did not receive the message to start the synchronization until far later than it should have.
Trigger
A large volume of messages in RabbitMQ caused an out-of-memory error, preventing publishers from posting messages to their queues.
Detection
Some DLQ monitoring alerts fired in the morning, and a developer spotted the lack of doctors in the revalidation application.
Resolution
The sync process had already started up again on its own, so there was no need to actively change anything. However, one obsolete queue was purged of a high volume of messages, and one service had to be restarted because it had lost its connection to RabbitMQ.
Timeline
All times in BST unless indicated
Oct 22, 2023: 23:56 "Restarting Consumer" logs appear from recommendation service
Oct 22, 2023: 23:56 "Suspended all listeners and will no longer accept client connections" logs appear from AWS RabbitMQ broker
Oct 23, 2023: 00:01 "memory resource limit alarm set on node" and "Publishers will be blocked until this alarm clears" logs appear from AWS RabbitMQ broker
Oct 23, 2023: 00:05 "Start message has been sent to start gmc sync" log appears in recommendation service, but this message is not delivered until much later
Oct 23, 2023: 09:44-ish Messages shovelled from ESR DLQ, possibly freeing up a little space?
Oct 23, 2023: 09:52 GMC Client Service receives a message to start the doctor sync, and the sync begins.
Oct 23, 2023: 11:00-ish reval.queue.gmcsync.connection purged of 1.6 million messages
Oct 24, 2023: Functionality restored (overnight sync job ran successfully)
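As a worked check of how long the start message was held up, the gap between the 00:05 "sent" log and the 09:52 "received" entry in the timeline can be computed as below. This is only an illustrative sketch: the function name `delivery_delay` is hypothetical, and both timestamps are treated as naive local (BST) times on the same day, as recorded in the timeline.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M"


def delivery_delay(sent: str, received: str) -> timedelta:
    """Return the time between a message being sent and being received.

    Both arguments are naive local timestamps in the same time zone.
    """
    return datetime.strptime(received, FMT) - datetime.strptime(sent, FMT)


# Timestamps taken from the timeline (BST).
delay = delivery_delay("2023-10-23 00:05", "2023-10-23 09:52")
hours, remainder = divmod(int(delay.total_seconds()), 3600)
print(f"Start message delayed by {hours}h {remainder // 60}m")
```

With the timeline values, the start message was delayed by 9 hours 47 minutes.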
Root Cause(s)
5 Whys
Why didn’t the GMC Overnight Sync Job Start on time? - Because RabbitMQ ran out of memory
Why did RabbitMQ run out of memory? - Because of millions of messages in the following queues: esr.queue.audit.neo, tis.queue.cdc.created, reval.queue.gmcsync.connection (is this the real reason?)
Why were there millions of messages in these queues? - NeoAudit containers have a habit of dropping connection to RabbitMQ, CDC rabbit router was unable to consume/publish messages from the queue
Why do the containers drop connection to RabbitMQ? - This requires an investigation. It is a fairly common issue where RabbitMQ and ECS drop connections without an obvious cause
Causes
RabbitMQ ran out of memory
esr.queue.audit.neo and tis.queue.cdc.created had millions of messages (recorded at about 10:30 am) (Active queues)
reval.queue.gmcsync.connection had 1.6 million messages (Obsolete unused queue)
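A backlog like the ones above could be caught earlier by periodically checking queue depths. The sketch below is a hedged illustration, not something the team has in place: it assumes queue data shaped like the response of the RabbitMQ management API's `GET /api/queues` endpoint (which includes `name`, `messages` and `consumers` fields). The function name, the threshold, the `example.queue.healthy` entry and the zero consumer count for the obsolete queue are all assumptions made for illustration.

```python
def find_backlogged_queues(queues, max_messages=100_000):
    """Return (name, messages, consumers) for queues above the threshold.

    `queues` is a list of dicts shaped like the RabbitMQ management API
    `GET /api/queues` response. Results are sorted largest backlog first.
    """
    flagged = [
        (q["name"], q.get("messages", 0), q.get("consumers", 0))
        for q in queues
        if q.get("messages", 0) > max_messages
    ]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Illustrative sample only: the message count for the obsolete queue comes
# from this report; the consumer counts and second queue are assumed.
sample = [
    {"name": "reval.queue.gmcsync.connection", "messages": 1_600_000, "consumers": 0},
    {"name": "example.queue.healthy", "messages": 12, "consumers": 2},
]

for name, messages, consumers in find_backlogged_queues(sample):
    print(f"{name}: {messages} messages, {consumers} consumers")
```

A queue that is both large and has no consumers is a strong hint that it is obsolete but still being published to.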
Action Items
Action Items | Owner | Status |
---|---|---|
Unbind and delete the obsolete queue (reval.queue.gmcsync.connection) | @Cai Willis | DONE |
Investigate why NeoAudit lost connection to RabbitMQ (Ticket) | @Cai Willis | Ticket Created: https://hee-tis.atlassian.net/browse/TIS21-5282 |
Set up alerts for RabbitMQ low memory (Ticket) | @Cai Willis | Ticket Created: https://hee-tis.atlassian.net/browse/TIS21-5283 |
(Nice to have) Broader RabbitMQ health check alerting | | |
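For the low-memory alerting action, one possible shape of the check is sketched below. It is an assumption-laden illustration rather than the agreed design: it takes node data shaped like the RabbitMQ management API's `GET /api/nodes` response (which reports `mem_used`, `mem_limit` and `mem_alarm` per node) and warns before the memory alarm blocks publishers. The function name `memory_warnings`, the 80% threshold and the sample node names are hypothetical.

```python
def memory_warnings(nodes, warn_ratio=0.8):
    """Return a warning string for each node that is near or over its memory limit.

    `nodes` is a list of dicts shaped like the RabbitMQ management API
    `GET /api/nodes` response.
    """
    warnings = []
    for node in nodes:
        name = node.get("name", "unknown")
        if node.get("mem_alarm"):
            # The broker has already raised its memory alarm.
            warnings.append(f"{name}: memory alarm active, publishers are blocked")
            continue
        limit = node.get("mem_limit") or 0
        used = node.get("mem_used", 0)
        if limit and used / limit >= warn_ratio:
            warnings.append(f"{name}: memory at {used / limit:.0%} of limit")
    return warnings


# Hypothetical sample values for illustration only.
sample_nodes = [
    {"name": "rabbit@node-a", "mem_used": 900, "mem_limit": 1000, "mem_alarm": False},
    {"name": "rabbit@node-b", "mem_used": 1000, "mem_limit": 1000, "mem_alarm": True},
    {"name": "rabbit@node-c", "mem_used": 100, "mem_limit": 1000, "mem_alarm": False},
]

for warning in memory_warnings(sample_nodes):
    print(warning)
```

Warning at a ratio below the limit gives time to purge or drain queues before publishers are blocked, which is the failure seen in this incident.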
Lessons Learned
Don’t leave obsolete queues lying around - especially when they’re still being published to
There’s no such thing as too much alerting
Related pages
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213