Date | |
Authors | |
Status | Documenting |
Summary | |
Impact | Doctors were not updated and reval was not available for some regions for the whole day. The updates happened sequentially, as below. |
...
Why didn’t the GMC Overnight Sync Job start on time? - Because RabbitMQ ran out of memory.
Why did RabbitMQ run out of memory? - Because of millions of messages in the following queues: esr.queue.audit.neo, tis.queue.cdc.created, reval.queue.gmcsync.connection (is this the real reason?)
Why were there millions of messages in these queues? - NeoAudit containers have a habit of dropping their connection to RabbitMQ, and the CDC rabbit router was unable to consume/publish messages from the queue.
Why do the containers drop their connection to RabbitMQ? - This requires investigation; it is not uncommon for RabbitMQ and ECS to drop connections without an obvious cause.
Causes
RabbitMQ ran out of memory
esr.queue.audit.neo and tis.queue.cdc.created had millions of messages (recorded at about 10:30 am) (active queues)
reval.queue.gmcsync.connection had 1.6 million messages (obsolete, unused queue)
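A backlog like the one described in these causes could be spotted with a small check over queue statistics. The sketch below is illustrative only and is not something the team runs today: it assumes the input is the parsed JSON list returned by the RabbitMQ management plugin's `/api/queues` endpoint (which reports `name`, `messages` and `consumers` per queue), and the `max_messages` threshold is a hypothetical value that would need tuning.

```python
from typing import Dict, Iterable, List


def find_backlogged_queues(queues: Iterable[Dict], max_messages: int = 100_000) -> List[Dict]:
    """Return queues whose message count exceeds max_messages, deepest first.

    Each result also records whether the queue has no consumers at all,
    which is the signature of an obsolete queue that is still published to.
    """
    flagged = []
    for queue in queues:
        depth = queue.get("messages", 0)
        if depth > max_messages:
            flagged.append({
                "name": queue["name"],
                "messages": depth,
                "unconsumed": queue.get("consumers", 0) == 0,
            })
    # Largest backlog first, so the worst offender is at the top of the report.
    return sorted(flagged, key=lambda item: item["messages"], reverse=True)
```

A queue flagged with `unconsumed` set to true is a candidate for unbinding and deletion, while a flagged queue that still has consumers points at a slow or disconnected consumer instead.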
...
Action Items | Owner | |
---|---|---|
Unbind and delete the obsolete reval.queue.gmcsync.connection queue | | |
Investigate why NeoAudit habitually loses its connection to RabbitMQ | | |
Set up alerts for RabbitMQ low memory (Ticket) | | |
(Nice to have) Broader RabbitMQ health check alerting | | |
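As a starting point for the low-memory alert, the following is a minimal sketch and not the agreed implementation. It assumes the RabbitMQ management plugin is enabled and that its `/api/nodes` endpoint is reachable; that endpoint reports `mem_used`, `mem_limit` and `mem_alarm` per node. The base URL, the credentials and the `warn_ratio` threshold are hypothetical placeholders, and the alert delivery itself (e.g. to a chat channel) is left out.

```python
import base64
import json
import urllib.request
from typing import Dict, Iterable, List, Tuple


def fetch_nodes(base_url: str, user: str, password: str) -> List[Dict]:
    """Fetch node statistics from the management API (GET /api/nodes)."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(
        f"{base_url}/api/nodes",
        headers={"Authorization": f"Basic {token}"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def nodes_low_on_memory(nodes: Iterable[Dict], warn_ratio: float = 0.8) -> List[Tuple[str, float]]:
    """Return (node name, memory usage ratio) for nodes that need attention.

    A node is reported if RabbitMQ has already raised its memory alarm, or if
    used memory has reached warn_ratio of the configured limit (assumed threshold).
    """
    alerts = []
    for node in nodes:
        limit = node.get("mem_limit") or 0
        used = node.get("mem_used", 0)
        ratio = used / limit if limit else 0.0
        if node.get("mem_alarm") or ratio >= warn_ratio:
            alerts.append((node["name"], round(ratio, 2)))
    return alerts
```

Warning at a ratio below the limit, rather than only on `mem_alarm`, leaves time to react before publishers are blocked and scheduled jobs such as the overnight sync are delayed.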
...
Don’t leave obsolete queues lying around - especially when they’re still being published to
There’s no such thing as too much alerting