Date | |
---|---|
Authors | |
Status | Done |
Summary | |
Impact | Doctors were not updated and reval was not available for some regions for the whole day. The updates happened sequentially, as below. |
Non-technical Description
...
23:56: "Restarting Consumer" logs appear from recommendation service
23:56: "Suspended all listeners and will no longer accept client connections" logs appear from AWS RabbitMQ broker
00:01: "memory resource limit alarm set on node" and "Publishers will be blocked until this alarm clears" logs appear from AWS RabbitMQ broker
00:05: "Start message has been sent to start gmc sync" log appears in recommendation service, but this message is not delivered until much later
09:44-ish: Messages shovelled from ESR DLQ, possibly freeing up a little space?
09:52: GMC Client Service receives a message to start the doctor sync, and the sync begins
11:00-ish: reval.queue.gmcsync.connection purged of 1.6 million messages
Functionality restored (overnight sync job ran successfully)
Root Cause(s)
5 Whys
Why didn’t the GMC Overnight Sync Job Start on time? - Because RabbitMQ ran out of memory
Why did RabbitMQ run out of memory? - Because of millions of messages in the following queues: esr.queue.audit.neo, tis.queue.cdc.created, reval.queue.gmcsync.connection (is this the real reason?)
Why were there millions of messages in these queues? - NeoAudit containers have a habit of dropping their connection to RabbitMQ, and the CDC rabbit router was unable to consume/publish messages from the queue
Why do the containers drop connection to RabbitMQ? - This requires investigation; it is not uncommon for RabbitMQ and ECS to drop connections without an obvious cause
Causes
RabbitMQ ran out of memory
esr.queue.audit.neo and tis.queue.cdc.created had millions of messages (recorded at about 10:30 am) (Active queues)
reval.queue.gmcsync.connection had 1.6 million messages (Obsolete unused queue)
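The backlog described above could in principle be spotted by polling queue depths. The following is a minimal sketch, assuming the RabbitMQ management HTTP API is reachable from wherever the check runs; its `GET /api/queues` endpoint returns one JSON object per queue, including `name` and `messages` fields. The threshold, the function names, the `base_url`/`auth_header` parameters, and the sample figures (other than the 1.6 million recorded above) are illustrative assumptions, not details from this incident.

```python
import json
import urllib.request

# Hypothetical threshold; tune it to what the broker's memory can tolerate.
BACKLOG_THRESHOLD = 100_000


def fetch_queues(base_url, auth_header):
    """Fetch queue stats from the management API (base_url and auth_header are assumptions)."""
    request = urllib.request.Request(
        f"{base_url}/api/queues", headers={"Authorization": auth_header}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def find_backlogged_queues(queues, threshold=BACKLOG_THRESHOLD):
    """Return (name, depth) pairs for queues deeper than the threshold, deepest first."""
    backlogged = [
        (q["name"], q.get("messages", 0))
        for q in queues
        if q.get("messages", 0) > threshold
    ]
    return sorted(backlogged, key=lambda pair: pair[1], reverse=True)


# Illustrative payload shaped like the API response; not real data from the broker.
sample = [
    {"name": "reval.queue.gmcsync.connection", "messages": 1_600_000},
    {"name": "esr.queue.audit.neo", "messages": 2_000_000},
    {"name": "some.healthy.queue", "messages": 12},
]
for name, depth in find_backlogged_queues(sample):
    print(f"{name}: {depth} messages")
```

Keeping the filtering logic separate from the HTTP call means the check can be exercised against canned payloads without a live broker.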
Action Items
Action Items | Owner |
---|---|
Unbind and delete the obsolete reval.queue.gmcsync.connection queue | DONE |
Investigate why NeoAudit lost connection to RabbitMQ (Ticket) | Ticket Created: |
Set up alerts for RabbitMQ low memory (Ticket) | Ticket Created: |
(Nice to have) Broader RabbitMQ health check alerting | |
...
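As a rough illustration of the low-memory alerting action, the sketch below evaluates node statistics of the shape returned by the RabbitMQ management API's `GET /api/nodes` endpoint, which reports `mem_used`, `mem_limit` and `mem_alarm` per node. The 80% warning ratio, the function name `memory_alerts`, the alert wording and the sample node names are assumptions made for illustration; how the alerts are delivered is left out.

```python
# Hypothetical warning level: alert once memory use passes 80% of the limit.
MEMORY_WARN_RATIO = 0.8


def memory_alerts(nodes, warn_ratio=MEMORY_WARN_RATIO):
    """Return alert strings for nodes in memory alarm or close to their memory limit."""
    alerts = []
    for node in nodes:
        name = node.get("name", "unknown")
        used = node.get("mem_used", 0)
        limit = node.get("mem_limit", 0)
        if node.get("mem_alarm"):
            # The broker has already raised its alarm, so publishers are being blocked.
            alerts.append(f"CRITICAL {name}: memory alarm set, publishers are blocked")
        elif limit and used / limit >= warn_ratio:
            alerts.append(f"WARNING {name}: memory at {used / limit:.0%} of limit")
    return alerts


# Illustrative node stats; the names and numbers are made up.
sample_nodes = [
    {"name": "rabbit@node-a", "mem_used": 900, "mem_limit": 1000, "mem_alarm": False},
    {"name": "rabbit@node-b", "mem_used": 1000, "mem_limit": 1000, "mem_alarm": True},
    {"name": "rabbit@node-c", "mem_used": 100, "mem_limit": 1000, "mem_alarm": False},
]
for alert in memory_alerts(sample_nodes):
    print(alert)
```

Warning before the alarm threshold is reached matters here because, once the alarm is set, publishers are blocked and messages such as the sync start message are held up.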
Lessons Learned
Don’t leave obsolete queues lying around - especially when they’re still being published to
There’s no such thing as too much alerting
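One way to catch obsolete queues that are still being published to is to look for queues with no consumers but a non-zero publish rate. This is a sketch only, assuming queue objects shaped like those returned by the RabbitMQ management API's `GET /api/queues` endpoint, where `consumers` is the consumer count and `message_stats.publish_details.rate` is the recent publish rate (`message_stats` can be absent for idle queues). The function name and sample data are hypothetical.

```python
def find_unconsumed_queues(queues):
    """Return names of queues that have no consumers but are still receiving publishes."""
    orphaned = []
    for q in queues:
        # message_stats may be missing entirely for queues with no recent activity.
        publish_rate = (
            q.get("message_stats", {}).get("publish_details", {}).get("rate", 0.0)
        )
        if q.get("consumers", 0) == 0 and publish_rate > 0:
            orphaned.append(q["name"])
    return orphaned


# Illustrative payload; the rates and the second and third queue names are made up.
sample_queues = [
    {
        "name": "reval.queue.gmcsync.connection",
        "consumers": 0,
        "message_stats": {"publish_details": {"rate": 4.2}},
    },
    {
        "name": "some.active.queue",
        "consumers": 2,
        "message_stats": {"publish_details": {"rate": 10.0}},
    },
    {"name": "some.idle.queue", "consumers": 0},
]
print(find_unconsumed_queues(sample_queues))
```

A queue whose consumer has only briefly disconnected would also match, so a check like this is better treated as a prompt for review than as proof that a queue is obsolete.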