Date | |
Authors | |
Status | Documenting |
Summary | |
Impact | Doctors were not updated and reval was not available for some regions for the whole day. The updates happened sequentially, as below. |
...
23:56 - "Restarting Consumer" logs appear from the recommendation service.
23:56 - "Suspended all listeners and will no longer accept client connections" logs appear from the AWS RabbitMQ broker.
00:01 - "memory resource limit alarm set on node" and "Publishers will be blocked until this alarm clears" logs appear from the AWS RabbitMQ broker.
00:05 - "Start message has been sent to start gmc sync" log appears in the recommendation service, but this message is not delivered until much later.
09:44-ish - Messages shovelled from the ESR DLQ, possibly freeing up a little space?
09:52 - GMC Client Service receives a message to start the doctor sync, and the sync begins.
11:00-ish - reval.queue.gmcsync.connection purged of 1.6 million messages.
Root Cause(s)
5 Whys
Why didn’t the GMC Overnight Sync Job Start on time? - Because RabbitMQ ran out of memory
Why did RabbitMQ run out of memory? - Because of millions of messages in the following queues: esr.queue.audit.neo, tis.queue.cdc.created, reval.queue.gmcsync.connection
Why were there millions of messages in these queues? - Why so many? Is this normal? Why was it taking so long to process them?
Why
Why
Causes
RabbitMQ ran out of memory
esr.queue.audit.neo and tis.queue.cdc.created had millions of messages (recorded at about 10:30 am) (Active queues)
reval.queue.gmcsync.connection had 1.6 million messages (Obsolete unused queue)
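
The queue depths listed above were only recorded by hand at about 10:30 am, hours after the memory alarm. As a rough sketch of how a backlog like this might be spotted earlier, the snippet below filters the JSON returned by the RabbitMQ management API (`GET /api/queues`) for queues whose `messages` count exceeds a threshold. The base URL, the credentials and the threshold are assumptions for illustration, and the function names are hypothetical, not part of any existing service.

```python
import json
import urllib.request
from base64 import b64encode


def queues_over_threshold(queues, threshold):
    """Return (name, messages) pairs for queues deeper than threshold, deepest first."""
    deep = [
        (q["name"], q.get("messages", 0))
        for q in queues
        if q.get("messages", 0) > threshold
    ]
    return sorted(deep, key=lambda item: item[1], reverse=True)


def fetch_queues(base_url, user, password):
    """Fetch queue metadata from the management API (GET /api/queues).

    base_url, user and password are placeholders supplied by the caller.
    """
    request = urllib.request.Request(f"{base_url.rstrip('/')}/api/queues")
    token = b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


# Illustrative input shaped like the API response. The first count is the
# figure from this report; the second queue is hypothetical.
sample = [
    {"name": "reval.queue.gmcsync.connection", "messages": 1_600_000},
    {"name": "example.queue.healthy", "messages": 42},
]
flagged = queues_over_threshold(sample, threshold=100_000)
```

A check like this could run on a schedule and raise an alert well before the broker reaches its memory limit; a sensible threshold would depend on the normal depth of each queue, which this report does not yet establish.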
...
Action Items | Owner | |
---|---|---|
Unbind and delete the queue | | |
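
One possible way to carry out the delete is through the RabbitMQ management HTTP API, where `DELETE /api/queues/{vhost}/{name}` removes a queue (deleting a queue also removes its bindings). The sketch below only builds the request. The action item does not name the queue, so using reval.queue.gmcsync.connection here is an assumption based on it being described as obsolete and unused; the broker host and the helper name are hypothetical.

```python
import urllib.request
from urllib.parse import quote


def build_queue_delete_request(base_url, vhost, queue, if_unused=True):
    """Build (but do not send) a management API request that deletes a queue.

    The vhost and queue name are percent-encoded, so the default vhost "/"
    becomes %2F. With if_unused=True the broker refuses the delete if the
    queue still has consumers.
    """
    path = f"/api/queues/{quote(vhost, safe='')}/{quote(queue, safe='')}"
    if if_unused:
        path += "?if-unused=true"
    return urllib.request.Request(base_url.rstrip("/") + path, method="DELETE")


# Hypothetical broker address; the queue name is an assumption (see above).
request = build_queue_delete_request(
    "https://broker.example", "/", "reval.queue.gmcsync.connection"
)
# Sending it would additionally need an Authorization header and
# urllib.request.urlopen(request); that step is left out deliberately.
```

The `if-unused=true` guard is a cautious choice here: if something is still consuming from the queue, the delete fails rather than silently removing a queue that turned out to be in use.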
...