Date |
Authors |
Status | Documenting | ||||||||
Summary |
Impact | Doctors were not updated and reval was not available for some regions for the whole day. The updates happened sequentially, as below. |
...
Time | Event
23:56 | "Restarting Consumer" logs appear from the recommendation service
23:56 | "Suspended all listeners and will no longer accept client connections" logs appear from the AWS RabbitMQ broker
00:01 | "memory resource limit alarm set on node" and "Publishers will be blocked until this alarm clears" logs appear from the AWS RabbitMQ broker (see the memory-alarm sketch below)
00:05 | "Start message has been sent to start gmc sync" log appears in the recommendation service, but this message is not delivered until much later
09:44-ish | Messages shovelled from the ESR DLQ, possibly freeing up a little space?
09:52 | GMC Client Service receives a message to start the doctor sync, and the sync begins.
11:00-ish | reval.queue.gmcsync.connection purged of 1.6 million messages: functionality restored (the overnight sync job ran successfully); see the purge sketch below
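The 00:01 broker entries are RabbitMQ's memory high-watermark alarm: once a node's memory use crosses its configured limit, the broker blocks publishing connections until usage drops back below the limit, which is consistent with the 00:05 "start gmc sync" message not being delivered until much later. Below is a minimal sketch of how the alarm state could be checked through the RabbitMQ management HTTP API; the broker URL, credentials and function name are illustrative assumptions, not values from this incident.

```python
# Minimal sketch (assumed values): query the RabbitMQ management HTTP API for
# each node's memory usage and whether its memory alarm is set.
# BROKER_URL and AUTH are placeholders, not values from this incident.
import requests

BROKER_URL = "https://<broker-host>"        # management API base URL (assumption)
AUTH = ("monitoring-user", "change-me")     # read-only management credentials (assumption)


def memory_alarm_status():
    """Return (node name, bytes used, byte limit, alarm flag) for each broker node."""
    resp = requests.get(f"{BROKER_URL}/api/nodes", auth=AUTH, timeout=10)
    resp.raise_for_status()
    return [
        (node["name"], node.get("mem_used"), node.get("mem_limit"), node.get("mem_alarm"))
        for node in resp.json()
    ]


if __name__ == "__main__":
    for name, used, limit, alarm in memory_alarm_status():
        print(f"{name}: {used}/{limit} bytes used, mem_alarm={alarm}")
```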
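The 11:00-ish recovery step purged reval.queue.gmcsync.connection. One way to do this against a managed broker is the management API's purge-contents endpoint, sketched below under the same placeholder assumptions (URL, credentials, vhost). Note that a purge discards ready messages only; anything currently unacknowledged by a consumer stays on the queue.

```python
# Minimal sketch (assumed values): purge all ready messages from a queue via the
# RabbitMQ management HTTP API, as was done for reval.queue.gmcsync.connection.
# BROKER_URL, AUTH and VHOST are placeholders, not values from this incident.
from urllib.parse import quote

import requests

BROKER_URL = "https://<broker-host>"        # management API base URL (assumption)
AUTH = ("admin-user", "change-me")          # credentials with write access (assumption)
VHOST = "/"                                 # assumed default vhost


def purge_queue(queue_name: str) -> None:
    """Delete every ready message on the queue; unacknowledged messages are untouched."""
    url = (
        f"{BROKER_URL}/api/queues/"
        f"{quote(VHOST, safe='')}/{quote(queue_name, safe='')}/contents"
    )
    resp = requests.delete(url, auth=AUTH, timeout=30)
    resp.raise_for_status()


# Example usage:
# purge_queue("reval.queue.gmcsync.connection")
```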
Root Cause(s)
5 Whys
Why didn’t the GMC overnight sync job start on time? - Because RabbitMQ ran out of memory.
Why did RabbitMQ run out of memory? - Because of millions of messages in the following queues: esr.queue.audit.neo, tis.queue.cdc.created and reval.queue.gmcsync.connection (is this the real reason? see the queue-depth sketch after this list).
Why were there millions of messages in these queues? - Why so many? Is this normal? Why was it taking so long to process them?
Why
Why
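To answer "why so many" and "is this normal", the relevant numbers are the backlog and consumer count on each of the queues named above. A minimal sketch, using the same management HTTP API and the same placeholder broker URL and credentials as the sketches above:

```python
# Minimal sketch (assumed values): list message backlog and consumer count for the
# three queues named in the 5 Whys, via the RabbitMQ management HTTP API.
# BROKER_URL and AUTH are placeholders, not values from this incident.
import requests

BROKER_URL = "https://<broker-host>"        # management API base URL (assumption)
AUTH = ("monitoring-user", "change-me")     # read-only management credentials (assumption)

QUEUES_OF_INTEREST = {
    "esr.queue.audit.neo",
    "tis.queue.cdc.created",
    "reval.queue.gmcsync.connection",
}


def queue_backlogs():
    """Return (queue name, message count, consumer count) for the queues of interest."""
    resp = requests.get(f"{BROKER_URL}/api/queues", auth=AUTH, timeout=10)
    resp.raise_for_status()
    return [
        (q["name"], q.get("messages", 0), q.get("consumers", 0))
        for q in resp.json()
        if q["name"] in QUEUES_OF_INTEREST
    ]


if __name__ == "__main__":
    for name, messages, consumers in queue_backlogs():
        print(f"{name}: {messages} messages, {consumers} consumers")
```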
...