Date	23 Oct 2023
Authors	Cai Willis, Steven Howard
Status
Summary
Impact

Non-technical Description

The overnight job that synchronizes connected doctors from the GMC into our system normally runs from midnight until the early morning. Today it did not run until about 10am, so the various doctor lists in the revalidation app were not updated.

This occurred because our messaging system failed, so the relevant part of our system did not receive the message to start the synchronization until far later than it should have.

Trigger

A large volume of messages in RabbitMQ caused an out-of-memory error, which prevented publishers from posting messages to their queues.
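A backlog like the one described above can in principle be spotted before the broker hits its memory limit. The following is a minimal sketch, not the monitoring actually in place: it assumes queue statistics shaped like the entries returned by the RabbitMQ management API's queue listing (`name`, `messages`, `consumers`). The function name, the threshold, the queue names and the message counts are all hypothetical and chosen only for illustration.

```python
def find_backlogged_queues(queues, max_messages=100_000):
    """Return (name, messages) pairs for queues above the threshold, largest first.

    `queues` is an iterable of dicts shaped like RabbitMQ management API
    queue entries. `max_messages` is an arbitrary illustrative threshold.
    """
    flagged = [
        (q["name"], q.get("messages", 0))
        for q in queues
        if q.get("messages", 0) > max_messages
    ]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical sample data: queue names and counts are made up.
sample = [
    {"name": "example.queue.unused", "messages": 2_500_000, "consumers": 0},
    {"name": "example.queue.active", "messages": 12, "consumers": 1},
]

for name, count in find_backlogged_queues(sample):
    print(f"{name}: {count} messages")
```

A check of this kind could run on a schedule and raise an alert, in the same way as the existing DLQ monitoring, so that a growing queue with no consumers is noticed well before publishers are blocked.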

Detection

Some DLQ monitoring alerts were flagged in the morning, and a developer noticed that doctors were missing from the revalidation application.

Resolution

The sync process had already started up again on its own, so there was no need to actively change anything to recover it. However, one obsolete queue was purged of a high volume of messages, and one service had to be restarted because it had lost its connection to RabbitMQ.

Timeline

All times in BST unless indicated

22 Oct 2023: 23:56 "Restarting Consumer" logs appear from recommendation service
22 Oct 2023: 23:56 "Suspended all listeners and will no longer accept client connections" logs appear from AWS RabbitMQ broker
23 Oct 2023: 00:01 "memory resource limit alarm set on node" and "Publishers will be blocked until this alarm clears" logs appear from AWS RabbitMQ broker
23 Oct 2023: 00:05 "Start message has been sent to start gmc sync" log appears in recommendation service, but this message is not delivered until much later
23 Oct 2023: 09:52 GMC Client Service receives a message to start the doctor sync, and the sync begins.

Root Cause(s)

5 Whys

Why didn’t the GMC overnight sync job start on time? - Because RabbitMQ ran out of memory
Why did RabbitMQ run out of memory? -
Why
Why
Why

Causes

RabbitMQ ran out of memory

Action Items

Action Items	Owner
Unbind and delete the queue `reval.queue.gmcsync.connection` (It’s not currently used by any application)
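
The action item above could be scripted roughly as follows. This is only a sketch under stated assumptions: it expects a channel object exposing `queue_unbind` and `queue_delete` in the style of the Python pika client, and the exchange name and routing key are hypothetical placeholders because the report does not record the real binding. `RecordingChannel` is a stand-in so the example runs without a broker.

```python
def retire_queue(channel, queue, exchange, routing_key):
    """Unbind a queue from an exchange, then delete the queue."""
    channel.queue_unbind(queue=queue, exchange=exchange, routing_key=routing_key)
    channel.queue_delete(queue=queue)


class RecordingChannel:
    """Stand-in for a real AMQP channel: records calls instead of sending them."""

    def __init__(self):
        self.calls = []

    def queue_unbind(self, queue, exchange=None, routing_key=None):
        self.calls.append(("unbind", queue, exchange, routing_key))

    def queue_delete(self, queue):
        self.calls.append(("delete", queue))


channel = RecordingChannel()
retire_queue(
    channel,
    queue="reval.queue.gmcsync.connection",
    exchange="example.exchange",        # hypothetical: real exchange not recorded here
    routing_key="example.routing.key",  # hypothetical: real routing key not recorded here
)
print(channel.calls)
```

Deleting a queue in RabbitMQ also removes its bindings, so the explicit unbind mainly serves to stop new messages arriving before the delete. If a publisher or a service's startup configuration redeclares the queue and binding, those declarations would need removing as well, otherwise the queue will reappear.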

2023-10-23 GMC sync job stalled for several hours