Date

23 Oct 2023

Authors

Status

Documenting

Summary

Jira: TIS21-5261

Impact

Doctors were not updated, and reval was unavailable for some regions for the whole day. The updates then happened sequentially, in the order below.
1-1RSG4X0 Yorkshire and The Humber
1-1RSSPZ7 East Midlands
1-1RSSQ05 East of England
1-1RSSQ1B North East
1-1RSSQ2H North West 1200 confirmed docs available
1-1RSSQ5L South London
1-1RSSQ6R Thames Valley
1-1RUZUSF Wessex
1-1RUZUVB South West 1307 reported docs still missing
1-1RUZUYF West Midlands
1-1RUZV1D Kent, Surrey and Sussex
1-1RUZV4H North Central East London
1-1RUZV6H North West London

...

22 Oct 2023: 23:56 "Restarting Consumer" logs appear from the recommendation service
22 Oct 2023: 23:56 "Suspended all listeners and will no longer accept client connections" logs appear from the AWS RabbitMQ broker
23 Oct 2023: 00:01 "memory resource limit alarm set on node" and "Publishers will be blocked until this alarm clears" logs appear from the AWS RabbitMQ broker
23 Oct 2023: 00:05 "Start message has been sent to start gmc sync" log appears in the recommendation service, but this message is not delivered until much later
23 Oct 2023: 09:44-ish Messages shovelled from the ESR DLQ, possibly freeing up a little space?
23 Oct 2023: 09:52 GMC Client Service receives a message to start the doctor sync, and the sync begins.
23 Oct 2023: 11:00-ish reval.queue.gmcsync.connection purged of 1.6 million messages

Root Cause(s)

5 Whys

Why didn’t the GMC overnight sync job start on time? - Because RabbitMQ ran out of memory
Why did RabbitMQ run out of memory? - Because of millions of messages in the following queues: esr.queue.audit.neo, tis.queue.cdc.created, reval.queue.gmcsync.connection
Why were there millions of messages in these queues? - Why so many? Is this normal? Why was it taking so long to process them?
Why
Why
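
The memory alarm behind the first two whys is also visible through the RabbitMQ management HTTP API, not just in the broker logs. The sketch below is illustrative only and is not something the team is known to run. It assumes the node list has already been fetched from `GET /api/nodes` and parsed from JSON, and it relies on the `mem_alarm`, `mem_used` and `mem_limit` fields of each node. The helper name `nodes_in_memory_alarm` and the sample values are hypothetical.

```python
def nodes_in_memory_alarm(nodes):
    """Return (node name, fraction of memory limit used) for nodes in memory alarm.

    `nodes` is the parsed JSON list returned by GET /api/nodes.
    The fraction is None when the node reports no memory limit.
    """
    alarmed = []
    for node in nodes:
        if node.get("mem_alarm"):
            limit = node.get("mem_limit") or 0
            used = node.get("mem_used", 0)
            ratio = used / limit if limit else None
            alarmed.append((node["name"], ratio))
    return alarmed


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    sample = [
        {"name": "rabbit@node-a", "mem_alarm": True, "mem_used": 900, "mem_limit": 1000},
        {"name": "rabbit@node-b", "mem_alarm": False, "mem_used": 100, "mem_limit": 1000},
    ]
    for name, ratio in nodes_in_memory_alarm(sample):
        print(f"{name} is in memory alarm; publishers will be blocked (usage ratio: {ratio})")
```

Alerting on this flag, or on the usage ratio approaching 1, would surface the problem when the alarm is raised rather than when the sync is noticed to be missing.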

Causes

RabbitMQ ran out of memory
esr.queue.audit.neo and tis.queue.cdc.created had millions of messages (recorded at about 10:30 am) (Active queues)
reval.queue.gmcsync.connection had 1.6 million messages (Obsolete unused queue)
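
A backlog like the ones listed above could be flagged before it exhausts broker memory by polling queue depths. The following is a minimal sketch, assuming the RabbitMQ management plugin is reachable over HTTP with basic auth. It uses only the `name` and `messages` fields that `GET /api/queues` returns per queue. The helper names, the 100,000-message threshold and the `example.queue.healthy` entry are hypothetical; the 1.6 million figure is the one recorded in this log.

```python
import base64
import json
import urllib.request


def fetch_queues(base_url, user, password):
    """Fetch per-queue stats from the management API (GET /api/queues)."""
    req = urllib.request.Request(f"{base_url.rstrip('/')}/api/queues")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def find_deep_queues(queues, threshold=100_000):
    """Return (name, messages) for queues at or above `threshold`, deepest first."""
    deep = [
        (q["name"], q.get("messages", 0))
        for q in queues
        if q.get("messages", 0) >= threshold
    ]
    return sorted(deep, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Offline sample instead of a live fetch_queues(...) call.
    sample = [
        {"name": "reval.queue.gmcsync.connection", "messages": 1_600_000},
        {"name": "example.queue.healthy", "messages": 12},  # hypothetical queue
    ]
    for name, depth in find_deep_queues(sample):
        print(f"{name}: {depth} messages")
```

Keeping the filtering separate from the HTTP call means the threshold logic can be exercised without a broker, and the same function can be pointed at a live `fetch_queues` result.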

...

Action Items	Owner
Unbind and delete the queue `reval.queue.gmcsync.connection` (It’s not currently used by any application)	Cai Willis
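
One possible way to carry out the queue deletion above is through the management HTTP API, which exposes `DELETE /api/queues/{vhost}/{name}`. RabbitMQ removes a queue's bindings when the queue itself is deleted, so a separate unbind step should not be needed on this route. This is a sketch under assumptions, not the procedure the team used: the broker URL, credentials and the default vhost `/` are placeholders, and the helper `build_delete_queue_request` is hypothetical. It only builds the request so it can be reviewed before anything is sent.

```python
import base64
import urllib.parse
import urllib.request


def build_delete_queue_request(base_url, vhost, queue, user, password):
    """Build, but do not send, a DELETE request for a queue via the management API."""
    path = "/api/queues/{}/{}".format(
        urllib.parse.quote(vhost, safe=""),  # the default vhost "/" becomes "%2F"
        urllib.parse.quote(queue, safe=""),
    )
    req = urllib.request.Request(base_url.rstrip("/") + path, method="DELETE")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    return req


if __name__ == "__main__":
    # Placeholder host and credentials; nothing is sent here.
    req = build_delete_queue_request(
        "https://broker.example.invalid",
        "/",
        "reval.queue.gmcsync.connection",
        "user",
        "pass",
    )
    print(req.get_method(), req.full_url)
```

Passing the returned request to `urllib.request.urlopen` would perform the deletion, so it is worth confirming first that no application still consumes from or publishes to the queue.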

...
