Date

Authors

Cai Willis, Steven Howard

Status

Documenting

Summary

Jira ticket: TIS21-5261

Impact

Doctor records were not updated, and reval was unavailable for some regions for the whole day. The regional updates ran sequentially, in the order below.
1-1RSG4X0 Yorkshire and The Humber
1-1RSSPZ7 East Midlands
1-1RSSQ05 East of England
1-1RSSQ1B North East
1-1RSSQ2H North West 1200 confirmed docs available
1-1RSSQ5L South London
1-1RSSQ6R Thames Valley
1-1RUZUSF Wessex
1-1RUZUVB South West 1307 reported docs still missing
1-1RUZUYF West Midlands
1-1RUZV1D Kent, Surrey and Sussex
1-1RUZV4H North Central East London
1-1RUZV6H North West London

...

  • 23:56 "Restarting Consumer" logs appear from the recommendation service

  • 23:56 "Suspended all listeners and will no longer accept client connections" logs appear from the AWS RabbitMQ broker

  • 00:01 "memory resource limit alarm set on node" and "Publishers will be blocked until this alarm clears" logs appear from the AWS RabbitMQ broker

  • 00:05 "Start message has been sent to start gmc sync" log appears in the recommendation service, but this message is not delivered until much later

  • 09:44-ish Messages shovelled from the ESR DLQ, possibly freeing up a little space?

  • 09:52 GMC Client Service receives a message to start the doctor sync, and the sync begins.

  • 11:00-ish reval.queue.gmcsync.connection purged of 1.6 million messages

  • Functionality restored (overnight sync job ran successfully)

Root Cause(s)

5 Whys

  1. Why didn’t the GMC overnight sync job start on time? - Because RabbitMQ ran out of memory

  2. Why did RabbitMQ run out of memory? - Because of millions of messages in the following queues: esr.queue.audit.neo, tis.queue.cdc.created, reval.queue.gmcsync.connection (is this the real reason?)

  3. Why were there millions of messages in these queues? - Why so many? Is this normal? Why was it taking so long to process them?

  4. Why

  5. Why
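
The open questions in whys 2 and 3 depend on knowing how many messages each queue holds. As a minimal sketch, assuming the broker's management HTTP API is reachable, the snippet below lists queues whose backlog is at or above a threshold. The threshold, the helper names, and every sample count except the 1.6 million figure from the timeline are hypothetical; the report does not say what a normal backlog looks like.

```python
import base64
import json
import urllib.request

# Hypothetical threshold; the report does not state what a normal backlog is.
BACKLOG_THRESHOLD = 1_000_000


def fetch_queues(base_url, user, password):
    """Fetch queue stats from the RabbitMQ management HTTP API (GET /api/queues)."""
    request = urllib.request.Request(f"{base_url}/api/queues")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def backlogged_queues(queues, threshold=BACKLOG_THRESHOLD):
    """Return (name, message count) pairs at or above the threshold, largest first."""
    flagged = [
        (q["name"], q.get("messages", 0))
        for q in queues
        if q.get("messages", 0) >= threshold
    ]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Illustrative data only: the 1.6 million figure comes from the timeline,
    # the other two counts are made up for this example.
    sample = [
        {"name": "reval.queue.gmcsync.connection", "messages": 1_600_000},
        {"name": "esr.queue.audit.neo", "messages": 2_500_000},
        {"name": "tis.queue.cdc.created", "messages": 40},
    ]
    for name, count in backlogged_queues(sample):
        print(f"{name}: {count} messages")
```

Against a live broker, the list returned by fetch_queues can be passed straight to backlogged_queues. Keeping the filter separate from the HTTP call means it can be checked without broker access.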

...