
Date

Authors

Cai Willis, Steven Howard

Status

Done

Summary

Jira: TIS21-5261

Impact

Doctors were not updated, and reval was unavailable for some regions for the whole day. The updates ran sequentially, in the order listed below.
1-1RSG4X0 Yorkshire and The Humber
1-1RSSPZ7 East Midlands
1-1RSSQ05 East of England
1-1RSSQ1B North East
1-1RSSQ2H North West 1200 confirmed docs available
1-1RSSQ5L South London
1-1RSSQ6R Thames Valley
1-1RUZUSF Wessex
1-1RUZUVB South West 1307 reported docs still missing
1-1RUZUYF West Midlands
1-1RUZV1D Kent, Surrey and Sussex
1-1RUZV4H North Central East London
1-1RUZV6H North West London

...

  1. Why didn’t the GMC Overnight Sync Job start on time? - Because RabbitMQ ran out of memory

  2. Why did RabbitMQ run out of memory? - Because of millions of messages in the following queues: esr.queue.audit.neo, tis.queue.cdc.created, reval.queue.gmcsync.connection (is this the real reason?)

  3. Why were there millions of messages in these queues? - Why so many? Is this normal? Why was it taking so long to process them?

  4. Why

  5. Why? - NeoAudit containers have a habit of dropping their connection to RabbitMQ, and the CDC rabbit router was unable to consume/publish messages from the queue

  6. Why do the containers drop connection to RabbitMQ? - This requires investigation. It is a not-uncommon issue where RabbitMQ and ECS drop connections without an obvious cause

Causes

  • RabbitMQ ran out of memory

  • esr.queue.audit.neo and tis.queue.cdc.created had millions of messages (recorded at about 10:30 am) (Active queues)

  • reval.queue.gmcsync.connection had 1.6 million messages (Obsolete unused queue)
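A backlog like the ones listed above could be spotted earlier by polling queue depths. The following is a minimal sketch, assuming the RabbitMQ management plugin is enabled and its `GET /api/queues` endpoint is reachable. The base URL, credentials, and the 100,000-message threshold are illustrative assumptions, not values from this incident.

```python
import base64
import json
import urllib.request


def fetch_queues(base_url, user, password):
    """Fetch queue stats from the RabbitMQ management API (GET /api/queues).

    base_url, user and password are placeholders for the real broker details.
    """
    request = urllib.request.Request(f"{base_url}/api/queues")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def find_backlogged_queues(queues, max_messages=100_000):
    """Return (name, messages, consumers) for queues over the threshold.

    Results are sorted deepest first. The threshold is an assumed value
    and would need tuning against normal traffic.
    """
    flagged = [
        (q["name"], q.get("messages", 0), q.get("consumers", 0))
        for q in queues
        if q.get("messages", 0) > max_messages
    ]
    return sorted(flagged, key=lambda item: item[1], reverse=True)
```

A queue that shows a large `messages` count together with zero `consumers` is a likely candidate for an obsolete queue that is still being published to.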

...

Action Items

Owner

Unbind and delete the queue reval.queue.gmcsync.connection (it is not currently used by any application). Also remove references to reval.queue.masterdoctorview.updated.connection in init.json

Cai Willis

DONE

Investigate why NeoAudit lost connection to RabbitMQ (Ticket)

Cai Willis

Ticket Created: TIS21-5282

Set up alerts for RabbitMQ low memory (Ticket)

Cai Willis

Ticket Created: TIS21-5283

(Nice to have) Broader RabbitMQ health check alerting

...

Lessons Learned

  • Don’t leave obsolete queues lying around - especially when they’re still being published to

  • There’s no such thing as too much alerting
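The low-memory alerting proposed in the action items could take roughly the following shape. This is a sketch only. It assumes node data has already been fetched from the management API's `GET /api/nodes` endpoint, which reports `mem_used`, `mem_limit`, and `mem_alarm` per node. The 80% warning ratio and the message wording are assumptions.

```python
def memory_alerts(nodes, warn_ratio=0.8):
    """Return alert strings for nodes that are near or over the memory watermark.

    nodes is the parsed JSON list from GET /api/nodes.
    warn_ratio is an assumed early-warning threshold, not a RabbitMQ setting.
    """
    alerts = []
    for node in nodes:
        name = node.get("name", "unknown")
        used = node.get("mem_used", 0)
        limit = node.get("mem_limit", 0)
        if node.get("mem_alarm", False):
            # The broker has already raised its memory alarm.
            alerts.append(f"{name}: memory alarm raised, publishers are blocked")
        elif limit > 0 and used / limit >= warn_ratio:
            # Early warning before the broker's own alarm triggers.
            alerts.append(f"{name}: memory at {used / limit:.0%} of high watermark")
    return alerts
```

Warning at a ratio below the broker's own watermark leaves time to drain or purge queues before publishers are blocked.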