Date

Authors

Cai Willis, Steven Howard

Status

Documenting

Summary

TIS21-5261

Impact

Doctor records were not updated, and revalidation was unavailable for some regions for the whole day. The regions were updated sequentially, in the order below.
1-1RSG4X0 Yorkshire and The Humber
1-1RSSPZ7 East Midlands
1-1RSSQ05 East of England
1-1RSSQ1B North East
1-1RSSQ2H North West 1200 confirmed docs available
1-1RSSQ5L South London
1-1RSSQ6R Thames Valley
1-1RUZUSF Wessex
1-1RUZUVB South West 1307 reported docs still missing
1-1RUZUYF West Midlands
1-1RUZV1D Kent, Surrey and Sussex
1-1RUZV4H North Central East London
1-1RUZV6H North West London

Non-technical Description

The overnight job that synchronizes connected doctors from the GMC into our system normally runs from midnight until the early morning. Today it did not start until about 10am, which means the various doctor lists in the revalidation app were not updated.

This occurred because our messaging system failed, so the relevant part of our system did not receive the message to start the synchronization until far later than it should have.


Trigger

A large volume of messages in RabbitMQ caused an out-of-memory error, preventing publishers from posting messages to their queues.
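As an illustration of how this condition could be caught earlier, the sketch below classifies a single node entry of the kind returned by the RabbitMQ management API at `/api/nodes`, using its `mem_used`, `mem_limit` and `mem_alarm` fields. The function name, the sample values and the 80% warning threshold are assumptions made for this example, not part of our current monitoring.

```python
from typing import Any, Dict


def memory_status(node: Dict[str, Any], warn_ratio: float = 0.8) -> str:
    """Classify one node entry from the management API's /api/nodes response.

    warn_ratio is a hypothetical threshold chosen for illustration.
    """
    # Once the broker raises its memory alarm, publishers are blocked.
    if node.get("mem_alarm"):
        return "alarm"
    limit = node.get("mem_limit") or 0
    used = node.get("mem_used") or 0
    # Warn before the alarm fires, while publishers can still post.
    if limit > 0 and used / limit >= warn_ratio:
        return "warning"
    return "ok"


# Illustrative values only, not taken from the incident.
sample_node = {"name": "rabbit@example", "mem_used": 900, "mem_limit": 1000, "mem_alarm": False}
print(memory_status(sample_node))  # prints "warning"
```

A check like this would only help if it ran against the broker regularly and fed an alert, which is a separate decision.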

Detection

Some DLQ monitoring alerts were flagged in the morning, and a developer spotted the lack of doctors on the revalidation application.


Resolution

The sync process had already started up again on its own, so there was no need to actively change anything. However, one obsolete queue was purged of a high volume of messages, and one service had to be restarted because it had lost its connection to RabbitMQ.


Timeline

All times in BST unless indicated

  • 23:56 "Restarting Consumer" logs appear from recommendation service

  • 23:56 "Suspended all listeners and will no longer accept client connections" logs appear from AWS RabbitMQ broker

  • 00:01 "memory resource limit alarm set on node" and "Publishers will be blocked until this alarm clears" logs appear from AWS RabbitMQ broker

  • 00:05 "Start message has been sent to start gmc sync" log appears in recommendation service, but this message is not delivered until much later

  • 09:44-ish Messages shovelled from ESR DLQ, possibly freeing up a little space?

  • 09:52 GMC Client Service receives a message to start the doctor sync, and the sync begins.

  • 11:00-ish reval.queue.gmcsync.connection purged of 1.6 million messages

Root Cause(s)

5 Whys

  1. Why didn’t the GMC overnight sync job start on time? - Because RabbitMQ ran out of memory

  2. Why did RabbitMQ run out of memory? - Because of millions of messages in the following queues: esr.queue.audit.neo, tis.queue.cdc.created, reval.queue.gmcsync.connection (is this the real reason?)

  3. Why were there millions of messages in these queues? - Why so many? Is this normal? Why was it taking so long to process them?

  4. Why

  5. Why

Causes

  • RabbitMQ ran out of memory

  • esr.queue.audit.neo and tis.queue.cdc.created had millions of messages (recorded at about 10:30 am) (Active queues)

  • reval.queue.gmcsync.connection had 1.6 million messages (Obsolete unused queue)
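A rough sketch of how backlogs like these could be flagged automatically is shown below. It assumes queue entries shaped like those returned by the RabbitMQ management API at `/api/queues` (fields `name`, `messages`, `consumers`). The function name, the 100,000-message threshold, and the sample queue names and counts are all hypothetical. Treating "messages but no consumers" as a sign of an obsolete queue is also an assumption.

```python
from typing import Any, Dict, List, Tuple


def suspicious_queues(
    queues: List[Dict[str, Any]], max_messages: int = 100_000
) -> List[Tuple[str, int, int]]:
    """Return (name, depth, consumers) for queues that look unhealthy.

    A queue is flagged if its backlog reaches max_messages, or if it
    holds any messages while nothing is consuming from it.
    """
    flagged = []
    for q in queues:
        depth = q.get("messages", 0)
        consumers = q.get("consumers", 0)
        if depth >= max_messages or (depth > 0 and consumers == 0):
            flagged.append((q["name"], depth, consumers))
    # Largest backlog first, so the worst offender is at the top.
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical queue names and counts, for illustration only.
sample = [
    {"name": "example.queue.busy", "messages": 250_000, "consumers": 2},
    {"name": "example.queue.orphaned", "messages": 40, "consumers": 0},
    {"name": "example.queue.healthy", "messages": 12, "consumers": 1},
]
for name, depth, consumers in suspicious_queues(sample):
    print(name, depth, consumers)
```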

Action Items

Action Item: Unbind and delete the queue reval.queue.gmcsync.connection (It’s not currently used by any application)

Owner: Cai Willis
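One possible way to carry out the deletion is through the RabbitMQ management HTTP API, which accepts a `DELETE` request on `/api/queues/{vhost}/{name}`. Deleting a queue also removes its bindings. The sketch below only builds the request URL; the helper name, the broker address and the default vhost `/` are assumptions, and whatever is still publishing to the queue would need to be stopped separately.

```python
from urllib.parse import quote


def queue_delete_url(base_url: str, vhost: str, queue: str) -> str:
    """Build the management API URL for deleting a queue.

    The vhost and queue name are percent-encoded, so the default
    vhost "/" becomes "%2F" in the path.
    """
    return "{}/api/queues/{}/{}".format(
        base_url.rstrip("/"), quote(vhost, safe=""), quote(queue, safe="")
    )


# Hypothetical broker address; the vhost "/" is assumed.
print(queue_delete_url("https://broker.example:443", "/", "reval.queue.gmcsync.connection"))
# prints https://broker.example:443/api/queues/%2F/reval.queue.gmcsync.connection
```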


Lessons Learned

  • Don’t leave obsolete queues lying around - especially when they’re still being published to
