| | |
|---|---|
| Date | |
| Authors | |
| Status | Documentation: Done |
| Summary | The level of use while maintenance was taking place put the broker into a state of protecting in-flight information. |
| Impact | Applications were unable to fully use the message broker until RabbitMQ stopped synchronising nodes. After that, a backlog of processing went through in significant numbers, including overnight updates for Revalidation. ESR Applicants and Notifications which would normally have gone on Monday went on Tuesday. Three attempts to sign Conditions of Joining failed. |
Non-technical Description
The message broker is a service that helps to connect systems without needing them all to be 100% available and responding instantly all of the time.
The broker has several coordinated computers (nodes) to ensure it functions properly. During the weekly maintenance period there were so many “in-flight” messages that, when one of the nodes was replaced and needed to catch up with the other two, it used too much of the memory on those two remaining nodes. That placed the broker into a protective “quarantine” mode, which allowed messages to be read but restricted the publishing of new ones.
We use the message broker for most of how we process information to and from ESR, for some of how we pull information from the GMC, and for recording when trainees sign Conditions of Joining.
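To illustrate the protective behaviour described above: when a memory alarm trips, RabbitMQ keeps delivering messages to consumers but tells publishers to pause by sending a Connection.Blocked notification on their connections. The sketch below is a minimal illustration only, assuming the Python pika client with a placeholder host and queue name rather than our real configuration; it shows how a publishing application can surface that blocked state instead of appearing to hang.

```python
# Minimal sketch (not production code): a publisher that reports when the
# broker blocks publishing due to a resource alarm.  Assumes the Python
# "pika" client; host and queue name are placeholders.
import pika

params = pika.ConnectionParameters(
    host="localhost",               # placeholder broker host
    blocked_connection_timeout=60,  # tear the connection down if blocked > 60s
)
connection = pika.BlockingConnection(params)

# RabbitMQ sends Connection.Blocked when a memory (or disk) alarm trips;
# consumers keep being served, but publishes on this connection are paused.
connection.add_on_connection_blocked_callback(
    lambda *args: print("Broker has blocked publishing (resource alarm)")
)
connection.add_on_connection_unblocked_callback(
    lambda *args: print("Broker has resumed accepting publishes")
)

channel = connection.channel()
channel.queue_declare(queue="coj.signed", durable=True)  # hypothetical queue
channel.basic_publish(exchange="", routing_key="coj.signed", body=b"example payload")
connection.close()
```

Consumers are unaffected by the block, which matches what was observed: messages could still be read while publishing was restricted.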
[draw.io diagram]
...
Trigger
Automated maintenance with a large queue backlog
...
22:00-23:59 - Message Broker maintenance window
22:50 - Broker nodes begin being restarted
22:56 - One of the secondary nodes becomes unreachable from the primary
23:28 - Primary and one secondary enter a “Memory Alarm” state, although the memory usage reported for the broker did not reflect this (see the monitoring sketch after this timeline)
08:41 (following morning) - First of three failed attempts by a trainee to digitally sign Conditions of Joining
08:45 - Picked up the alert and began identifying the issue. We paused some of the feeds in order to free up some memory, and stopped the sync process onto the (new) node for one of the two queues that were syncing
12:11 - Cancelled the sync of the largest queue
12:13 - The other queue finished synchronising
- A further maintenance window later forced synchronisation again, this time with much smaller queues
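Working out that two nodes were in a memory-alarm state took some digging, because the headline memory figure for the broker looked normal. A small check along the following lines, assuming the RabbitMQ management HTTP API is enabled and using a placeholder endpoint and credentials, would make the per-node state visible at a glance:

```python
# Hedged sketch: poll the management HTTP API for per-node memory state.
# Endpoint and credentials are placeholders, not our real values.
import requests

MGMT = "http://localhost:15672"  # management plugin endpoint (assumed)
AUTH = ("guest", "guest")        # placeholder credentials

for node in requests.get(f"{MGMT}/api/nodes", auth=AUTH, timeout=10).json():
    used, limit = node["mem_used"], node["mem_limit"]
    flag = "  <-- MEMORY ALARM" if node["mem_alarm"] else ""
    print(f"{node['name']}: {used / limit:.0%} of memory high watermark{flag}")
```

For classic mirrored queues, the synchronisation itself can be stopped with `rabbitmqctl cancel_sync_queue <queue>` (or from the management UI), which is the kind of step taken at 12:11.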
...
| Action Items | Owner | Comments |
|---|---|---|
| Can AWS tell us anything about what was happening? Simulate the failure / test pre-emptive alerting | | Mini-hack PoC prep |
| Rearchitect | | Mini-hack PoCs & improvements to python scripts, e.g. time-based scaling of audit consumers (based on when messages are usually published) |
| Alert on a certain queue size (see the sketch after this table) | | |
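A rough sketch of the “Alert on a certain queue size” item, again using the management HTTP API; the endpoint, credentials, threshold and alert destination are illustrative assumptions rather than agreed values:

```python
# Rough sketch for queue-depth alerting; threshold, endpoint, credentials
# and the alert destination (stdout here) are illustrative assumptions.
import requests

MGMT = "http://localhost:15672"  # management API endpoint (assumed)
AUTH = ("guest", "guest")        # placeholder credentials
THRESHOLD = 50_000               # example backlog threshold

def check_queue_depths() -> None:
    for queue in requests.get(f"{MGMT}/api/queues", auth=AUTH, timeout=10).json():
        depth = queue.get("messages", 0)
        if depth > THRESHOLD:
            # In practice this would go to our alerting channel, not stdout.
            print(f"ALERT: {queue['vhost']}/{queue['name']} holds {depth} messages")

if __name__ == "__main__":
    check_queue_depths()
```

Run on a schedule, a check like this could flag a growing backlog before a maintenance window starts rather than after a memory alarm has already tripped.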
...