Date:

Authors: Joseph (Pepe) Kelly

Status:

Done

Summary

The level of use while maintenance was taking place put the broker into a state that protects in-flight information.

Jira: TIS21-5712

Impact

Applications were unable to fully use the message broker until RabbitMQ stopped synchronising nodes. After that, a backlog of processing went through in significant numbers, including overnight updates for Revalidation. ESR Applicants and Notifications which would normally have gone through on Monday went through on Tuesday. 3 attempts to sign Conditions of Joining failed.

Non-technical Description

The message broker is a service that helps to connect systems without needing them all to be 100% available and responding instantly all of the time.

The broker has several coordinated computers (nodes) to ensure it functions properly. In the weekly maintenance period, there were so many “in-flight” messages that when one of the nodes was replaced and needed to catch up with the other two, it used too much of the memory from those 2 remaining nodes. That placed the broker into a protective “quarantine” mode, allowing messages to be read but restricting the publishing of messages.

We use the message broker for most of how we process information to and from ESR, as well as for some of how we pull information from the GMC and for recording when Conditions of Joining are signed by trainees.
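For illustration only, the sketch below (assuming the Python pika client; the host and queue name are placeholders) shows how a publishing application can fail fast and raise an alert when the broker blocks publishers in this protective state, rather than hanging silently.

# Sketch only: a publisher that surfaces the broker's protective state instead
# of hanging. Assumes the Python pika client; host and queue name are placeholders.
import pika
from pika.exceptions import AMQPConnectionError

params = pika.ConnectionParameters(
    host="localhost",
    # If the broker blocks publishers (e.g. a memory alarm) for longer than
    # this, pika tears the connection down and the publish fails with a
    # connection error rather than waiting indefinitely.
    blocked_connection_timeout=30,
)

connection = pika.BlockingConnection(params)
channel = connection.channel()
channel.confirm_delivery()  # wait for the broker to confirm each publish

try:
    channel.basic_publish(
        exchange="",
        routing_key="example.queue",  # placeholder queue name
        body=b"hello",
        properties=pika.BasicProperties(delivery_mode=2),  # persistent message
    )
except AMQPConnectionError:
    print("Broker refused or blocked the publish - raise an alert here")
finally:
    if connection.is_open:
        connection.close()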

[Diagram: MessageBrokerFailure.drawio]

Trigger

Automated maintenance while a large backlog of messages was queued


Detection

Alerts in monitoring channel
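One pre-emptive check that could feed the same channel is polling the broker for memory alarms. A minimal sketch, assuming the RabbitMQ management plugin is enabled on its default port; the credentials and webhook URL are placeholders:

# Sketch only: poll the RabbitMQ management API for node memory alarms and post
# to the monitoring channel. Port, credentials and webhook URL are placeholders.
import requests

NODES_API = "http://localhost:15672/api/nodes"
WEBHOOK = "https://example.invalid/monitoring-webhook"  # placeholder

def check_memory_alarms():
    nodes = requests.get(NODES_API, auth=("guest", "guest"), timeout=10).json()
    for node in nodes:
        if node.get("mem_alarm"):
            used_mb = node.get("mem_used", 0) / 1_000_000
            limit_mb = node.get("mem_limit", 0) / 1_000_000
            text = (f"{node['name']} has raised a memory alarm "
                    f"({used_mb:.0f}MB used of {limit_mb:.0f}MB limit)")
            requests.post(WEBHOOK, json={"text": text}, timeout=10)

if __name__ == "__main__":
    check_memory_alarms()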


Resolution

  1. Stopped the synchronisation across nodes in the cluster (stopped syncing the third node for one of the 2 queues being synchronised)

  2. Horizontally scaled consumers of the queues with large numbers of messages (see the sketch below)
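A minimal sketch of the kind of consumer that can be scaled out this way, assuming the Python pika client; the queue name is a placeholder. Horizontal scaling here simply means running more copies of this process or container against the same queue, while the prefetch limit keeps each copy working through the backlog in manageable batches.

# Sketch only: a queue consumer of the kind that was scaled out. Assumes the
# Python pika client; the queue name is a placeholder.
import pika

QUEUE = "audit.queue"  # placeholder

def handle(channel, method, properties, body):
    # ... process the message ...
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.basic_qos(prefetch_count=50)  # work through the backlog in small batches
channel.basic_consume(queue=QUEUE, on_message_callback=handle)
channel.start_consuming()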


Timeline

  • 22:00-23:59 - Message Broker maintenance window

  • 22:50 - Broker nodes begin being restarted

  • 22:56 - One of the secondary nodes becomes unreachable from the primary

  • 23:28 - Primary and 1 secondary enter a “Memory Alarm” state, although the memory usage reported for the broker did not reflect this

  • 08:41 - First of 3 failed attempts by a trainee to digitally sign Conditions of Joining

  • 08:45 - Picked up the alert and began identifying the issue. We paused some of the feeds in order to free up some memory

  • 12:11 - Cancelled Sync of largest queue

  • 12:13 - Other queue finished synchronising

  • - Further maintenance window forced synchronisation, this time with much smaller queues

5 Whys (or other analysis of Root Cause)

  • We received “connection timeout” alerts because the message broker was unavailable.

  • The broker was in a “Memory Quarantine State”

  • A node had been restarted and required synchronising over 1.5M messages

  • Message broker underwent maintenance (restarts/replacement) with too many messages queued up

  • 2 microservices with audit queues had stopped consuming messages.
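The last point suggests watching for queues that are growing with no consumers attached. A hedged sketch using the RabbitMQ management API (default port and credentials assumed; the backlog threshold is purely illustrative):

# Sketch only: flag queues that are backing up or have lost all their consumers,
# via the RabbitMQ management API. Port, credentials and threshold are placeholders.
import requests

QUEUES_API = "http://localhost:15672/api/queues"
MAX_BACKLOG = 100_000  # illustrative threshold

def find_unhealthy_queues():
    problems = []
    for queue in requests.get(QUEUES_API, auth=("guest", "guest"), timeout=10).json():
        backlog = queue.get("messages", 0)
        consumers = queue.get("consumers", 0)
        if backlog and consumers == 0:
            problems.append(f"{queue['name']}: {backlog} messages and no consumers")
        elif backlog > MAX_BACKLOG:
            problems.append(f"{queue['name']}: backlog of {backlog} messages")
    return problems

if __name__ == "__main__":
    for problem in find_unhealthy_queues():
        print(problem)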


Action Items

  • Simulate the failure / test pre-emptive alerting
    Owner: Joseph (Pepe) Kelly
    Comments: Mini-hack PoC prep

  • Rearchitect:

    • Splitting the messaging to limit the impact of spread?

    • SQS instead of RabbitMQ

    • Make neoaudit containers (a consumer sketch follows this list):

      • re-establish connection

      • prefetch / batch consuming

    • PoC to show how flow control would be implemented

  • Mini-hack PoCs & improvements to python scripts:
    e.g. time-based scaling of audit consumers (based on when messages are usually published)

  • Alert on a certain queue size
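For the neoaudit re-connection and prefetch/batch items above, a minimal sketch of one common pattern, assuming the Python pika client; the queue name is a placeholder, not the real neoaudit configuration. The consumer reconnects after the broker restarts or drops the connection instead of exiting and leaving the queue unconsumed.

# Sketch only: a consumer that re-establishes its connection and uses a
# prefetch limit. Assumes the Python pika client; the queue name is a placeholder.
import time
import pika
from pika.exceptions import AMQPConnectionError

QUEUE = "audit.queue"  # placeholder

def handle(channel, method, properties, body):
    # ... write the audit record ...
    channel.basic_ack(delivery_tag=method.delivery_tag)

def consume_forever():
    while True:
        try:
            connection = pika.BlockingConnection(
                pika.ConnectionParameters(host="localhost"))
            channel = connection.channel()
            channel.basic_qos(prefetch_count=50)  # batch-style consuming
            channel.basic_consume(queue=QUEUE, on_message_callback=handle)
            channel.start_consuming()
        except AMQPConnectionError:
            # Broker unreachable or connection dropped: back off briefly and
            # reconnect rather than exiting and leaving the queue unconsumed.
            time.sleep(5)

if __name__ == "__main__":
    consume_forever()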


Lessons Learned