
Date

Authors

Joseph (Pepe) Kelly

Status

Documentation

Summary

Impact

Applications were unable to fully use the message broker until RabbitMQ stopped synchronising nodes. After that, the queued processing went through in significant numbers, including the overnight updates for Revalidation.

Non-technical Description


Trigger

Automated maintenance with large queue


Detection

Alerts in monitoring channel


Resolution

  1. Stopped the synchronisation across nodes in the cluster (see the sketch after this list)

  2. Horizontally scaled the consumers of queues with large numbers of messages
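
A minimal sketch of how step 1 could be done against the RabbitMQ management HTTP API; the host, credentials, vhost and queue names below are illustrative assumptions rather than values from this incident:

  # Hypothetical sketch only: stop an in-progress classic mirrored-queue
  # synchronisation via the RabbitMQ management HTTP API. Host, credentials,
  # vhost and queue names are placeholders, not values from the incident.
  # The CLI equivalent is `rabbitmqctl cancel_sync_queue <queue>`.
  import requests

  BROKER_API = "https://rabbitmq.example.internal:15671/api"  # assumed endpoint
  AUTH = ("monitoring-user", "secret")                        # assumed credentials
  VHOST = "%2F"                                               # default vhost, URL-encoded

  def cancel_sync(queue_name: str) -> None:
      """Ask the broker to cancel synchronisation of one queue's mirrors."""
      url = f"{BROKER_API}/queues/{VHOST}/{queue_name}/actions"
      response = requests.post(url, json={"action": "cancel_sync"}, auth=AUTH, timeout=10)
      response.raise_for_status()

  for queue in ("audit.queue.large", "audit.queue.other"):  # placeholder queue names
      cancel_sync(queue)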


Timeline

  • 22:00-23:59 - Message Broker maintenance window

  • 22:50 - Broker nodes begin being restarted

  • 22:56 - One of the secondary nodes becomes unreachable from the primary

  • 23:28 - Primary and 1 secondary enter a “Memory Alarm” state, although the memory reported for the broker did not reflect this

  • 08:41 - First of 3 failed attempts by a trainee to digitally sign conditions of joining

  • 08:45 - Picked up the alert and began identifying the issue. We paused some of the feeds in order to free up some capacity

  • - Stopped the sync process for a (new) node for 1 of the 2 queues that were syncing

  • 12:11 - Cancelled the sync of the largest queue

  • 12:13 - Other queue finished synchronising

  • - A further maintenance window forced synchronisation, this time with much smaller queues

5 Whys (or other analysis of Root Cause)

  • We received “connection timeout” alerts because the message broker was unavailable.

  • The broker was in a “Memory Alarm” state, which blocks publishing connections

  • A node had been started and needed to synchronise over 1.5M messages (see the check sketch after this list)

  • The message broker underwent maintenance (restarts/replacement) while too many messages were queued up

  • 2 microservices with audit queues had stopped consuming messages.
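
For reference, a minimal sketch of how the memory alarm and the backlog size could be confirmed via the RabbitMQ management HTTP API; the host, credentials and queue name are assumptions:

  # Hypothetical sketch: confirm the memory alarm state and the size of the
  # backlog that had to be synchronised, via the RabbitMQ management HTTP API.
  # Host, credentials and the queue name are placeholders.
  import requests

  BROKER_API = "https://rabbitmq.example.internal:15671/api"  # assumed endpoint
  AUTH = ("monitoring-user", "secret")                        # assumed credentials

  # Per-node memory usage and alarm flags.
  for node in requests.get(f"{BROKER_API}/nodes", auth=AUTH, timeout=10).json():
      print(node["name"], "mem_alarm:", node["mem_alarm"],
            "mem_used_MiB:", node["mem_used"] // 2**20)

  # Depth of the queue that needed to synchronise ~1.5M messages ("%2F" = default vhost).
  queue = requests.get(f"{BROKER_API}/queues/%2F/audit.queue", auth=AUTH, timeout=10).json()
  print(queue["name"], "messages:", queue.get("messages", 0))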


Action Items


What happened?

What caused the memory alarm?

When was this an issue? (✓ in timeline)

Can AWS tell us anything about what was happening?

Simulate the failure / test pre-emptive alerting

Rearchitect:

  • Splitting the messaging to limit the spread of the impact?

  • SQS instead of RabbitMQ

  • Make neoaudit containers:

    • re-establish connection

    • prefetch / batch consumption (see the consumer sketch after this list)

  • PoC to show how flow control would be implemented
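
A hedged sketch of what the neoaudit consumer changes could look like using the pika client: reconnect when the connection drops and set a prefetch window for batch-style consumption. The host, queue name, prefetch value and retry interval are assumptions, not an agreed design:

  # Hypothetical sketch of the proposed neoaudit consumer behaviour:
  # re-establish the connection when it drops, and use a prefetch window so
  # messages are pulled in batches rather than one at a time.
  # Queue name, host and prefetch value are placeholders.
  import time
  import pika

  HOST = "rabbitmq.example.internal"  # assumed broker host
  QUEUE = "neoaudit.audit"            # assumed queue name

  def handle(channel, method, properties, body):
      # ... persist the audit event here ...
      channel.basic_ack(delivery_tag=method.delivery_tag)

  while True:
      try:
          connection = pika.BlockingConnection(pika.ConnectionParameters(host=HOST))
          channel = connection.channel()
          channel.basic_qos(prefetch_count=100)  # batch-style consumption / basic flow control
          channel.basic_consume(queue=QUEUE, on_message_callback=handle)
          channel.start_consuming()
      except pika.exceptions.AMQPConnectionError:
          time.sleep(5)  # back off, then re-establish the connection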

Mini-hack PoCs & improvements to Python scripts:

e.g. time-based scaling of audit consumers (based on when messages are usually published)
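
A sketch of what such a time-based scaling script might look like; the publishing window, replica counts and the hand-off to the actual scaler are all assumptions:

  # Hypothetical sketch of time-based scaling for the audit consumers: run more
  # consumers during the window when messages are usually published.
  # The window, replica counts and the final hand-off are placeholders.
  from datetime import datetime, time

  PUBLISH_WINDOW = (time(22, 0), time(2, 0))  # assumed overnight publishing window
  BUSY_CONSUMERS = 8                          # assumed replica count inside the window
  QUIET_CONSUMERS = 2                         # assumed replica count outside the window

  def in_window(now: time, start: time, end: time) -> bool:
      if start <= end:
          return start <= now < end
      return now >= start or now < end  # window wraps past midnight

  def desired_consumers(moment: datetime) -> int:
      return BUSY_CONSUMERS if in_window(moment.time(), *PUBLISH_WINDOW) else QUIET_CONSUMERS

  print(desired_consumers(datetime.now()))  # hand this off to whatever scales the containers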

Alert on a certain queue size
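
A minimal sketch of the queue-size alert, again against the management HTTP API; the threshold, host, credentials and the notify() hook are assumptions:

  # Hypothetical sketch: alert when any queue's backlog exceeds a threshold,
  # by polling the RabbitMQ management HTTP API. Host, credentials, threshold
  # and the notify() hook are placeholders.
  import requests

  BROKER_API = "https://rabbitmq.example.internal:15671/api"  # assumed endpoint
  AUTH = ("monitoring-user", "secret")                        # assumed credentials
  THRESHOLD = 100_000                                         # assumed alert threshold

  def notify(message: str) -> None:
      print("ALERT:", message)  # stand-in for posting to the monitoring channel

  for queue in requests.get(f"{BROKER_API}/queues", auth=AUTH, timeout=10).json():
      backlog = queue.get("messages", 0)
      if backlog > THRESHOLD:
          notify(f"{queue['name']} has {backlog} messages queued")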


Lessons Learned
