
Date

Authors

Joseph (Pepe) Kelly

Status

Documentation

Summary

Impact

Applications were unable to fully use the message broker until RabbitMQ stopped synchronising nodes. After that, the queued processing went through in significant numbers, including the overnight updates for Revalidation.

Non-technical Description


Trigger

Automated maintenance with large queue


Detection

Alerts in monitoring channel


Resolution

  1. Stopped the synchronisation across nodes in the cluster (see the sketch after this list)

  2. Horizontally scaled the consumers of queues with large numbers of messages
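
A minimal sketch of how step 1 could be done against the RabbitMQ management HTTP API; the host, credentials, vhost and queue names below are illustrative assumptions rather than values from this incident:

  # Hypothetical sketch only: stop an in-progress classic mirrored-queue
  # synchronisation via the RabbitMQ management HTTP API. Host, credentials,
  # vhost and queue names are placeholders, not values from the incident.
  # The CLI equivalent is `rabbitmqctl cancel_sync_queue <queue>`.
  import requests

  BROKER_API = "https://rabbitmq.example.internal:15671/api"  # assumed endpoint
  AUTH = ("monitoring-user", "secret")                        # assumed credentials
  VHOST = "%2F"                                               # default vhost, URL-encoded

  def cancel_sync(queue_name: str) -> None:
      """Ask the broker to cancel synchronisation of one queue's mirrors."""
      url = f"{BROKER_API}/queues/{VHOST}/{queue_name}/actions"
      response = requests.post(url, json={"action": "cancel_sync"}, auth=AUTH, timeout=10)
      response.raise_for_status()

  for queue in ("audit.queue.large", "audit.queue.other"):  # placeholder queue names
      cancel_sync(queue)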


Timeline

  • 22:00-23:59 - Message Broker maintenance window

  • 22:50 - Broker nodes begin being restarted

  • 22:56 - One of the secondary nodes becomes unreachable from the primary

  • 23:28 - Primary and 1 secondary enter a “Memory Alarm” state, although the memory reported for the broker did not reflect this

  • 08:41 - First of 3 failed attempts by a trainee to digitally sign conditions of joining

  • 08:45 - Picked up the alert and began identifying the issue. We paused some of the feeds in order to free up some capacity

  • - Stopped the sync process for a (new) node for 1 of the 2 queues that were syncing

  • 12:11 - Cancelled the sync of the largest queue

  • 12:13 - Other queue finished synchronising

  • - A further maintenance window forced synchronisation, this time with much smaller queues

5 Whys (or other analysis of Root Cause)

  • We received “connection timeout” alerts because the message broker was unavailable.

  • The broker was in a “Memory Alarm” state, which blocks publishing connections

  • A node had been started and needed to synchronise over 1.5M messages (see the check sketch after this list)

  • The message broker underwent maintenance (restarts/replacement) while too many messages were queued up

  • 2 microservices with audit queues had stopped consuming messages.
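
For reference, a minimal sketch of how the memory alarm and the backlog size could be confirmed via the RabbitMQ management HTTP API; the host, credentials and queue name are assumptions:

  # Hypothetical sketch: confirm the memory alarm state and the size of the
  # backlog that had to be synchronised, via the RabbitMQ management HTTP API.
  # Host, credentials and the queue name are placeholders.
  import requests

  BROKER_API = "https://rabbitmq.example.internal:15671/api"  # assumed endpoint
  AUTH = ("monitoring-user", "secret")                        # assumed credentials

  # Per-node memory usage and alarm flags.
  for node in requests.get(f"{BROKER_API}/nodes", auth=AUTH, timeout=10).json():
      print(node["name"], "mem_alarm:", node["mem_alarm"],
            "mem_used_MiB:", node["mem_used"] // 2**20)

  # Depth of the queue that needed to synchronise ~1.5M messages ("%2F" = default vhost).
  queue = requests.get(f"{BROKER_API}/queues/%2F/audit.queue", auth=AUTH, timeout=10).json()
  print(queue["name"], "messages:", queue.get("messages", 0))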


Action Items


What happened?

What caused the memory alarm?

When was this an issue? (✓ in timeline)

Can AWS tell us anything about what was happening?

Simulate the failure / test pre-emptive alerting

Rearchitect:

  • Splitting the messaging to limit the spread of the impact?

  • SQS instead of RabbitMQ

  • Make neoaudit containers:

    • re-establish connection

    • prefetch / batch consumption (see the consumer sketch after this list)

  • PoC to show how flow control would be implemented
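
A hedged sketch of what the neoaudit consumer changes could look like using the pika client: reconnect when the connection drops and set a prefetch window for batch-style consumption. The host, queue name, prefetch value and retry interval are assumptions, not an agreed design:

  # Hypothetical sketch of the proposed neoaudit consumer behaviour:
  # re-establish the connection when it drops, and use a prefetch window so
  # messages are pulled in batches rather than one at a time.
  # Queue name, host and prefetch value are placeholders.
  import time
  import pika

  HOST = "rabbitmq.example.internal"  # assumed broker host
  QUEUE = "neoaudit.audit"            # assumed queue name

  def handle(channel, method, properties, body):
      # ... persist the audit event here ...
      channel.basic_ack(delivery_tag=method.delivery_tag)

  while True:
      try:
          connection = pika.BlockingConnection(pika.ConnectionParameters(host=HOST))
          channel = connection.channel()
          channel.basic_qos(prefetch_count=100)  # batch-style consumption / basic flow control
          channel.basic_consume(queue=QUEUE, on_message_callback=handle)
          channel.start_consuming()
      except pika.exceptions.AMQPConnectionError:
          time.sleep(5)  # back off, then re-establish the connection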

Mini-hack PoCs & improvements to Python scripts:

e.g. time-based scaling of audit consumers (based on when messages are usually published)
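
A sketch of what such a time-based scaling script might look like; the publishing window, replica counts and the hand-off to the actual scaler are all assumptions:

  # Hypothetical sketch of time-based scaling for the audit consumers: run more
  # consumers during the window when messages are usually published.
  # The window, replica counts and the final hand-off are placeholders.
  from datetime import datetime, time

  PUBLISH_WINDOW = (time(22, 0), time(2, 0))  # assumed overnight publishing window
  BUSY_CONSUMERS = 8                          # assumed replica count inside the window
  QUIET_CONSUMERS = 2                         # assumed replica count outside the window

  def in_window(now: time, start: time, end: time) -> bool:
      if start <= end:
          return start <= now < end
      return now >= start or now < end  # window wraps past midnight

  def desired_consumers(moment: datetime) -> int:
      return BUSY_CONSUMERS if in_window(moment.time(), *PUBLISH_WINDOW) else QUIET_CONSUMERS

  print(desired_consumers(datetime.now()))  # hand this off to whatever scales the containers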

Alert on a certain queue size
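
A minimal sketch of the queue-size alert, again against the management HTTP API; the threshold, host, credentials and the notify() hook are assumptions:

  # Hypothetical sketch: alert when any queue's backlog exceeds a threshold,
  # by polling the RabbitMQ management HTTP API. Host, credentials, threshold
  # and the notify() hook are placeholders.
  import requests

  BROKER_API = "https://rabbitmq.example.internal:15671/api"  # assumed endpoint
  AUTH = ("monitoring-user", "secret")                        # assumed credentials
  THRESHOLD = 100_000                                         # assumed alert threshold

  def notify(message: str) -> None:
      print("ALERT:", message)  # stand-in for posting to the monitoring channel

  for queue in requests.get(f"{BROKER_API}/queues", auth=AUTH, timeout=10).json():
      backlog = queue.get("messages", 0)
      if backlog > THRESHOLD:
          notify(f"{queue['name']} has {backlog} messages queued")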


Lessons Learned
