...
There were 573 messages in the DLQ by the time the alert appeared in the #monitoring-prod Slack channel; we shovelled them to a holding queue.
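For reference, the shovel can be created through the RabbitMQ management HTTP API. Below is a minimal sketch assuming Python with requests; the management host, credentials, vhost, and the source queue name esr.dlq are placeholders, and only the holding queue name is taken from the timeline.

```python
# A minimal sketch, assuming Python + requests and the RabbitMQ
# management plugin; host, credentials, vhost, and the source queue
# name "esr.dlq" are placeholders. Only the destination queue name
# comes from the incident timeline.
import requests

MGMT = "https://rabbitmq.prod.example.com:15672"  # hypothetical host
VHOST = "%2F"  # default vhost, URL-encoded

shovel = {
    "value": {
        "src-protocol": "amqp091",
        "src-uri": "amqp://",                # local node
        "src-queue": "esr.dlq",              # placeholder DLQ name
        "src-delete-after": "queue-length",  # stop once the backlog is drained
        "dest-protocol": "amqp091",
        "dest-uri": "amqp://",
        "dest-queue": "esr.dlq.2023.09.01",  # holding queue from the timeline
    }
}

resp = requests.put(
    f"{MGMT}/api/parameters/shovel/{VHOST}/esr-dlq-2023-09-01",
    json=shovel,
    auth=("admin", "secret"),  # placeholder credentials
    timeout=10,
)
resp.raise_for_status()
```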
...
Resolution
Processed the affected files a second time by moving the dead-lettered messages back into the normal flow.
...
14:58 noticed the alert “RabbitMQ PROD too many messages in the DLQ” in the #monitoring-prod channel, then shovelled the messages to a holding queue “esr.dlq.2023.09.01”
10:32 had a huddle and looked into the messages in queue “esr.dlq.2023.09.01”
12:01 processed files from 1st Sept (moved the messages from the holding queue back into the normal flow for processing)
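A minimal sketch of that 12:01 requeue step, assuming Python with pika; the broker host and the “esr.inbound” exchange are hypothetical stand-ins for the real entry point of the normal flow.

```python
# A minimal sketch, assuming Python + pika; the broker host and the
# "esr.inbound" exchange are hypothetical stand-ins for the real
# entry point of the normal flow.
import pika

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
ch = conn.channel()

moved = 0
while True:
    method, props, body = ch.basic_get("esr.dlq.2023.09.01")
    if method is None:
        break  # holding queue is drained
    # Republish into the normal flow so the message is processed again.
    ch.basic_publish(exchange="esr.inbound", routing_key="",
                     body=body, properties=props)
    ch.basic_ack(method.delivery_tag)
    moved += 1

print(f"moved {moved} messages back into the normal flow")
conn.close()
```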
Root Cause(s)
...
The dead letter queue contained a variety of message types, so the failure was not specific to one type. The common factor was that the failed messages relied on enrichment with information from TCS.
TCS was busy but still functional; however, its lookups of information about the user making each request were timing out. At the same time, Profile was experiencing a short spike in CPU usage, the likely source of those timeouts.
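To illustrate the suspected failure mode, here is a minimal consumer sketch, assuming Python with pika and requests; the TCS endpoint and queue name are hypothetical. An enrichment call that exceeds its timeout leads to a nack with requeue=False, which routes the message to the configured dead-letter exchange and hence into the DLQ we alerted on.

```python
# A minimal sketch of the suspected failure mode, assuming Python +
# pika + requests; the TCS endpoint and queue name are hypothetical.
import pika
import requests

TCS_URL = "https://tcs.internal.example.com/users/lookup"  # placeholder

def handle(ch, method, properties, body):
    try:
        # Enrichment lookup against TCS; during the Profile CPU spike
        # these calls exceeded the client timeout.
        requests.get(TCS_URL, timeout=2)
        ch.basic_ack(method.delivery_tag)
    except requests.Timeout:
        # requeue=False routes the message to the configured
        # dead-letter exchange, i.e. into the DLQ.
        ch.basic_nack(method.delivery_tag, requeue=False)

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
ch = conn.channel()
ch.basic_consume(queue="esr.inbound", on_message_callback=handle)
ch.start_consuming()
```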
...