Date |
Authors |
Status | Done
Summary | The TCS service was repeatedly trying and failing to process an event, causing a large volume of additional log entries.
Impact | A significant increase in the monthly costs for logging, usually ~$300/month but currently at ~$1300 part way through the month.
Non-technical Description
The TIS services log a record of their actions to text files, which can be used, when needed, to investigate any issues encountered.
A new feature was added to one of our services, which listens for events occurring in TIS Self-Service (a Conditions of Joining being signed) and updates data within TIS to stay in sync. One of the events sent from our test system could not be processed correctly by TIS. Such cases are not uncommon; the expected behaviour is that the event is sent to a separate queue to be manually reviewed, but in this case a configuration issue meant that TIS repeatedly retried instead of rejecting the message.
This led to approx. 27.5 million attempts every day over a period of two weeks, causing a significant influx of additional text data and incurring all of the costs associated with processing those logs.
Trigger
The COJ signed event handler added to TCS was misconfigured, so it repeatedly retried instead of rejecting the message when it failed to process.
Detection
The AWS cost/billing tool showed a large increase in Cloudwatch usage.
Resolution
The IllegalArgumentException is now wrapped in an AmqpRejectAndDontRequeueException to ensure the message does not get requeued.
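For illustration, a minimal sketch of this kind of change, assuming a Spring AMQP @RabbitListener; the class, method, queue property, and payload type names here are hypothetical, not the actual TCS code:

```java
import org.springframework.amqp.AmqpRejectAndDontRequeueException;
import org.springframework.amqp.rabbit.annotation.RabbitListener;
import org.springframework.stereotype.Component;

@Component
public class CojSignedListener {

  // Hypothetical listener; queue name and payload type are illustrative only.
  @RabbitListener(queues = "${app.queues.coj-signed}")
  public void handleCojSigned(CojSignedEvent event) {
    try {
      // Process the event and update the matching TIS record.
      process(event);
    } catch (IllegalArgumentException e) {
      // Wrapping in AmqpRejectAndDontRequeueException tells Spring AMQP to
      // reject the message without requeueing it, so it is dead-lettered for
      // manual review instead of being redelivered in an endless retry loop.
      throw new AmqpRejectAndDontRequeueException("Unprocessable COJ signed event", e);
    }
  }

  private void process(CojSignedEvent event) {
    // Domain logic omitted.
  }

  /** Hypothetical event payload. */
  public record CojSignedEvent(String traineeId) {}
}
```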
Timeline
All times in GMT unless indicated
: 14:35 - TCS starts logging excessively due to retry loop
: 10:30 - Additional CloudWatch costs identified via AWS cost/billing tools
: 11:29 - The failing message was moved to a DLQ to stop the logging
: 11:01 - Permanent fix deployed to preprod
: 15:03 - Permanent fix deployed to prod
: ??:?? - Permanent fix deployed to NIMDTA - DEPLOY NOT YET APPROVED
Root Cause(s)
Why was there such excessive logging?
The Rabbit listener implementation did not throw the correct exception type.
Why was the correct exception not thrown?
Limited experience with RabbitMQ, expectations that it would work similarly to AWS SQS, and a tight deadline led to misunderstandings and assumptions about how Rabbit DLQs work.
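For context, RabbitMQ only dead-letters a message when it is rejected without requeue and the source queue has a dead-letter exchange configured; it does not provide SQS-style automatic redrive policies. A minimal Spring AMQP configuration sketch, with assumed queue and exchange names rather than the real TCS ones:

```java
import org.springframework.amqp.core.Binding;
import org.springframework.amqp.core.BindingBuilder;
import org.springframework.amqp.core.DirectExchange;
import org.springframework.amqp.core.Queue;
import org.springframework.amqp.core.QueueBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class CojQueueConfig {

  // Queue and exchange names are assumptions for illustration.
  @Bean
  Queue cojSignedQueue() {
    return QueueBuilder.durable("coj-signed")
        // Messages rejected without requeue are routed to the DLX/DLQ below
        // for manual review, rather than being redelivered to this queue.
        .deadLetterExchange("coj-signed-dlx")
        .deadLetterRoutingKey("coj-signed-dlq")
        .build();
  }

  @Bean
  Queue cojSignedDeadLetterQueue() {
    return QueueBuilder.durable("coj-signed-dlq").build();
  }

  @Bean
  DirectExchange cojSignedDeadLetterExchange() {
    return new DirectExchange("coj-signed-dlx");
  }

  @Bean
  Binding cojSignedDlqBinding() {
    return BindingBuilder.bind(cojSignedDeadLetterQueue())
        .to(cojSignedDeadLetterExchange())
        .with("coj-signed-dlq");
  }
}
```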
Why did it take so long to identify the issue?
Log size alerts were ignored and Sentry did not report any exceptions occurring.
Why were log size alerts ignored?
The log size alerts are too sensitive, resulting in a lot of false positives, and as a result they may have become background noise: something we never need to check because it resolves itself once the scheduled job (or similar) finishes.
Additionally, as the log size alarm is anomaly based, it actually “fixed” itself and the excessive logging became the new normal. At that point the alarm went back to “OK” status and we saw no further alerts.
Why did Sentry not report any exceptions occurring?
Sentry is not set up for the TCS service.
Action Items
Action Items | Owner | Status
---|---|---
Fix the Rabbit DLQ usage for COJ signed events | | DONE
Set up Sentry in TCS | | TICKET RAISED
Review log size alarm | | TICKET RAISED
Lessons Learned
Allowing an alarm to trigger consistently will lead to people subconsciously ignoring it
Anomaly detection doesn’t work if the anomaly is consistent
Don’t use a new tech for the first time when you’re already up against a deadline - go with what you know and rebuild later, or pull in knowledge from outside the team