Date |
Authors |
Status | Done
Summary | The TCS service was repeatedly trying and failing to process an event, causing a large volume of additional log entries.
Impact | A significant increase in the monthly costs for logging, usually ~$300/month but currently at ~$1300 part way through the month.
Non-technical Description
The TIS services log a record of their actions to text files, which can be used, when needed, to investigate any issues encountered.
A new feature was added to one of our services, which listens for events occurring in TIS Self-Service (a Conditions of Joining being signed) and updates data within TIS to stay in sync. One of the events sent from our test system could not be processed correctly by TIS. Such cases are not uncommon; the expected behaviour is that the event is sent to a separate queue to be manually reviewed, but in this case a configuration issue meant that TIS repeatedly retried instead of rejecting the message.
This led to approx. 27.5 million attempts every day over a period of two weeks, causing a significant influx of additional text data and incurring all of the costs associated with processing those logs.
Trigger
The COJ signed event handler added to TCS was misconfigured, so it repeatedly retried instead of rejecting the message when it failed to process.
Detection
The AWS cost/billing tool showed a large increase in Cloudwatch usage.
Resolution
The IllegalArgumentException is now wrapped in an AmqpRejectAndDontRequeueException to ensure the message does not get requeued.
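For illustration, a minimal sketch of this kind of change, assuming a Spring AMQP @RabbitListener; the class, method, queue property, and payload type names here are hypothetical, not the actual TCS code:

```java
import org.springframework.amqp.AmqpRejectAndDontRequeueException;
import org.springframework.amqp.rabbit.annotation.RabbitListener;
import org.springframework.stereotype.Component;

@Component
public class CojSignedListener {

  // Hypothetical listener; queue name and payload type are illustrative only.
  @RabbitListener(queues = "${app.queues.coj-signed}")
  public void handleCojSigned(CojSignedEvent event) {
    try {
      // Process the event and update the matching TIS record.
      process(event);
    } catch (IllegalArgumentException e) {
      // Wrapping in AmqpRejectAndDontRequeueException tells Spring AMQP to
      // reject the message without requeueing it, so it is dead-lettered for
      // manual review instead of being redelivered in an endless retry loop.
      throw new AmqpRejectAndDontRequeueException("Unprocessable COJ signed event", e);
    }
  }

  private void process(CojSignedEvent event) {
    // Domain logic omitted.
  }

  /** Hypothetical event payload. */
  public record CojSignedEvent(String traineeId) {}
}
```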
Timeline
All times in GMT unless indicated
: 14:35 - TCS starts logging excessively due to retry loop
: 10:30 - Additional CloudWatch costs identified via AWS cost/billing tools
: 11:29 - The failing message was moved to a DLQ to stop the logging
: 11:01 - Permanent fix deployed to preprod
: 15:03 - Permanent fix deployed to prod
: ??:?? - Permanent fix deployed to NIMDTA - DEPLOY NOT YET APPROVED
Root Cause(s)
Why was there such excessive logging?
The Rabbit listener implementation did not throw the correct exception type.
Why was the correct exception not thrown?
Limited experience with RabbitMQ, expectations that it would work similarly to AWS SQS, and a tight deadline led to misunderstandings and assumptions about how Rabbit DLQs work.
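For context, RabbitMQ only dead-letters a message when it is rejected without requeue and the source queue has a dead-letter exchange configured; it does not provide SQS-style automatic redrive policies. A minimal Spring AMQP configuration sketch, with assumed queue and exchange names rather than the real TCS ones:

```java
import org.springframework.amqp.core.Binding;
import org.springframework.amqp.core.BindingBuilder;
import org.springframework.amqp.core.DirectExchange;
import org.springframework.amqp.core.Queue;
import org.springframework.amqp.core.QueueBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class CojQueueConfig {

  // Queue and exchange names are assumptions for illustration.
  @Bean
  Queue cojSignedQueue() {
    return QueueBuilder.durable("coj-signed")
        // Messages rejected without requeue are routed to the DLX/DLQ below
        // for manual review, rather than being redelivered to this queue.
        .deadLetterExchange("coj-signed-dlx")
        .deadLetterRoutingKey("coj-signed-dlq")
        .build();
  }

  @Bean
  Queue cojSignedDeadLetterQueue() {
    return QueueBuilder.durable("coj-signed-dlq").build();
  }

  @Bean
  DirectExchange cojSignedDeadLetterExchange() {
    return new DirectExchange("coj-signed-dlx");
  }

  @Bean
  Binding cojSignedDlqBinding() {
    return BindingBuilder.bind(cojSignedDeadLetterQueue())
        .to(cojSignedDeadLetterExchange())
        .with("coj-signed-dlq");
  }
}
```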
Why did it take so long to identify the issue?
Log size alerts were ignored and Sentry did not report any exceptions occurring.
Why were log size alerts ignored?
The log size alerts are too sensitive, resulting in a lot of false positives, and as a result they may have become background noise: something we never need to check because it resolves itself once the scheduled job (or similar) finishes.
Additionally, as the log size alarm is anomaly based, it actually “fixed” itself and the excessive logging became the new normal. At that point the alarm went back to “OK” status and we saw no further alerts.
Why did Sentry not report any exceptions occurring?
Sentry is not set up for the TCS service.
Action Items
Action Items | Owner | Status
---|---|---
Fix the Rabbit DLQ usage for COJ signed events | | DONE
Set up Sentry in TCS | | TICKET RAISED
Review log size alarm | | TICKET RAISED
Lessons Learned
Allowing an alarm to trigger consistently will lead to people subconsciously ignoring it
Anomaly detection doesn’t work if the anomaly is consistent
Don’t use a new tech for the first time when you’re already up against a deadline - go with what you know and rebuild later, or pull in knowledge from outside the team