2023-05-18 Excessive logging caused significantly increased costs

Date

May 18, 2023

Authors

@Andy Dingley

Status

Done

Summary

The TCS service was repeatedly trying and failing to process an event, causing a large volume of additional log entries.

Impact

A significant increase in the monthly logging costs: usually ~$300/month, but already at ~$1300 part way through the month.

Non-technical Description

The TIS services log a record of their actions to text files which can be used, when needed, to investigate any issues encountered.

A new feature was added to one of our services which listens for events occurring in TIS Self-Service (a Conditions of Joining being signed) and updates data within TIS to stay in sync. One of the events sent from our test system could not be processed correctly by TIS. Such cases are not uncommon; the expected behaviour is that the event is sent to a separate queue to be manually reviewed, but in this case a configuration issue meant that TIS repeatedly retried the message instead of rejecting it.

This led to approximately 27.5 million attempts per day over a period of two weeks, causing a significant influx of additional log data and incurring all of the costs associated with processing those logs.


Trigger

  • The COJ signed event handler added to TCS was misconfigured, so it repeatedly retried instead of rejecting messages that failed to process.


Detection

  • The AWS cost/billing tool showed a large increase in Cloudwatch usage.


Resolution

  • IllegalArgumentException is now wrapped in an AmqpRejectAndDontRequeueException to ensure the message does not get requeued (see the sketch below).
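A minimal sketch of the shape of the fix, assuming a Spring AMQP @RabbitListener; the class, method and queue names below are illustrative, not the actual TCS code:

```java
import org.springframework.amqp.AmqpRejectAndDontRequeueException;
import org.springframework.amqp.rabbit.annotation.RabbitListener;
import org.springframework.stereotype.Component;

@Component
public class CojSignedListener {

  // Hypothetical queue name and payload type, for illustration only.
  @RabbitListener(queues = "coj-signed-queue")
  public void handleCojSigned(String message) {
    try {
      process(message);
    } catch (IllegalArgumentException e) {
      // Rethrowing as AmqpRejectAndDontRequeueException tells Spring AMQP to
      // reject the message without requeueing it, so RabbitMQ dead-letters it
      // instead of redelivering it in an endless retry (and logging) loop.
      throw new AmqpRejectAndDontRequeueException("Unprocessable COJ signed event", e);
    }
  }

  private void process(String message) {
    // Business logic that may throw IllegalArgumentException for bad payloads.
  }
}
```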


Timeline

All times in GMT unless otherwise indicated

  • May 3, 2023: 14:35 - TCS starts logging excessively due to retry loop

  • May 18, 2023: 10:30 - Additional CloudWatch costs identified via AWS cost/billing tools

  • May 18, 2023: 11:29 - The failing message was moved to a DLQ to stop the logging

  • May 19, 2023: 11:01 - Permanent fix deployed to preprod

  • May 19, 2023: 15:03 - Permanent fix deployed to prod

  • May 19, 2023: ??:?? - Permanent fix deployment to NIMDTA - DEPLOYMENT NOT YET APPROVED

Root Cause(s)

Why was there such excessive logging?

The Rabbit listener implementation did not throw the correct exception type.

Why was the correct exception not thrown?

Limited experience with RabbitMQ, an expectation that it would work similarly to AWS SQS, and a tight deadline led to misunderstandings and assumptions about how Rabbit DLQs work.
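For context on the behaviour that was misunderstood: SQS moves a message to its DLQ automatically once a redrive policy's maximum receive count is exceeded, whereas RabbitMQ only dead-letters a message when the consumer rejects it without requeueing and the queue has a dead-letter exchange configured. A rough sketch of such a queue declaration in Spring AMQP, with illustrative (not actual) exchange and queue names:

```java
import org.springframework.amqp.core.Binding;
import org.springframework.amqp.core.BindingBuilder;
import org.springframework.amqp.core.DirectExchange;
import org.springframework.amqp.core.Queue;
import org.springframework.amqp.core.QueueBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class CojQueueConfig {

  @Bean
  Queue cojSignedQueue() {
    return QueueBuilder.durable("coj-signed-queue")
        // Messages rejected without requeueing are routed to this exchange...
        .withArgument("x-dead-letter-exchange", "coj-dlx")
        // ...with this routing key, landing on the DLQ for manual review.
        .withArgument("x-dead-letter-routing-key", "coj-signed-dlq")
        .build();
  }

  @Bean
  Queue cojSignedDlq() {
    return QueueBuilder.durable("coj-signed-dlq").build();
  }

  @Bean
  DirectExchange cojDeadLetterExchange() {
    return new DirectExchange("coj-dlx");
  }

  @Bean
  Binding cojDlqBinding() {
    return BindingBuilder.bind(cojSignedDlq()).to(cojDeadLetterExchange()).with("coj-signed-dlq");
  }
}
```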

Why did it take so long to identify the issue?

Log size alerts were ignored and Sentry did not report any exceptions occurring.

Why were log size alerts ignored?

The log size alerts are too sensitive, resulting in a lot of false positives, and as a result they may have become background noise: something we never need to check because it resolves itself once the scheduled job or similar finishes.

Additionally, because the log size alarm is anomaly based, it effectively “fixed” itself: the excessive logging became the new normal, the alarm went back to “OK” status, and we saw no further alerts.

Why did Sentry not report any exceptions occurring?

Sentry is not set up for the TCS service.


Action Items

Action Item | Owner / Ticket | Status
Fix the Rabbit DLQ usage for COJ signed events | @Andy Dingley | DONE
Set up Sentry in TCS | https://hee-tis.atlassian.net/browse/TIS21-4560 | TICKET RAISED
Review log size alarm | https://hee-tis.atlassian.net/browse/TIS21-4562 | TICKET RAISED


Lessons Learned

  • Allowing an alarm to trigger consistently will lead to people subconsciously ignoring it

  • Anomaly detection doesn’t work if the anomaly is consistent

  • Don’t use a new technology for the first time when you’re already up against a deadline - go with what you know and rebuild later, or pull in knowledge from outside the team