Date

Authors

Andy Dingley

Status

Done

Summary

The TCS service was repeatedly trying and failing to process an event, causing a large volume of additional log entries.

Impact

A significant increase in monthly logging costs: usually ~$300/month, they had already reached ~$1300 part way through the month.

...

The TIS services log a record of their actions to text files, which can be used when needed to investigate any issues encountered.

A new feature was added to one of our services. It listens for events occurring in TIS Self-Service (a Conditions of Joining being signed) and updates data within TIS to stay in sync. One of the events sent from our test system could not be processed correctly by TIS. Such cases are not uncommon; the expected behaviour is that the event is sent to a separate queue for manual review. In this case, a configuration issue meant that TIS repeatedly retried the message instead of rejecting it.
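The difference between the expected and actual behaviour can be illustrated with a minimal in-memory sketch. It is not the TCS implementation: the queue, the `consume` function, the `UnprocessableEvent` exception and the `signedAt` field are all hypothetical names chosen for illustration. It shows a consumer that sends an unprocessable event straight to a dead-letter queue, and caps retries for other failures so that no event can loop indefinitely.

```python
from collections import deque


class UnprocessableEvent(Exception):
    """Hypothetical error: the event can never succeed, so retrying is pointless."""


def consume(queue, dead_letter_queue, handler, max_attempts=3):
    """Drain `queue`, retrying other failures at most `max_attempts` times.

    Returns the number of handler invocations, a rough proxy for log volume.
    """
    invocations = 0
    while queue:
        event, attempts = queue.popleft()
        invocations += 1
        try:
            handler(event)
        except UnprocessableEvent:
            # Reject immediately: park the event for manual review.
            dead_letter_queue.append(event)
        except Exception:
            if attempts + 1 >= max_attempts:
                # Retry budget exhausted: stop retrying and park the event.
                dead_letter_queue.append(event)
            else:
                queue.append((event, attempts + 1))
    return invocations


def handle_signed_event(event):
    # Hypothetical validation: an event without a signing time cannot be synced.
    if event.get("signedAt") is None:
        raise UnprocessableEvent("missing signedAt")


def always_failing(event):
    # Stands in for a failure that might succeed on retry but never does here.
    raise RuntimeError("downstream unavailable")


events = deque([
    ({"id": 1, "signedAt": "2023-01-01"}, 0),
    ({"id": 2, "signedAt": None}, 0),
])
dlq = []
calls = consume(events, dlq, handle_signed_event)

retry_dlq = []
retry_calls = consume(deque([({"id": 3}, 0)]), retry_dlq, always_failing)
```

In this sketch the bad event is handled once and parked, and the always-failing event is attempted three times before being parked. Without the immediate rejection or the retry cap, the same event would be retried, and logged, without limit.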

...

  • : 14:35 - TCS starts logging excessively due to retry loop

  • : 10:30 - Additional CloudWatch costs identified via AWS cost/billing tools

  • : 11:29 - The failing message was moved to a DLQ to stop the logging

  • : 11:01 - Permanent fix deployed to preprod

  • : 15:03 - Permanent fix deployed to prod

  • : ??:?? - Permanent fix deployment to NIMDTA - DEPLOY NOT YET APPROVED

Root Cause(s)

Why was there such excessive logging?

...