
Date

Authors

John Simmons (Deactivated) Doris.Wong Andy Dingley

Status

Done

Summary

Logging has been misconfigured in some services, generating far too many logs.

Impact

The excess logs use too much storage space and therefore increase costs, specifically CloudWatch costs.

...

Non-technical Description

  • Logging is a good thing: it helps us pinpoint problems occurring with any of our services, and therefore helps us fix them as quickly as possible. However, we’re generating too much logging on some of our services, resulting in the logs (text files) taking up too much space on CloudWatch and incurring unnecessary costs.


What is logging?

Logging is the process of outputting information about events happening within our applications, such as data that is created or updated and what requests are being made for that data. It allows us to check the health of the application and gives insight into what may have caused any errors or issues users experience with the application.

We store text based copies of our logs so that we can refer to them whenever we need to see what happened at a particular point in time.

What do we log?

We always want to log important events like errors or other unexpected occurrences. We also have more in-depth logging that tells us what is happening in different parts of the application. We don’t need to see that in-depth information all of the time; it is usually turned on when we don’t have enough information about a particular issue, and then turned off afterwards. Leaving it on would produce a lot of “noise” and irrelevant information in the logs, taking focus away from the more important events, like errors.

What happened?

Some parts of our application were logging a large volume of in-depth information by default. This was caused by a combination of two main issues:

  1. Misconfigured logging - output that should have been disabled by default was left enabled.

  2. Badly handled invalid data - we didn’t exclude invalid data, so processing it caused a failure. The failed data was then retried, repeating the full process continuously (including outputting the misconfigured in-depth logging each time).

This very quickly amounted to extremely large log files, roughly equivalent to all of the text on Wikipedia every 12 hours!

What was the impact?

We are billed depending on the amount of data we log (and how long we keep it for), so this large spike in usage has meant that we have incurred additional, unnecessary costs in our monthly bill.

What have we done to put it right?

  • Disabled the excessive logging.

  • Resolved the mishandling of the invalid data.

  • Set up alerts for future billing spikes.

...

Trigger

  • tis-revalidation-connections started sending excessive amounts of log data to CloudWatch.

...

Detection

  • Ad-hoc review of CloudWatch.

...

  • Re-configured logging on several services.

  • One service produced errors of Cannot delete entity with id 'null'. This occurred when the gmcReferenceNumber of a trainee received from RabbitMQ was 'null', but the service still attempted to delete the corresponding record from Elasticsearch.

  • We can change to not using gmcReferenceNumber as the ID, as it can be ‘null’.

  • We should also add explicit code for handling ‘null’ values.

  • Resolved the error caused by null values of gmcReferenceNumber being passed to Elasticsearch.

  • Disabled tis-revalidation-connections CloudWatch logging until the log output can be fully reviewed.
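The null-handling fix described above can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the actual service code: the class and method names (ConnectionDeleteHandler, hasValidId, handleDelete) and the commented-out repository call are hypothetical.

```java
// Sketch of guarding Elasticsearch deletes against a missing ID.
// All names here are illustrative, not the real
// tis-revalidation-connections code.
public class ConnectionDeleteHandler {

  /** Returns true only when the message carries a usable ID. */
  static boolean hasValidId(String gmcReferenceNumber) {
    return gmcReferenceNumber != null
        && !gmcReferenceNumber.isBlank()
        // The incoming message may carry the literal string "null".
        && !"null".equalsIgnoreCase(gmcReferenceNumber);
  }

  /** Skips records without a valid ID instead of failing. */
  void handleDelete(String gmcReferenceNumber) {
    if (!hasValidId(gmcReferenceNumber)) {
      // Drop the message rather than throwing, so the broker
      // does not redeliver it in an endless retry loop.
      return;
    }
    // elasticsearchRepository.deleteById(gmcReferenceNumber);
  }
}
```

Dropping (or dead-lettering) the invalid message, rather than throwing, is what breaks the continuous retry loop that amplified the logging.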

...

Timeline

  • : 17:00 - Noticed the over-sized logging when doing an ad-hoc review of CloudWatch.

  • : 10:00 - Raised issue in Stand up where it was classed as a LiveDefect (although no users were affected, it is a current issue with an impact on monthly costs).

  • : 11:00 - Identified fixes that could be put in place within one of the services.

  • : 11:10 - Started implementing the fixes...

  • : 14:12 - Fix deployed moving Elasticsearch query logging to the DEBUG level (and handling null values).

  • : 15:10 - Fix deployed to remove Spring Boot debug logging.

  • : 13:00 - Change deployed to disable CloudWatch logging for tis-revalidation-connections.
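The logging fixes in the timeline amount to configuration changes of this general shape. This is a hedged sketch of a Spring Boot application.yml fragment; the logger names shown are common Spring defaults and may not match the exact loggers in the affected services:

```yaml
# Illustrative Spring Boot logging configuration (logger names are
# examples, not necessarily those used in the affected services).
logging:
  level:
    # Keep the default threshold at INFO so DEBUG-level query
    # tracing is suppressed unless explicitly enabled.
    root: INFO
    # In-depth framework output stays off by default; raise to
    # DEBUG temporarily only while investigating an issue.
    org.springframework.boot: INFO
    org.springframework.data.elasticsearch: INFO
```

With statements moved to the DEBUG level, they only reach CloudWatch when someone deliberately lowers the threshold, which matches the “turn on, investigate, turn off” practice described earlier.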

Root Cause(s)

  • Misconfiguration of logs on some services.

...

Action Items

  • Change from using gmcReferenceNumber as the ID, as it can be ‘null’. (Owner: Doris.Wong)

  • Add explicit code for handling ‘null’ values. (Owner: Doris.Wong)

  • Monitoring on CloudWatch to send alerts based on log size and/or a cost threshold (£11,500 / £12,000, or monthly average + 10%)? (Owner: John Simmons (Deactivated))

https://hee-tis.atlassian.net/browse/TIS21-1384

  • Re-enable tis-revalidation-connections CloudWatch logging (recreate log groups).

https://hee-tis.atlassian.net/browse/TIS21-1382

  • Dashboard to display storage quotas (S3, CloudWatch logs, EFS, EBS).

https://hee-tis.atlassian.net/browse/TIS21-1383
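The billing-alert action item could be implemented as a CloudWatch alarm, for example via CloudFormation. This is an assumption-laden sketch: the resource and topic names are hypothetical, AWS reports the EstimatedCharges metric in USD (the £ thresholds above would need converting), and billing metrics are only published to us-east-1 once billing alerts are enabled on the account:

```yaml
# Illustrative CloudFormation fragment for a billing alarm.
# Names (BillingAlarm, AlertTopic) are hypothetical.
BillingAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: monthly-estimated-charges
    Namespace: AWS/Billing
    MetricName: EstimatedCharges
    Dimensions:
      - Name: Currency
        Value: USD
    Statistic: Maximum
    Period: 21600          # evaluate every 6 hours
    EvaluationPeriods: 1
    Threshold: 11500       # USD; adjust to the agreed £ equivalent
    ComparisonOperator: GreaterThanThreshold
    AlarmActions:
      - !Ref AlertTopic    # e.g. an SNS topic that emails the team
```

For alerting on log volume directly (rather than cost), a similar alarm on the AWS/Logs IncomingBytes metric per log group would catch a runaway service sooner.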

...

Lessons Learned

  • Exercise caution when configuring logging, and test that the generated logs contain only what is intended.

  • Potentially share with the team how logging configuration correlates with log sizes (over time) and costs.

  • Monitoring on CloudWatch would be useful to highlight these kinds of issues before they result in high costs