Date	16 Jun 2021
Authors
Status	Documenting
Summary	ElasticSearch’s utilization spiked and made it unresponsive to TCS’s requests
Impact	Users could not use TIS

Non-technical Description

ElasticSearch saw a sharp, momentary increase in utilization on Prod. ElasticSearch was unresponsive during the spike, which caused requests to time out, so TIS could not function properly.

...

because the backing search database (ElasticSearch) was overloaded.

...

Trigger

The ElasticSearch cluster became overloaded.

...

Detection

A monitoring message on Slack at 13:57 BST reported a failed health check on TCS Blue. TIS became unusable.

...

16 Jun 2021: 13:51 BST - CloudWatch shows a spike in memory and CPU utilisation
16 Jun 2021: 13:57 BST - Slack notification about a FAILING Health Check on TCS Prod
16 Jun 2021: 14:00 BST - Identified that TCS’s issue was a failing connection to ElasticSearch
16 Jun 2021: 14:01 BST - Users noticed they were unable to use TIS, as the main screen kept updating
16 Jun 2021: ~14:15 BST - A security update was run as a way to restart the servers (as the clusters can’t be restarted manually)
16 Jun 2021: 14:17 BST - Slack notification about a SUCCESSFUL Health Check on TCS Prod

Root Cause(s)

...

Action Items	Owner
Enable slow logs to identify faulty requests	John Simmons (Deactivated)
Review cluster configuration (e.g. shards) and consider benchmark testing alternatives.
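
As a rough illustration of the first action item, the sketch below builds the kind of index settings request that turns on Elasticsearch slow logs. The setting keys are standard Elasticsearch index settings, but the thresholds, the base URL and the index name `example-index` are placeholders assumed for illustration, not values taken from this incident. On a managed cluster, publishing the slow logs may also need to be enabled separately at the service level.

```python
import json
import urllib.request


def slowlog_settings(warn="10s", info="5s"):
    """Build an index-settings body enabling search and indexing slow logs.

    The threshold values are illustrative assumptions; tune them per index.
    """
    return {
        "index.search.slowlog.threshold.query.warn": warn,
        "index.search.slowlog.threshold.query.info": info,
        "index.search.slowlog.threshold.fetch.warn": warn,
        "index.indexing.slowlog.threshold.index.warn": warn,
    }


def build_request(base_url, index, settings):
    """Prepare (but do not send) a PUT /<index>/_settings request."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{index}/_settings",
        data=json.dumps(settings).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


if __name__ == "__main__":
    # Hypothetical endpoint and index name, for demonstration only.
    req = build_request("http://localhost:9200", "example-index", slowlog_settings())
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To apply the settings: urllib.request.urlopen(req, timeout=10)
```

The request is only constructed here, not sent, so the payload can be reviewed before it is applied to a production index.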

...
