Date | |
Authors | |
Status | Documenting |
Summary | ElasticSearch’s utilisation spiked and made it unresponsive to TCS’s requests |
Impact | Users cannot use TIS |
...
: 13:51 BST - CloudWatch shows a spike in memory and CPU utilisation
: 13:57 BST - Slack notification about a FAILING Health Check on TCS Prod
: 14:00 BST - Identified that TCS’s issue was a failing connection to ElasticSearch (see the probe sketch after this timeline)
: 14:01 BST - Users reported being unable to use TIS, as the main screen kept refreshing
: ~14:15 BST - A security update was run as a way to restart the servers (as the clusters can’t be restarted manually)
: 14:17 BST - Slack notification about a SUCCESSFUL Health Check on TCS Prod
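As a point of reference, the kind of probe used to confirm whether the cluster was responsive when the TCS health check started failing could look roughly like the sketch below. The endpoint URL is a placeholder, not the real domain; `_cluster/health` is the standard ElasticSearch health API.

```python
# Minimal cluster-health probe, assuming a hypothetical endpoint.
import requests

ES_ENDPOINT = "https://example-es-domain:9200"  # placeholder, not the real domain

try:
    resp = requests.get(f"{ES_ENDPOINT}/_cluster/health", timeout=5)
    resp.raise_for_status()
    health = resp.json()
    # "status" is green / yellow / red; a timeout here would match the
    # unresponsive behaviour seen during the incident.
    print(health["status"], health["number_of_nodes"])
except requests.exceptions.RequestException as exc:
    print(f"ElasticSearch unreachable: {exc}")
```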
Root Cause(s)
...
Action Items
Action Items | Owner | Status |
---|---|---|
Enable slow logs to identify faulty requests (see the sketch below this table) | John Simmons (Deactivated) | |
Review cluster configuration (e.g. shards) and consider benchmark testing alternatives | | |
Investigate any spikes in TCS logging | | |
Investigate spikes in bulk upload | | |
Investigate spikes in sync service | | |
Investigate how to restart clusters without having to do an update | | |
Reconvene at 3.30 to share findings | | |
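The slow-log action item could look roughly like the following. This is a sketch, not the agreed implementation: the endpoint, index name and thresholds are assumptions, and on an AWS-managed domain the slow logs also need to be published to CloudWatch Logs before they are visible.

```python
# Sketch: turn on search slow logs for one index via the index settings API.
# The endpoint, index name and threshold values are illustrative assumptions.
import requests

ES_ENDPOINT = "https://example-es-domain:9200"  # placeholder
INDEX = "example-index"                         # placeholder

slowlog_settings = {
    "index.search.slowlog.threshold.query.warn": "10s",
    "index.search.slowlog.threshold.query.info": "5s",
    "index.search.slowlog.threshold.fetch.warn": "1s",
}

resp = requests.put(f"{ES_ENDPOINT}/{INDEX}/_settings", json=slowlog_settings, timeout=10)
resp.raise_for_status()
print(resp.json())  # expect {"acknowledged": true}

# Note: on an AWS-managed domain the slow logs also have to be published to
# CloudWatch Logs (domain log publishing options) before they can be read.
```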
Possible follow ups
Review cluster configuration (e.g. shards) and consider benchmark testing alternatives (see the shard-listing sketch after this list)
...
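A starting point for the cluster configuration review could be a dump of the current shard layout and sizes, roughly as sketched below. The endpoint is a placeholder; `_cat/shards` is a standard ElasticSearch API.

```python
# Sketch: list shard layout and sizes as input to the configuration review.
import requests

ES_ENDPOINT = "https://example-es-domain:9200"  # placeholder

resp = requests.get(
    f"{ES_ENDPOINT}/_cat/shards",
    params={"h": "index,shard,prirep,state,docs,store,node", "format": "json"},
    timeout=10,
)
resp.raise_for_status()
for shard in resp.json():
    print(shard)
```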
Lessons Learned
We hardly use ES. Should we review how (and whether) we use ES going forward, or whether there is a more efficient way of handling what ES currently handles?
When there’s no logging on a service and it fails, it’s hard work doing an RCA!