Date:

Authors:

Status: Documenting

Summary: Elasticsearch's utilization spiked and made it unresponsive to TCS's requests

Impact: Users cannot use TIS

...

  • 13:51 BST - CloudWatch shows a spike in memory and CPU utilisation

  • 13:57 BST - Slack notification about a FAILING Health Check on TCS Prod

  • 14:00 BST - Identified that TCS's issue was a failing connection to Elasticsearch (a health-check sketch follows this timeline)

  • 14:01 BST - Users noticed being unable to use TIS, as the main screen kept refreshing

  • ~14:15 BST - A security update was run as a way to restart the servers (as the clusters can't be restarted manually)

  • 14:17 BST - Slack notification about a SUCCESSFUL Health Check on TCS Prod
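
For context on the 14:00 finding, below is a minimal sketch of the kind of check that confirms whether the cluster is still responding and under pressure; the endpoint URL is an assumption, not the real TIS Elasticsearch domain, and in practice the CloudWatch metrics told the same story.

    # Sketch only: the endpoint URL is an assumption, not the real Elasticsearch domain.
    import requests

    ES_URL = "https://example-es-domain:9200"  # hypothetical endpoint

    # Cluster-level health: status, node count, unassigned shards.
    health = requests.get(f"{ES_URL}/_cluster/health", timeout=5).json()
    print(health["status"], health["number_of_nodes"], health["unassigned_shards"])

    # Per-node CPU and JVM heap pressure, roughly what the CloudWatch spike reflected.
    stats = requests.get(f"{ES_URL}/_nodes/stats/os,jvm", timeout=5).json()
    for node in stats["nodes"].values():
        print(node["name"], node["os"]["cpu"]["percent"], node["jvm"]["mem"]["heap_used_percent"])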

Root Cause(s)

...

Action Items

  • Enable slow logs to figure out faulty requests - Owner: John Simmons (Deactivated) (a settings sketch follows this section)

  • Investigate any spikes in TCS logging - Owner: Reuben Roberts

  • Investigate spikes in bulk upload - Owner: Joseph (Pepe) Kelly

  • Investigate spikes in the sync service - Owner: Marcello Fabbri (Unlicensed)

  • Investigate how to restart clusters without having to run an update

  • Reconvene at 3:30 to share findings

Possible follow ups

  • Review cluster configuration (e.g. shards) and consider benchmark testing alternatives
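
As a starting point for the slow-logs and cluster-configuration items, here is a minimal sketch of the kind of settings change and shard overview involved; the endpoint URL, index name, and thresholds are assumptions, not values from this incident.

    # Sketch only: endpoint, index name and thresholds are assumptions, not incident values.
    import requests

    ES_URL = "https://example-es-domain:9200"  # hypothetical Elasticsearch endpoint
    INDEX = "example-index"                    # hypothetical index name

    # Dynamic index settings that turn on search/indexing slow logs above the given thresholds.
    slowlog_settings = {
        "index.search.slowlog.threshold.query.warn": "5s",
        "index.search.slowlog.threshold.fetch.warn": "1s",
        "index.indexing.slowlog.threshold.index.warn": "5s",
    }
    requests.put(f"{ES_URL}/{INDEX}/_settings", json=slowlog_settings, timeout=10).raise_for_status()

    # Quick shard overview to feed the cluster configuration review.
    print(requests.get(f"{ES_URL}/_cat/shards?v", timeout=10).text)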

...

Lessons Learned

  • We hardly use ES: should we review how (or whether) we use ES going forward, or whether there is a more efficient way of handling what ES currently handles?

  • When a service has no logging and it fails, doing an RCA is hard work!