Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Date

Authors

Status

Documenting

Summary

ElasticSearch’s utilization spiked and made it unresponsive to TCS’s requests

Impact

Users cannot use TIS

...

Trigger

...

Detection

...

Resolution

  • Running a security update on the ElasticSearch cluster restarted the servers.

...

Timeline

  • : 13:51 BST - CloudWatch shows a spike in memory and CPU utilisation

  • : 13:57 BST - Slack notification about a FAILING Health Check on TCS Prod

  • : 14:00 BST - Identified that TCS’s issue regarded a failing connection to ElasticSearch

  • : 14:01 BST - Users noticed being unable to use TIS, as the main screen keeps updating

  • : 14:15 BST~ish - A security update’s been run as a way to restart the servers (as they clusters can’t be restarted manually)

  • : 14:17 BST - Slack notification about a SUCCESSFUL Health Check on TCS Prod

Root Cause(s)

...