Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Date

Authors

Reuben Roberts, Joseph (Pepe) Kelly, John Simmons (Deactivated), Marcello Fabbri (Unlicensed), Doris.Wong, Cai Willis

Status

Investigating / Documenting

Summary

ElasticSearch’s utilization spiked and made it unresponsive to TCS’s requests

Impact

Users couldn’t use TIS for a period of 20mins or so.

...

  • : 13:51 BST - CloudWatch shows a spike in memory and CPU utilisation

  • : 13:57 BST - Slack notification about a FAILING Health Check on TCS Prod

  • : 14:00 BST - Identified that TCS’s issue regarded a failing connection to ElasticSearch

  • : 14:01 BST - Users noticed being unable to use TIS, as the main screen keeps updating

  • : 14:15 BST~ish - A security update’s been run as a way to restart the servers (as they clusters can’t be restarted manually)

  • : 14:17 BST - Slack notification about a SUCCESSFUL Health Check on TCS Prod

Root Cause(s)

...