Skip to end of metadata
Go to start of metadata

You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 6 Next »

Date

Authors

Status

Documenting

Summary

ElasticSearch’s utilization spiked and made it unresponsive to TCS’s requests

Impact

Users cannot use TIS

Non-technical Description

ElasticSearch saw a sharp momentary increase in utilization on Prod. TIS could not function properly as a result, as ElasticSearch’s unresponsiveness during the spike made requests timeout.


Trigger


Detection


Resolution

  • Running a security update on the ElasticSearch cluster restarted the servers.


Timeline

  • : 13:51 BST - CloudWatch shows a spike in memory and CPU utilisation

  • : 13:57 BST - Slack notification about a FAILING Health Check on TCS Prod

  • : 14:00 BST - Identified that TCS’s issue regarded a failing connection to ElasticSearch

  • : 14:01 BST - Users noticed being unable to use TIS, as the main screen keeps updating

  • : 14:15 BST~ish - A security update’s been run as a way to restart the servers (as they clusters can’t be restarted manually)

  • : 14:17 BST - Slack notification about a SUCCESSFUL Health Check on TCS Prod

Root Cause(s)


Action Items

Action Items

Owner


Lessons Learned

  • No labels