Date

Authors

Status

Documenting

Summary

ElasticSearch’s utilization spiked and made it unresponsive to TCS’s requests

Impact

Users could not use TIS

...

Non-technical Description

ElasticSearch saw a sharp, momentary increase in utilization on Prod. As a result, TIS could not function properly: ElasticSearch was unresponsive during the spike, so requests timed out.

...

because the backing search database (ElasticSearch) was overloaded.

...

Trigger

  • The ElasticSearch cluster became overloaded.

...

Detection

  • A monitoring message on Slack at 13:57 BST reported a failed health check on TCS Blue. TIS became unusable.

...

  • 13:51 BST - CloudWatch showed a spike in memory and CPU utilisation

  • 13:57 BST - Slack notification about a FAILING Health Check on TCS Prod

  • 14:00 BST - Identified that TCS’s issue was a failing connection to ElasticSearch

  • 14:01 BST - Users noticed they were unable to use TIS, as the main screen kept updating

  • ~14:15 BST - A security update was run as a way to restart the servers (as the clusters can’t be restarted manually)

  • 14:17 BST - Slack notification about a SUCCESSFUL Health Check on TCS Prod
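
The intervals implied by the timeline can be worked out with a short sketch like the one below. The event labels (`spike`, `alert`, `restart`, `recovered`) are names chosen here for illustration, and the 14:15 entry is approximate in the timeline itself, so any figure derived from it is approximate too.

```python
from datetime import datetime

# Times (BST) copied from the timeline; the labels are illustrative only.
TIMELINE = {
    "spike": "13:51",      # CloudWatch shows the utilisation spike
    "alert": "13:57",      # first FAILING health check notification
    "restart": "14:15",    # approximate time the security update was run
    "recovered": "14:17",  # SUCCESSFUL health check notification
}


def minutes_between(start, end):
    """Return whole minutes between two same-day HH:MM times."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


time_to_detect = minutes_between(TIMELINE["spike"], TIMELINE["alert"])
time_to_recover = minutes_between(TIMELINE["alert"], TIMELINE["recovered"])
total_duration = minutes_between(TIMELINE["spike"], TIMELINE["recovered"])

if __name__ == "__main__":
    print(f"Spike to alert:    {time_to_detect} min")
    print(f"Alert to recovery: {time_to_recover} min")
    print(f"Spike to recovery: {total_duration} min")
```

On these timestamps, the spike preceded the first alert by 6 minutes, and the service recovered 20 minutes after the alert, 26 minutes after the spike began.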

Root Cause(s)

...

Action Items

Owner

Enable slow logs to figure out faulty requests

John Simmons (Deactivated)

Review cluster configuration (e.g. shards) and consider benchmark testing alternatives.

...
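
As a starting point for the slow-log action item, the sketch below prepares a request that sets slow-log thresholds through Elasticsearch's index settings endpoint (`PUT /<index>/_settings`). The endpoint URL, index name, and threshold values are placeholders chosen for illustration, not values from this incident. On a managed cluster, publishing of slow logs may also need to be enabled on the service side.

```python
import json
import urllib.request


def build_slowlog_settings(query_warn="5s", fetch_warn="1s", index_warn="5s"):
    """Return index settings that set search and indexing slow-log thresholds."""
    return {
        "index.search.slowlog.threshold.query.warn": query_warn,
        "index.search.slowlog.threshold.fetch.warn": fetch_warn,
        "index.indexing.slowlog.threshold.index.warn": index_warn,
    }


def build_settings_request(base_url, index, settings):
    """Prepare (but do not send) a PUT to the index settings endpoint."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{index}/_settings",
        data=json.dumps(settings).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


if __name__ == "__main__":
    # Hypothetical endpoint and index name, for illustration only.
    request = build_settings_request(
        "https://search.example.internal:9200",
        "example-index",
        build_slowlog_settings(),
    )
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
    # To apply for real: urllib.request.urlopen(request, timeout=10)
```

The request is only built here, not sent, so the thresholds can be reviewed before anything touches the cluster. Lower thresholds capture more requests at the cost of noisier logs.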

Lessons Learned