Date | |
Authors | |
Status | Documenting |
Summary | ElasticSearch’s utilization spiked and made it unresponsive to TCS’s requests |
Impact | Users could not use TIS |
...
Non-technical Description
ElasticSearch saw a sharp, momentary increase in utilization on Prod. As a result, TIS could not function properly: ElasticSearch was unresponsive during the spike, so requests timed out.
...
because the backing search database (ElasticSearch) was overloaded.
...
Trigger
The ElasticSearch cluster became overloaded.
...
Detection
A monitoring message on Slack at 13:57 BST reported a failed health check on TCS Blue. TIS became unusable.
...
: 13:51 BST - CloudWatch shows a spike in memory and CPU utilisation
: 13:57 BST - Slack notification about a FAILING Health Check on TCS Prod
: 14:00 BST - Identified that TCS’s issue was a failing connection to ElasticSearch
: 14:01 BST - Users noticed they were unable to use TIS, as the main screen kept updating
: ~14:15 BST - A security update was run as a way to restart the servers (as the clusters can’t be restarted manually)
: 14:17 BST - Slack notification about a SUCCESSFUL Health Check on TCS Prod
Root Cause(s)
...
Action Items | Owner | |
---|---|---|
Enable slow logs to figure out faulty requests | ||
Review cluster configuration (e.g. shards) and consider benchmark testing alternatives. | ||
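
As a starting point for the slow-log action item, here is a minimal sketch of what enabling slow logs on one index could look like. It only builds the `PUT /<index>/_settings` request and does not send it. The function names, the base URL, the index name `example-index` and the threshold values are all assumptions chosen for illustration, not values taken from this incident. The setting keys are the standard Elasticsearch search and indexing slow-log thresholds. If the cluster is a managed AWS domain (the timeline mentions CloudWatch), publishing slow logs to CloudWatch Logs may also need to be enabled on the domain itself; that step is not shown here.

```python
import json
import urllib.request


def build_slowlog_settings(query_warn="5s", query_info="2s",
                           fetch_warn="1s", index_warn="5s"):
    """Return an index-settings body that turns on slow logging.

    The threshold values are illustrative assumptions; tune them to the
    latencies that count as "slow" for the workload being investigated.
    """
    return {
        "index.search.slowlog.threshold.query.warn": query_warn,
        "index.search.slowlog.threshold.query.info": query_info,
        "index.search.slowlog.threshold.fetch.warn": fetch_warn,
        "index.indexing.slowlog.threshold.index.warn": index_warn,
    }


def build_settings_request(base_url, index, settings):
    """Build (but do not send) the PUT /<index>/_settings request."""
    url = f"{base_url.rstrip('/')}/{index}/_settings"
    body = json.dumps(settings).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    # Hypothetical endpoint and index name, for illustration only.
    req = build_settings_request(
        "http://localhost:9200", "example-index", build_slowlog_settings()
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Building the request separately from sending it lets the thresholds be reviewed, and tried against a non-production cluster first, before anything is applied to Prod.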
...