Date |
|
Authors |
|
Status | Documenting |
Summary | ElasticSearch’s utilization spiked and made it unresponsive to TCS’s requests |
Impact | Users could not use TIS |
Non-technical Description
ElasticSearch saw a sharp, momentary increase in utilization on Prod. Because ElasticSearch was unresponsive during the spike, requests to it timed out and TIS could not function properly.
Trigger
Detection
Resolution
Running a security update on the ElasticSearch cluster restarted the servers.
Timeline
: 13:51 BST - CloudWatch shows a spike in memory and CPU utilisation
: 13:57 BST - Slack notification about a FAILING Health Check on TCS Prod
: 14:00 BST - Identified that TCS’s issue was a failing connection to ElasticSearch
: 14:01 BST - Users noticed they were unable to use TIS, as the main screen kept updating
: ~14:15 BST - A security update was run as a way to restart the servers (as the clusters can’t be restarted manually)
: 14:17 BST - Slack notification about a SUCCESSFUL Health Check on TCS Prod
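The failure mode in the timeline above (a dependent service's health check failing because ElasticSearch stopped answering in time) can be illustrated with a minimal sketch. This is not TCS's actual health check: the URL, the function names and the 2-second timeout are assumptions made for illustration. The only real API used is ElasticSearch's `_cluster/health` endpoint, which returns a JSON body with a `status` of `green`, `yellow` or `red`.

```python
import http.client
import json
import urllib.error
import urllib.request

# Hypothetical endpoint; the real cluster URL is not given in this report.
ES_HEALTH_URL = "http://localhost:9200/_cluster/health"


def classify_health(body):
    """Map a _cluster/health JSON body (str or bytes) to "UP" or "DOWN"."""
    try:
        status = json.loads(body).get("status")
    except (ValueError, AttributeError):
        # Unparseable or unexpected payload: treat as unhealthy.
        return "DOWN"
    # "green" and "yellow" both mean the cluster is able to serve requests.
    return "UP" if status in ("green", "yellow") else "DOWN"


def check_elasticsearch(url=ES_HEALTH_URL, timeout_s=2.0):
    """Return "UP" or "DOWN"; a timeout or connection error counts as DOWN."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return classify_health(resp.read())
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Covers refused connections, HTTP errors and socket timeouts.
        return "DOWN"
```

With an explicit timeout, an unresponsive cluster surfaces as a fast `DOWN` result rather than as requests that hang until the caller gives up.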
Root Cause(s)
Action Items
Action Items | Owner | |
---|---|---|
| ||
| ||