Date | |
Authors | |
Status | Documenting |
Summary | ElasticSearch’s utilisation spiked and made it unresponsive to TCS’s requests |
Impact | Users cannot use TIS |
...
: 13:51 BST - CloudWatch shows a spike in memory and CPU utilisation
: 13:57 BST - Slack notification about a FAILING Health Check on TCS Prod
: 14:00 BST - Identified that TCS’s issue was a failing connection to ElasticSearch (see the probe sketch after this timeline)
: 14:01 BST - Users reported being unable to use TIS, as the main screen kept refreshing
: ~14:15 BST - A security update was run as a way to restart the servers (as the clusters can’t be restarted manually)
: 14:17 BST - Slack notification about a SUCCESSFUL Health Check on TCS Prod
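As a point of reference, the kind of probe used to confirm whether the cluster was responsive when the TCS health check started failing could look roughly like the sketch below. The endpoint URL is a placeholder, not the real domain; `_cluster/health` is the standard ElasticSearch health API.

```python
# Minimal cluster-health probe, assuming a hypothetical endpoint.
import requests

ES_ENDPOINT = "https://example-es-domain:9200"  # placeholder, not the real domain

try:
    resp = requests.get(f"{ES_ENDPOINT}/_cluster/health", timeout=5)
    resp.raise_for_status()
    health = resp.json()
    # "status" is green / yellow / red; a timeout here would match the
    # unresponsive behaviour seen during the incident.
    print(health["status"], health["number_of_nodes"])
except requests.exceptions.RequestException as exc:
    print(f"ElasticSearch unreachable: {exc}")
```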
Root Cause(s)
...
Action Items
Action Items | Owner | Status |
---|---|---|
Enable slow logs to identify faulty requests (see the sketch below this table) | John Simmons (Deactivated) | |
Review cluster configuration (e.g. shards) and consider benchmark testing alternatives | | |
Investigate any spikes in TCS logging | | |
Investigate spikes in bulk upload | | |
Investigate spikes in sync service | | |
Investigate how to restart clusters without having to do an update | | |
Reconvene at 3.30 to share findings | | |
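The slow-log action item could look roughly like the following. This is a sketch, not the agreed implementation: the endpoint, index name and thresholds are assumptions, and on an AWS-managed domain the slow logs also need to be published to CloudWatch Logs before they are visible.

```python
# Sketch: turn on search slow logs for one index via the index settings API.
# The endpoint, index name and threshold values are illustrative assumptions.
import requests

ES_ENDPOINT = "https://example-es-domain:9200"  # placeholder
INDEX = "example-index"                         # placeholder

slowlog_settings = {
    "index.search.slowlog.threshold.query.warn": "10s",
    "index.search.slowlog.threshold.query.info": "5s",
    "index.search.slowlog.threshold.fetch.warn": "1s",
}

resp = requests.put(f"{ES_ENDPOINT}/{INDEX}/_settings", json=slowlog_settings, timeout=10)
resp.raise_for_status()
print(resp.json())  # expect {"acknowledged": true}

# Note: on an AWS-managed domain the slow logs also have to be published to
# CloudWatch Logs (domain log publishing options) before they can be read.
```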
Possible follow ups
Review cluster configuration (e.g. shards) and consider benchmark testing alternatives (see the shard-listing sketch after this list)
...
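A starting point for the cluster configuration review could be a dump of the current shard layout and sizes, roughly as sketched below. The endpoint is a placeholder; `_cat/shards` is a standard ElasticSearch API.

```python
# Sketch: list shard layout and sizes as input to the configuration review.
import requests

ES_ENDPOINT = "https://example-es-domain:9200"  # placeholder

resp = requests.get(
    f"{ES_ENDPOINT}/_cat/shards",
    params={"h": "index,shard,prirep,state,docs,store,node", "format": "json"},
    timeout=10,
)
resp.raise_for_status()
for shard in resp.json():
    print(shard)
```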
Lessons Learned
We hardly use ES. Should we review how (and whether) we use ES going forward, or whether there is a more efficient way of handling what ES currently handles?
When there’s no logging on a service and it fails, it’s hard work doing an RCA!