Date | 14 Jun 2021 |
---|---|
Authors | Reuben Roberts, Joseph (Pepe) Kelly, John Simmons (Deactivated), Marcello Fabbri (Unlicensed), Doris.Wong, Cai Willis |
Status | Not resolved. The LiveDefect is done; the investigation is incomplete due to insufficient logging. |
Summary | ElasticSearch's utilisation spiked and made it unresponsive to TCS's requests |
Impact | Users couldn't use TIS for around 20 minutes |
Non-technical Description
TIS could not function properly because the backing search database (ElasticSearch) was overloaded.
Logging on ElasticSearch was not enabled (it is now), and this is where developers would normally go first to diagnose problems.
In the absence of logging on ES, we checked TCS, bulk upload, Reval and the Sync services (the services that interact with ES), but found nothing that had obviously caused the issue.
The immediate LiveDefect has been resolved. While we haven't been able to put anything in place to prevent it recurring, we expect the logging now enabled will allow us to identify and resolve any future recurrence.
...
Trigger
The ElasticSearch cluster became overloaded.
...
Detection
A monitoring message on Slack at 13:57 BST reported a failed health check on TCS Blue. TIS became unusable.
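For context, the alerting path is a health check that posts to Slack when it fails. The sketch below only illustrates that pattern; the health-check endpoint, webhook URL and message text are placeholders, not the actual monitoring configuration.

```python
# Sketch: poll a service health endpoint and post to Slack on failure.
# Both URLs are placeholders; this illustrates the alerting pattern only.
import requests

HEALTH_URL = "https://tcs.example.com/actuator/health"             # placeholder health check
SLACK_WEBHOOK = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder webhook


def check_and_alert():
    try:
        resp = requests.get(HEALTH_URL, timeout=10)
        healthy = resp.ok and resp.json().get("status") == "UP"
    except requests.RequestException:
        healthy = False

    if not healthy:
        requests.post(
            SLACK_WEBHOOK,
            json={"text": "FAILING Health Check on TCS Prod"},
            timeout=10,
        )


if __name__ == "__main__":
    check_and_alert()
```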
...
Resolution
Running a security update on the ElasticSearch cluster restarted the servers.
...
Timeline
13:51 BST - CloudWatch shows a spike in memory and CPU utilisation
13:57 BST - Slack notification about a FAILING Health Check on TCS Prod
14:00 BST - Identified that TCS's issue was a failing connection to ElasticSearch
14:01 BST - Users report being unable to use TIS, with the main screen constantly refreshing
~14:15 BST - A security update was run as a way of restarting the servers (the clusters can't be restarted manually)
14:17 BST - Slack notification about a SUCCESSFUL Health Check on TCS Prod
Root Cause(s)
CloudWatch showed a spike in memory and CPU utilisation.
WHY #1
Memory spike | CPU spike |
---|---|
| ES was being over-utilised. Given it normally idles away at around 10-15%, it would take something REALLY BIG to push it up to 100% CPU utilisation across 3 nodes. We had previously had issues with the number of servers in the cluster and settled on 3. Might there be something wrong with the elected master? Could any routine have been triggered (even by accident) that loaded in a large amount of data and caused the problem? |

What services are heavily associated with ES? Can we investigate each and discount them as culprits (in the absence of logging)?
Without ES logging, WHYs #2-5 cannot be answered.
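For reference, the utilisation spike can be pulled out of CloudWatch retrospectively to bound the incident window. A minimal sketch using boto3, assuming the cluster is an Amazon Elasticsearch Service domain; the region, domain name and account ID below are placeholders, not the real TIS values.

```python
# Sketch: pull CPU utilisation for the ES domain around the incident window (13:51 BST = 12:51 UTC).
# Region, domain name and account ID are placeholders.
from datetime import datetime, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-2")  # placeholder region

response = cloudwatch.get_metric_statistics(
    Namespace="AWS/ES",                      # Amazon Elasticsearch Service metrics
    MetricName="CPUUtilization",
    Dimensions=[
        {"Name": "DomainName", "Value": "tis-elasticsearch"},  # placeholder domain name
        {"Name": "ClientId", "Value": "123456789012"},         # placeholder account ID
    ],
    StartTime=datetime(2021, 6, 14, 12, 30, tzinfo=timezone.utc),
    EndTime=datetime(2021, 6, 14, 13, 30, tzinfo=timezone.utc),
    Period=60,                               # one-minute datapoints
    Statistics=["Maximum"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Maximum"])
```

The same call with MetricName="JVMMemoryPressure" covers the memory side of the spike.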
Action Items
Action Items | Owner | Status |
---|---|---|
Enable slow logs to figure out faulty requests (see sketch below) | | In place |
Investigate any spikes in TCS logging | | Nothing significant to report |
Investigate spikes in bulk upload | | |
Investigate spikes in the sync service | | |
Check whether AWS have notified us of any changes | | TBC |
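For the slow-log action item: ElasticSearch's search and indexing slow logs are enabled per index through the _settings API. A minimal sketch, assuming direct access to the cluster's REST endpoint; the endpoint, index name and thresholds are illustrative placeholders rather than the values applied to our cluster.

```python
# Sketch: enable search and indexing slow logs on an index via the _settings API.
# The endpoint, index name and thresholds are illustrative placeholders.
import requests

ES_ENDPOINT = "https://your-es-endpoint:9200"   # placeholder
INDEX = "persons"                               # placeholder index name

settings = {
    "index.search.slowlog.threshold.query.warn": "10s",
    "index.search.slowlog.threshold.query.info": "5s",
    "index.search.slowlog.threshold.fetch.warn": "1s",
    "index.indexing.slowlog.threshold.index.warn": "10s",
}

resp = requests.put(f"{ES_ENDPOINT}/{INDEX}/_settings", json=settings, timeout=30)
resp.raise_for_status()
print(resp.json())   # expect {"acknowledged": true}
```

On the managed AWS service, the domain's log publishing options also need to route these slow logs to a CloudWatch Logs group before they appear anywhere.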
Possible follow ups
Follow up | Notes |
---|---|
Review cluster configuration (e.g. shards) and consider benchmark testing alternatives (see sketch below) | Might be worthwhile, but it could be a lot of effort; it should be coordinated and probably time-boxed. |
Investigate how to restart clusters without having to do an update | |
Apply auto-tuning to the cluster | |
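As a cheap starting point for the cluster-configuration review, the current index, shard and node layout can be dumped with the _cat APIs. A minimal sketch; the endpoint is a placeholder.

```python
# Sketch: list current indices, shard allocation and nodes as input to a configuration review.
# The endpoint is a placeholder.
import requests

ES_ENDPOINT = "https://your-es-endpoint:9200"   # placeholder

for path in ("_cat/indices?v", "_cat/shards?v", "_cat/nodes?v"):
    resp = requests.get(f"{ES_ENDPOINT}/{path}", timeout=30)
    resp.raise_for_status()
    print(f"### {path}\n{resp.text}")
```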
...
Lessons Learned
We hardly use ES - should we look at how/whether we use ES going forward? Or is there a more efficient way of handling what ES handles?
When there’s no logging on a service and it fails, it’s hard work doing an RCA!