Date | |
Authors | Reuben Roberts, Joseph (Pepe) Kelly, John Simmons, Marcello Fabbri, Doris Wong, Cai Willis |
Status | Documenting the live defect: done. Investigation: incomplete due to insufficient logging |
Summary | ElasticSearch's utilisation spiked and made it unresponsive to TCS's requests |
Impact | Users were unable to use TIS for roughly 20 minutes |
...
TIS could not function properly because the backing search database (ElasticSearch) was overloaded.
Logging on ElasticSearch was not enabled (it now is), and it is where developers would normally go first to diagnose problems.
In the absence of ES logging, we checked TCS, bulk upload, Reval and the sync services (the services that interact with ES), but found nothing that had obviously caused the issue.
The immediate LiveDefect has been resolved. Whilst we have not been able to prevent it from recurring, we expect the logging now in place will enable us to identify and resolve any future recurrence.
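For reference against the slow-log action item below, this is a minimal sketch of the kind of per-index settings change involved; the endpoint, index name and thresholds are placeholders, and the values actually applied to the TIS cluster may differ.

```python
import requests

# Hypothetical endpoint and index name, for illustration only.
ES_ENDPOINT = "https://example-es-domain:9200"
INDEX = "persons"

# Search slow logs are configured per index in Elasticsearch: query or fetch
# phases slower than these thresholds are written to the slow log, which is
# what we would then inspect after a spike like this one.
slowlog_settings = {
    "index.search.slowlog.threshold.query.warn": "10s",
    "index.search.slowlog.threshold.query.info": "5s",
    "index.search.slowlog.threshold.fetch.warn": "1s",
}

response = requests.put(f"{ES_ENDPOINT}/{INDEX}/_settings", json=slowlog_settings, timeout=30)
response.raise_for_status()
print(response.json())
```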
...
Trigger
The ElasticSearch cluster became overloaded.
...
13:51 BST - CloudWatch shows a spike in memory and CPU utilisation (see the metrics sketch after this timeline)
13:57 BST - Slack notification about a FAILING Health Check on TCS Prod
14:00 BST - Identified that TCS's issue was a failing connection to ElasticSearch
14:01 BST - Users reported being unable to use TIS, as the main screen kept refreshing
~14:15 BST - A security update was run as a way of restarting the servers (the clusters cannot be restarted manually)
14:17 BST - Slack notification about a SUCCESSFUL Health Check on TCS Prod
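For anyone retracing the 13:51 CloudWatch observation, a sketch of pulling the same CPU and memory metrics programmatically is below. It assumes the cluster is an Amazon Elasticsearch Service domain publishing to the AWS/ES namespace; the domain name, account ID, region and time window are placeholders.

```python
from datetime import datetime, timezone

import boto3

# Region is an assumption; use whichever region hosts the ES domain.
cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")

# Placeholder domain name and account ID; real values come from the TIS AWS account.
dimensions = [
    {"Name": "DomainName", "Value": "example-es-domain"},
    {"Name": "ClientId", "Value": "123456789012"},
]

for metric in ("CPUUtilization", "JVMMemoryPressure"):
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/ES",
        MetricName=metric,
        Dimensions=dimensions,
        StartTime=datetime(2020, 1, 1, 12, 30, tzinfo=timezone.utc),  # placeholder window
        EndTime=datetime(2020, 1, 1, 13, 30, tzinfo=timezone.utc),
        Period=60,
        Statistics=["Maximum"],
    )
    # Print the per-minute maximums in time order to make any spike obvious.
    for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
        print(metric, point["Timestamp"], point["Maximum"])
```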
Root Cause(s)
...
Memory spike | CPU spike |
---|---|
| ES being over-utilised. Given that it normally idles at around 10-15%, it would take something really big to push it to 100% CPU utilisation across all 3 nodes. We had previously had issues with the number of servers in the cluster and settled on 3. Could there be something wrong with the elected master? Could a routine have been triggered (even accidentally) that loaded in a large volume of data and caused the spike? |

What services are heavily associated with ES? Can we investigate each and discount them as culprits (in the absence of logging)?
Without ES logging, WHYs #2-5 cannot be answered.
Action Items
Action Items | Owner | Status |
---|---|---|
Enable slow logs to identify faulty requests | | In place |
Investigate any spikes in TCS logging (reconvene at 3.30 to share findings) | | Nothing significant to report |
Investigate spikes in bulk upload | | |
Investigate spikes in the sync service | | |
Investigate how to restart clusters without having to apply an update; check whether AWS has notified us of any changes | | TBC |
Possible follow ups
Possible follow up | Comments |
---|---|
Review cluster configuration (e.g. shards) and consider benchmark testing alternatives (see the sketch below) | Might be worthwhile, but it could be a significant amount of effort; it should be coordinated and probably time-boxed. |
Investigate how to restart clusters without having to apply an update | |
Apply auto-tuning to the cluster | |
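As a starting point for the cluster configuration review, the sketch below pulls cluster health and the per-index shard layout from the standard ES APIs; the endpoint is a placeholder and the exact checks would be agreed as part of that time-boxed piece of work.

```python
import requests

# Placeholder endpoint, for illustration only.
ES_ENDPOINT = "https://example-es-domain:9200"

# Overall cluster health: node count, shard counts and status (green/yellow/red).
health = requests.get(f"{ES_ENDPOINT}/_cluster/health", timeout=30).json()
print(
    f"status={health['status']} nodes={health['number_of_nodes']} "
    f"active_shards={health['active_shards']} unassigned={health['unassigned_shards']}"
)

# Per-index shard layout: useful for spotting indices carrying far more shards
# than their size justifies before benchmarking any alternative configuration.
shards = requests.get(
    f"{ES_ENDPOINT}/_cat/shards", params={"format": "json", "bytes": "mb"}, timeout=30
).json()
for shard in shards:
    print(shard["index"], shard["shard"], shard["prirep"], shard.get("store"), shard.get("node"))
```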
...
Lessons Learned
We make only light use of ES. Should we review how (and whether) we use ES going forward, or whether there is a more efficient way of handling what ES currently does for us?
...