2021-06-16 ElasticSearch overload
Date | Jun 16, 2021 |
Authors | @Reuben Roberts, @Joseph (Pepe) Kelly, @John Simmons (Deactivated), @Marcello Fabbri (Unlicensed), @Doris.Wong, @Cai Willis |
Status | Live defect resolved. Investigation incomplete due to insufficient logging |
Summary | ElasticSearch utilisation spiked and made it unresponsive to TCS’s requests |
Impact | Users could not use TIS for approximately 20 minutes. |
Non-technical Description
TIS could not function properly because the backing search database (ElasticSearch) was overloaded.
Logging on ElasticSearch was not enabled (it is now); this is where developers would normally go first to diagnose problems.
In the absence of logging on ES, we checked TCS, bulk upload, Reval and the Sync services (the services that interact with ES), but found nothing that had obviously caused the issue.
The immediate live defect has been resolved, and whilst we have not been able to put anything in place to prevent it recurring, we expect the logging now enabled will allow us to identify and resolve any future recurrence.
Trigger
The ElasticSearch cluster became overloaded.
Detection
A monitoring message on Slack at 13:57 BST reported a failed health check on TCS Blue. TIS became unusable.
Resolution
Running a security update on the ElasticSearch cluster restarted the servers (the cluster cannot be restarted manually), which restored service.
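For future reference, if the cluster is an AWS-managed ElasticSearch domain (which the CloudWatch metrics and the "security update as restart" workaround suggest), the same restart-via-update can be triggered programmatically. This is a minimal sketch only: the domain name is a placeholder and it assumes the boto3 `es` client is available and credentialled.

```python
# Sketch: trigger the managed-service software update that was used to restart
# the cluster. Assumes an AWS-managed ElasticSearch domain; DOMAIN_NAME is a
# placeholder, and in practice this was done via the AWS console.
import boto3

DOMAIN_NAME = "tis-elasticsearch"  # placeholder, not the real domain name

es = boto3.client("es")

# Kick off the service software update; AWS restarts/replaces nodes as part of it.
response = es.start_elasticsearch_service_software_update(DomainName=DOMAIN_NAME)
print(response["ServiceSoftwareOptions"]["UpdateStatus"])
```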
Timeline
Jun 16, 2021: 13:51 BST - CloudWatch shows a spike in memory and CPU utilisation
Jun 16, 2021: 13:57 BST - Slack notification about a FAILING Health Check on TCS Prod
Jun 16, 2021: 14:00 BST - Identified that TCS’s issue was a failing connection to ElasticSearch
Jun 16, 2021: 14:01 BST - Users reported being unable to use TIS, as the main screen kept refreshing
Jun 16, 2021: ~14:15 BST - A security update was run as a way to restart the servers (as the clusters can’t be restarted manually)
Jun 16, 2021: 14:17 BST - Slack notification about a SUCCESSFUL Health Check on TCS Prod
Root Cause(s)
CloudWatch recorded a spike in memory and CPU utilisation on the ElasticSearch cluster.
WHY #1: What caused the memory and CPU spike?
Possible explanations considered:
ES being over-utilised. Given it normally idles away at around 10-15%, it would take something really big to push it up to 100% CPU utilisation across 3 nodes. We had previously had issues with the number of servers in the cluster and settled on 3 - might there be something wrong with the “elected master”?
A routine triggered (even by accident) to run a load of data in, which may have gone wrong.
Services heavily associated with ES, each to be investigated and discounted as a culprit (in the absence of logging):
TCS - People, Programme Membership, GMC details, Programme
Reval (when updates to Programme Membership, GMC details or Programme occur in TIS) - not enough logging to tell whether this was involved
Bulk upload - People
Sync service
Without ES logging, WHYs #2-5 cannot be answered.
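When a similar spike happens again, a quick first check alongside the slow logs is to pull CPU and JVM heap figures straight from the nodes stats API. The sketch below is illustrative only: the endpoint URL and the "high" thresholds are assumptions, not our actual configuration.

```python
# Sketch: poll ElasticSearch node stats for CPU and JVM heap pressure.
# ES_URL is a placeholder; adjust host/auth for the real cluster.
import requests

ES_URL = "https://elasticsearch.example.internal:9200"  # placeholder, not our real host

def node_pressure(es_url: str = ES_URL) -> None:
    """Print CPU and JVM heap usage per node, flagging anything unusually high."""
    stats = requests.get(f"{es_url}/_nodes/stats/os,jvm", timeout=10).json()
    for node in stats["nodes"].values():
        cpu = node["os"]["cpu"]["percent"]
        heap = node["jvm"]["mem"]["heap_used_percent"]
        flag = "  <-- investigate" if cpu > 80 or heap > 85 else ""
        print(f"{node['name']}: cpu={cpu}% heap={heap}%{flag}")

if __name__ == "__main__":
    node_pressure()
```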
Action Items
Action Items | Owner | Status |
---|---|---|
Enable slow logs to identify faulty requests (see the sketch below the table) | @John Simmons (Deactivated) | In place |
Investigate any spikes in TCS logging | @Reuben Roberts | Nothing significant to report |
Investigate spikes in bulk upload | @Joseph (Pepe) Kelly | |
Investigate spikes in the sync service | @Marcello Fabbri (Unlicensed) | |
Check whether AWS has notified us of any changes | @John Simmons (Deactivated) | TBC |
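As a pointer for the slow-logs action item: search and indexing slow logs are enabled per index through the index settings API. The sketch below shows one plausible way to set them from Python; the index name, thresholds and host are illustrative assumptions rather than the values applied to our cluster.

```python
# Sketch: enable search and indexing slow logs on an index via the settings API.
# ES_URL, INDEX and the thresholds are illustrative assumptions only.
import requests

ES_URL = "https://elasticsearch.example.internal:9200"  # placeholder host
INDEX = "persons"  # hypothetical index name

slowlog_settings = {
    # Log any search query slower than these thresholds.
    "index.search.slowlog.threshold.query.warn": "10s",
    "index.search.slowlog.threshold.query.info": "5s",
    "index.search.slowlog.threshold.fetch.warn": "1s",
    # Log slow indexing (write) operations as well.
    "index.indexing.slowlog.threshold.index.warn": "10s",
    "index.indexing.slowlog.threshold.index.info": "5s",
}

resp = requests.put(f"{ES_URL}/{INDEX}/_settings", json=slowlog_settings, timeout=10)
resp.raise_for_status()
print(resp.json())  # expect {"acknowledged": true}
```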
Possible follow ups
Review cluster configuration (e.g. shards) and consider benchmark testing alternatives - might be worthwhile, but it could be a bunch of effort (should be coordinated and probably time-boxed). See the sketch below.
Investigate how to restart clusters without having to do an update.
Apply auto-tuning to the cluster.
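If the cluster configuration review goes ahead, the _cat APIs give a quick view of shard layout and balance across the 3 nodes. A minimal sketch, assuming direct HTTP access to the cluster (placeholder URL):

```python
# Sketch: list shard allocation per index and per node using the _cat APIs.
# ES_URL is a placeholder; the _cat endpoints return plain-text tables.
import requests

ES_URL = "https://elasticsearch.example.internal:9200"  # placeholder host

# Shards per index: primaries/replicas, their size, and which node holds them.
shards = requests.get(f"{ES_URL}/_cat/shards?v&s=index", timeout=10)
print(shards.text)

# Disk and shard distribution per node: a quick check for imbalance.
alloc = requests.get(f"{ES_URL}/_cat/allocation?v", timeout=10)
print(alloc.text)
```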
Lessons Learned
We hardly use ES - should we look at how/whether we use ES going forward? Or is there a more efficient way of handling what ES handles?
When there’s no logging on a service and it fails, it’s hard work doing an RCA!
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213