Date |
Authors | Reuben Roberts, Joseph (Pepe) Kelly, John Simmons, Marcello Fabbri, Doris Wong, Cai Willis
Status | Investigating / Documenting
Summary | ElasticSearch's utilisation spiked, making it unresponsive to TCS's requests
Impact | Users were unable to use TIS for approximately 20 minutes.
Non-technical Description
TIS could not function properly because the backing search database (ElasticSearch) was overloaded.
Trigger
The ElasticSearch cluster became overloaded.
Detection
A monitoring message on Slack at 13:57 BST reported a failed health check on TCS Blue, and TIS became unusable.
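When the TCS health check fails like this, a quick way to confirm whether ElasticSearch itself is the problem is the standard _cluster/health API. A minimal sketch follows; the endpoint URL is a placeholder, not the actual TIS configuration.

```python
# Minimal sketch: confirm ElasticSearch cluster state when the TCS health
# check fails. The endpoint URL below is a placeholder.
import requests

ES_URL = "http://elasticsearch.example.internal:9200"  # hypothetical address


def check_cluster_health(timeout_seconds: int = 5) -> None:
    """Query the standard _cluster/health API and print the headline fields."""
    try:
        response = requests.get(f"{ES_URL}/_cluster/health", timeout=timeout_seconds)
        response.raise_for_status()
        health = response.json()
        # 'status' is green/yellow/red; 'number_of_nodes' should be 3 for this cluster.
        print(f"status={health['status']} nodes={health['number_of_nodes']} "
              f"pending_tasks={health['number_of_pending_tasks']}")
    except requests.RequestException as exc:
        # An overloaded, unresponsive cluster typically times out here.
        print(f"Cluster unreachable: {exc}")


if __name__ == "__main__":
    check_cluster_health()
```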
Resolution
Running a security update on the ElasticSearch cluster restarted the servers (the clusters cannot be restarted manually, so the update was used as a workaround).
Timeline
13:51 BST - CloudWatch shows a spike in memory and CPU utilisation
13:57 BST - Slack notification about a FAILING Health Check on TCS Prod
14:00 BST - Identified that TCS's issue was a failing connection to ElasticSearch
14:01 BST - Users notice being unable to use TIS, as the main screen keeps refreshing
~14:15 BST - A security update is run as a way of restarting the servers (the clusters can't be restarted manually)
14:17 BST - Slack notification about a SUCCESSFUL Health Check on TCS Prod
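For reference, the 13:51 spike can be pulled out of CloudWatch programmatically. The sketch below assumes the AWS-managed ElasticSearch service (namespace AWS/ES) and uses a placeholder domain name, account ID and region.

```python
# Sketch of pulling the CPU spike seen at 13:51 from CloudWatch with boto3.
# Domain name, account ID and region are placeholders; the namespace and
# metric name are the standard ones for the AWS-managed ElasticSearch service.
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-2")  # assumed region


def cpu_utilisation(domain_name: str, account_id: str, minutes: int = 60):
    """Return per-minute maximum CPUUtilization datapoints for the ES domain."""
    end = datetime.utcnow()
    start = end - timedelta(minutes=minutes)
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/ES",
        MetricName="CPUUtilization",
        Dimensions=[
            {"Name": "DomainName", "Value": domain_name},
            {"Name": "ClientId", "Value": account_id},
        ],
        StartTime=start,
        EndTime=end,
        Period=60,
        Statistics=["Maximum"],
    )
    return sorted(stats["Datapoints"], key=lambda d: d["Timestamp"])


if __name__ == "__main__":
    # Placeholder domain name and account ID.
    for point in cpu_utilisation("tis-elasticsearch", "123456789012"):
        print(point["Timestamp"], point["Maximum"])
```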
Root Cause(s)
CloudWatch showed a spike in memory and CPU utilisation on the ElasticSearch cluster.
WHY #1
Memory spike / CPU spike:
- ES was being over-utilised. It normally idles at around 10-15%, so it would take something substantial to push it to 100% CPU utilisation across all 3 nodes.
- We previously had issues with the number of servers in the cluster and settled on 3. Might there be something wrong with the elected master?
- Could any routine have been triggered (even accidentally) that loaded a large amount of data in and caused the problem?
- What services are heavily associated with ES? Can we investigate each and discount them as culprits (in the absence of logging)?
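If the cluster is still reachable while a spike is in progress, the standard hot threads and node stats APIs can help narrow down which node and which workload is responsible. A rough sketch with a placeholder URL (exact response field names can vary between ES versions):

```python
# Sketch for narrowing down what is consuming CPU/heap on the nodes, assuming
# the cluster is still reachable. Both endpoints are standard ElasticSearch
# APIs; the URL is a placeholder.
import requests

ES_URL = "http://elasticsearch.example.internal:9200"  # hypothetical address


def dump_hot_threads() -> None:
    """Print the busiest threads per node (plain-text response)."""
    response = requests.get(f"{ES_URL}/_nodes/hot_threads", timeout=10)
    response.raise_for_status()
    print(response.text)


def dump_node_pressure() -> None:
    """Print heap and CPU figures per node from the node stats API."""
    response = requests.get(f"{ES_URL}/_nodes/stats/jvm,os", timeout=10)
    response.raise_for_status()
    for node_id, node in response.json()["nodes"].items():
        heap_used = node["jvm"]["mem"]["heap_used_percent"]
        cpu = node["os"]["cpu"]["percent"]
        print(f"{node['name']}: heap_used={heap_used}% cpu={cpu}%")


if __name__ == "__main__":
    dump_hot_threads()
    dump_node_pressure()
```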
Action Items
Action Items | Owner | Status |
---|---|---|
Enable slow logs to identify faulty requests (see the sketch below) | | |
Investigate any spikes in TCS logging | | |
Investigate spikes in bulk upload | | |
Investigate spikes in sync service | | |
Investigate how to restart clusters without having to run an update | | |

Reconvene at 3.30 to share findings.
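For the slow-log action item, the search and indexing slow-log thresholds are dynamic index settings that can be applied through the _settings API. The sketch below uses a placeholder URL, index name and threshold values; note that on the AWS-managed service the resulting logs also need to be published to CloudWatch Logs via the domain's log publishing options before they can be read.

```python
# Sketch for the "enable slow logs" action item. The slow-log thresholds are
# standard dynamic index settings; the URL, index name and threshold values
# here are illustrative only.
import requests

ES_URL = "http://elasticsearch.example.internal:9200"  # hypothetical address
INDEX = "persons"  # hypothetical index name

SLOWLOG_SETTINGS = {
    "index.search.slowlog.threshold.query.warn": "5s",
    "index.search.slowlog.threshold.query.info": "1s",
    "index.search.slowlog.threshold.fetch.warn": "1s",
    "index.indexing.slowlog.threshold.index.warn": "5s",
}

response = requests.put(
    f"{ES_URL}/{INDEX}/_settings",
    json=SLOWLOG_SETTINGS,
    timeout=10,
)
response.raise_for_status()
print(response.json())  # expect {"acknowledged": true}
```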
Possible follow ups
Review cluster configuration (e.g. shards) and consider benchmark testing alternatives
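To support that follow-up, the _cat/shards API shows how shards are currently spread across the 3 nodes. A minimal sketch with a placeholder URL:

```python
# Sketch: list shard allocation across the nodes using the standard
# _cat/shards API. The URL is a placeholder.
import requests

ES_URL = "http://elasticsearch.example.internal:9200"  # hypothetical address

response = requests.get(
    f"{ES_URL}/_cat/shards",
    params={"v": "true", "h": "index,shard,prirep,state,docs,store,node"},
    timeout=10,
)
response.raise_for_status()
print(response.text)
```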
Lessons Learned
We hardly use ES - should we look at how/whether we use ES going forward? Or is there a more efficient way of handling what ES handles?
When there’s no logging on a service and it fails, it’s hard work doing an RCA!