
Date

Authors

Reuben Roberts, Joseph (Pepe) Kelly, John Simmons (Deactivated), Marcello Fabbri (Unlicensed), Doris.Wong, Cai Willis

Status

LiveDefect done. Investigating incomplete due to insufficient logging

Summary

ElasticSearch's utilisation spiked and made it unresponsive to TCS's requests

Impact

Users couldn't use TIS for a period of around 20 minutes.


Non-technical Description

ElasticSearch saw a sharp, momentary increase in utilisation on Prod. As a result, TIS could not function properly: ElasticSearch was unresponsive during the spike, so requests timed out.

Trigger

...

because the backing search database (ElasticSearch) was overloaded.
Logging on ElasticSearch was not enabled (it is now), and this is where developers would normally go first to diagnose problems.
In the absence of logging on ES, we checked TCS, bulk upload, Reval and the Sync services (the services that interact with ES), but found nothing that had obviously caused the issue.

The immediate LiveDefect has been resolved, and whilst we haven't been able to prevent it from recurring, we expect the logging now in place will enable us to identify and resolve any future recurrence.
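For reference, search and indexing slow logs are enabled per index through ElasticSearch's index settings API (on AWS the domain also needs log publishing to CloudWatch Logs switched on). The snippet below is a minimal sketch only: the endpoint, index name and thresholds are placeholders rather than the values actually configured on Prod, and it assumes the cluster accepts unsigned requests from our network (otherwise the requests would need to be SigV4-signed).

    import requests

    # Placeholder endpoint and index - not the real Prod values.
    ES_ENDPOINT = "https://example-es-domain.eu-west-1.es.amazonaws.com"
    INDEX = "persons"

    # Queries/indexing operations slower than these thresholds get written to
    # the slow logs, which can then be published to CloudWatch Logs.
    settings = {
        "index.search.slowlog.threshold.query.warn": "5s",
        "index.search.slowlog.threshold.query.info": "2s",
        "index.search.slowlog.threshold.fetch.warn": "1s",
        "index.indexing.slowlog.threshold.index.warn": "5s",
    }

    resp = requests.put(f"{ES_ENDPOINT}/{INDEX}/_settings", json=settings, timeout=30)
    resp.raise_for_status()
    print(resp.json())  # expect {"acknowledged": true}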

...

Trigger

  • The ElasticSearch cluster became overloaded.

...

Detection

  • A monitoring message on Slack at 13:57 BST reported a failed health check on TCS Blue. TIS became unusable.
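As an aside on triage: when a TCS health check fails, the cluster health API is a quick way to confirm whether ElasticSearch itself is struggling. A minimal sketch, with a placeholder endpoint:

    import requests

    ES_ENDPOINT = "https://example-es-domain.eu-west-1.es.amazonaws.com"  # placeholder

    # _cluster/health returns the overall status (green/yellow/red) plus node and
    # shard counts; a timeout here is itself a strong hint the cluster is overloaded.
    try:
        resp = requests.get(f"{ES_ENDPOINT}/_cluster/health", timeout=5)
        resp.raise_for_status()
        health = resp.json()
        print(health["status"], health["number_of_nodes"], health["active_shards"])
    except requests.exceptions.RequestException as exc:
        print(f"Cluster not responding: {exc}")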

...

Resolution

  • Running a security update on the ElasticSearch cluster restarted the servers.

...

  • 13:51 BST - CloudWatch shows a spike in memory and CPU utilisation

  • 13:57 BST - Slack notification about a FAILING Health Check on TCS Prod

  • 14:00 BST - Identified that TCS's issue was a failing connection to ElasticSearch

  • 14:01 BST - Users noticed being unable to use TIS, as the main screen kept updating

  • ~14:15 BST - A security update was run as a way to restart the servers (as the clusters can't be restarted manually)

  • 14:17 BST - Slack notification about a SUCCESSFUL Health Check on TCS Prod

Root Cause(s)

CloudWatch spike in memory and CPU utilisation

WHY #1

Memory spike

CPU spike

  • Memory usage (JVM pressure) increased steadily during the morning, reaching 75% at 12:30 UTC.

  • Once JVM memory pressure reached 75%, Amazon ES triggered the Concurrent Mark Sweep (CMS) garbage collector. Some memory was reclaimed, but JVM memory pressure again reached 75% at 13:05 UTC, triggering another garbage collection.
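The JVM memory pressure figures above come from CloudWatch. A minimal sketch of pulling the same metric with boto3 follows; the domain name and account id are placeholders (AWS publishes the metric in the AWS/ES namespace with DomainName and ClientId dimensions):

    from datetime import datetime, timedelta
    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")

    # JVMMemoryPressure reflects the fill level of the old-generation heap.
    # DomainName and ClientId (the AWS account id) below are placeholders.
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/ES",
        MetricName="JVMMemoryPressure",
        Dimensions=[
            {"Name": "DomainName", "Value": "example-es-domain"},
            {"Name": "ClientId", "Value": "123456789012"},
        ],
        StartTime=datetime.utcnow() - timedelta(hours=6),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=["Maximum"],
    )
    for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
        print(point["Timestamp"], point["Maximum"])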

...

  • Garbage collection is a CPU-intensive process, pushing CPU utilisation to 100% between 12:50 and 13:15 UTC.

  • It is possible that during this period the cluster was encountering ClusterBlockException and/or JVM OutOfMemoryError; there were definitely cluster performance issues (as per https://aws.amazon.com/premiumsupport/knowledge-center/high-jvm-memory-pressure-elasticsearch/). Error logging has now been enabled on the cluster to provide this level of detail in future.

  • There is a range of possible reasons for the steady increase in JVM memory pressure. In a general sense, the cluster may be configured sub-optimally. In particular, the number of shards (5) may be too high for the persons / masterdoctorindex indices, given these both comprise less than 300 MB in total. AWS recommends shard sizes between 10–50 GiB, as “too many small shards can cause performance issues and out of memory errors”. Elastic recommends an “average shard size between at least a few GB and a few tens of GB”. Benchmarking performance with fewer shards could confirm whether redesigning the cluster would be advantageous (see the sketch after this list for a quick way to check the current shard sizes).

  • During normal usage, the ElasticSearch cluster shows gradual increases in JVM memory pressure, followed by garbage reclamation, in the normal saw-tooth pattern. As JVM memory pressure is a measure of the fill rate of the old generation pool, this reflects the accumulation of long-lived objects (e.g. cached searches?) in memory, and is not inherently problematic.

  • Nightly sync runs cause a spike in indexing activity, and JVM memory pressure commonly increases at that point; however, this incident occurred outside of that timeframe. There were no apparent significant sync jobs being run at the time of cluster failure. Given that JVM memory pressure had been steadily increasing during the course of the morning, it is more likely that slightly higher than normal usage, or unusually diverse searches, were causing an accumulation of long-lived objects. (Would Google Analytics confirm this?)
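To sanity-check the shard counts and sizes discussed above, the _cat APIs give a quick per-index and per-shard view. A sketch, again with a placeholder endpoint and assuming unsigned requests are accepted:

    import requests

    ES_ENDPOINT = "https://example-es-domain.eu-west-1.es.amazonaws.com"  # placeholder

    # Each index with its primary/replica shard counts and total size in MB.
    resp = requests.get(
        f"{ES_ENDPOINT}/_cat/indices/persons,masterdoctorindex",
        params={"v": "true", "h": "index,pri,rep,store.size", "bytes": "mb"},
        timeout=10,
    )
    print(resp.text)

    # Per-shard breakdown, useful for spotting lots of tiny shards.
    resp = requests.get(
        f"{ES_ENDPOINT}/_cat/shards/persons,masterdoctorindex",
        params={"v": "true", "h": "index,shard,prirep,store"},
        timeout=10,
    )
    print(resp.text)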

ES being over-utilised.

Given it normally idles away at around 10-15%, it would take something REALLY BIG to push it up to 100% CPU utilisation across 3 nodes.

We were having issues with the number of servers in the cluster. We settled on 3.

Might there be something wrong with the “Elected master”?

Were there any routines that could have been triggered (even by accident) to load in a large amount of data, which may have gone wrong?

What services are heavily associated with ES? Can we investigate each and discount them as culprits (in the absence of logging)?

  •  TCS - People, Programme Membership, GMC details, Programme []
  •  Reval (when updates to Programme Membership, GMC details, Programme in TIS occur) [Not enough logging to tell whether this might have anything to do with things]
  •  Bulk upload - People
  •  Sync service

Without ES logging, WHYs #2-5 cannot be answered.

Action Items

Action Item | Owner | Status
Enable slow logs to figure out faulty requests | John Simmons (Deactivated) | In place
Investigate any spikes in TCS logging | Reuben Roberts | Nothing significant to report
Investigate spikes in bulk upload | Joseph (Pepe) Kelly |
Investigate spikes in sync service | Marcello Fabbri (Unlicensed) |
Check whether AWS have notified us of any changes | John Simmons (Deactivated) | TBC

Possible follow ups

Review cluster configuration (e.g. shards) and consider benchmark testing alternatives

Might be worthwhile, but it could be a lot of effort - it should be coordinated and probably time-boxed.

Investigate how to restart clusters without having to do an update?

Apply auto-tuning to the cluster
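On the cluster configuration review above: one relatively low-risk way to benchmark fewer shards is to copy an index into a single-shard version with the _reindex API and compare representative query timings against both. A sketch with placeholder names (persons_1shard is hypothetical, and the endpoint is a placeholder):

    import requests

    ES_ENDPOINT = "https://example-es-domain.eu-west-1.es.amazonaws.com"  # placeholder

    # 1. Create a copy of the persons index with a single primary shard.
    requests.put(
        f"{ES_ENDPOINT}/persons_1shard",
        json={"settings": {"number_of_shards": 1, "number_of_replicas": 1}},
        timeout=30,
    ).raise_for_status()

    # 2. Reindex the existing documents into it (runs as a background task).
    requests.post(
        f"{ES_ENDPOINT}/_reindex",
        params={"wait_for_completion": "false"},
        json={"source": {"index": "persons"}, "dest": {"index": "persons_1shard"}},
        timeout=30,
    ).raise_for_status()

Query timings against persons and persons_1shard could then be compared under similar load before deciding whether to change the live indices.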

...

Lessons Learned

We hardly use ES - should we look at how/whether we use ES going forward? Or is there a more efficient way of handling what ES handles?

When there’s no logging on a service and it fails, it’s hard work doing an RCA!