Date	16 Jun 2021
Authors
Status	Documenting
Summary	ElasticSearch’s utilization spiked and made it unresponsive to TCS’s requests
Impact	Users were unable to use TIS

Non-technical Description

ElasticSearch saw a sharp, momentary increase in utilisation on Prod. ElasticSearch was unresponsive during the spike, which caused requests to time out, so TIS could not function properly.

Trigger

Detection

Resolution

A security update was run on the ElasticSearch cluster, which restarted the servers.

Timeline

16 Jun 2021: 13:51 BST - CloudWatch shows a spike in memory and CPU utilisation
16 Jun 2021: 13:57 BST - Slack notification about a FAILING Health Check on TCS Prod
16 Jun 2021: 14:00 BST - Identified that the TCS issue was a failing connection to ElasticSearch
16 Jun 2021: 14:01 BST - Users noticed they were unable to use TIS, as the main screen kept updating
16 Jun 2021: ~14:15 BST - A security update was run as a way to restart the servers (as the cluster cannot be restarted manually)
16 Jun 2021: 14:17 BST - Slack notification about a SUCCESSFUL Health Check on TCS Prod
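The timeline above is in BST, while the Root Cause(s) section quotes times in UTC. The following is a minimal sketch for lining the two up and for measuring the gap between the failing and successful health checks. It assumes a fixed UTC+1 offset for BST on the day; the helper names are hypothetical and only the times themselves come from this report.

```python
from datetime import datetime, timedelta, timezone

# Assumption: BST is modelled as a fixed UTC+1 offset, which holds on 16 Jun 2021.
BST = timezone(timedelta(hours=1), "BST")


def utc_to_bst(hour: int, minute: int) -> datetime:
    """Convert a UTC wall-clock time on 16 Jun 2021 to BST."""
    return datetime(2021, 6, 16, hour, minute, tzinfo=timezone.utc).astimezone(BST)


def bst(hour: int, minute: int) -> datetime:
    """Build a BST timestamp on 16 Jun 2021."""
    return datetime(2021, 6, 16, hour, minute, tzinfo=BST)


# CPU was reported at 100% between 12:50 and 13:15 UTC.
cpu_window_bst = (utc_to_bst(12, 50), utc_to_bst(13, 15))

# Time between the FAILING and SUCCESSFUL health check notifications.
health_check_gap = bst(14, 17) - bst(13, 57)

if __name__ == "__main__":
    start, end = cpu_window_bst
    print(f"CPU at 100%: {start:%H:%M} to {end:%H:%M} BST")
    print(f"Health check failing for about {health_check_gap}")
```

Under these assumptions the CPU window maps onto the period between the first CloudWatch spike and the restart recorded in the timeline.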

Root Cause(s)

Memory usage (JVM pressure) increased steadily during the morning, reaching 75% at 12:30 UTC.
Once JVM memory pressure reached 75%, Amazon ES triggered the Concurrent Mark Sweep (CMS) garbage collector. Some memory was reclaimed, but JVM memory pressure reached 75% again at 13:05 UTC, triggering another garbage collection. Garbage collection is a CPU-intensive process, and it pushed CPU utilisation to 100% between 12:50 and 13:15 UTC (13:50 to 14:15 BST).
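As a rough illustration of the 75% trigger described above, the sketch below scans a series of JVM memory pressure samples and reports each time the value rises to or above the threshold. The function name and the sample values are hypothetical and purely illustrative; they are not the CloudWatch data from this incident, although the two crossing times are chosen to match the ones quoted above.

```python
from typing import List, Optional, Tuple

# Threshold at which garbage collection was triggered, as described in this report.
CMS_THRESHOLD = 75.0


def upward_crossings(
    samples: List[Tuple[str, float]], threshold: float = CMS_THRESHOLD
) -> List[str]:
    """Return the timestamps at which the value rises to or above the threshold.

    A first sample that is already at or above the threshold counts as a crossing.
    """
    crossings: List[str] = []
    previous: Optional[float] = None
    for timestamp, value in samples:
        if value >= threshold and (previous is None or previous < threshold):
            crossings.append(timestamp)
        previous = value
    return crossings


# Illustrative samples only (time in UTC, JVM memory pressure in percent).
example_samples = [
    ("12:00", 68.0),
    ("12:30", 75.0),
    ("12:35", 61.0),
    ("13:05", 75.0),
    ("13:10", 70.0),
]

if __name__ == "__main__":
    print(upward_crossings(example_samples))
```

Counting crossings per hour in this way is one possible basis for an alert that fires when collections become unusually frequent, rather than on a single high reading.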
It is possible that during this period the cluster was encountering ClusterBlockException and/or JVM OutOfMemoryError; there were definitely cluster performance issues (as per https://aws.amazon.com/premiumsupport/knowledge-center/high-jvm-memory-pressure-elasticsearch/). Error logging has now been enabled on the cluster to provide this level of detail in future.
There is a range of possible reasons for the steady increase in JVM memory pressure. In general, the cluster may be configured sub-optimally. In particular, the number of shards (5) may be too high for the persons / masterdoctorindex indices, given that together they comprise less than 300 MB in total. AWS recommends a shard size of 10–50 GiB, as “too many small shards can cause performance issues and out of memory errors”. Elastic recommends an “average shard size between at least a few GB and a few tens of GB”. Benchmarking performance with fewer shards could confirm whether redesigning the cluster would be advantageous.
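The shard sizing argument above comes down to simple arithmetic, sketched here. The sketch assumes 300 MB is an upper bound on the data spread across 5 primary shards and takes the top of the AWS range (50 GiB) as the largest acceptable shard; the function names are hypothetical, and the result is a starting point for benchmarking rather than a recommendation.

```python
import math

MB = 10 ** 6
GIB = 1024 ** 3


def average_shard_bytes(index_bytes: int, shards: int) -> float:
    """Average size of each shard if the data is spread evenly."""
    if shards <= 0:
        raise ValueError("shards must be positive")
    return index_bytes / shards


def minimum_primary_shards(index_bytes: int, max_shard_bytes: int = 50 * GIB) -> int:
    """Smallest shard count that keeps every shard at or below max_shard_bytes."""
    if index_bytes < 0 or max_shard_bytes <= 0:
        raise ValueError("sizes must be non-negative and the maximum must be positive")
    return max(1, math.ceil(index_bytes / max_shard_bytes))


if __name__ == "__main__":
    index_bytes = 300 * MB  # upper bound quoted in this report
    per_shard = average_shard_bytes(index_bytes, 5)
    print(f"Average shard size with 5 shards: {per_shard / MB:.0f} MB")
    print(f"Minimum shards at a 50 GiB cap: {minimum_primary_shards(index_bytes)}")
```

With 5 shards each shard averages at most about 60 MB, far below the 10 GiB lower bound that AWS suggests, which is why a single primary shard is the natural candidate to benchmark against.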
During normal usage, the ElasticSearch cluster shows gradual increases in JVM memory pressure, followed by garbage reclamation, in the normal saw-tooth pattern. As JVM memory pressure is a measure of the fill rate of the old generation pool, this reflects the accumulation of long-lived objects (e.g. cached searches) in memory, and is not necessarily problematic.

Action Items

Action Items	Owner

2021-06-16 ElasticSearch overload

Lessons Learned