Date	Jun 14, 2021
Authors	@Reuben Roberts
Status	LiveDefect done. Investigation incomplete due to insufficient logging.
Summary	Also see: TIS21-1667
Impact	Users had an inaccurate list of People on Admins-UI

Non-technical Description

The Person ElasticSearch Sync job failed to run successfully.

Trigger

The ElasticSearch cluster populated by the Person ElasticSearch Sync job failed while the job was running.

Detection

Monitoring messages posted to Slack.

Resolution

The PersonElasticSearchSync job was rerun manually; the second rerun succeeded.

Timeline

Jun 14, 2021: 02:29 BST - Person ElasticSearch Sync Job starts
Jun 14, 2021: 02:35 BST - Slack notification reporting the failure of this job
Jun 14, 2021: 06:51 BST - Person ElasticSearch Sync Job re-triggered manually
Jun 14, 2021: 07:00 BST - Slack notification reporting the failure of this job
Jun 14, 2021: 07:33 BST - Person ElasticSearch Sync Job re-triggered manually
Jun 14, 2021: 07:44 BST - Slack notification reporting the success of this job

Root Cause(s)

One of the three ElasticSearch nodes in the cluster dropped out at 07:00 BST (and then recovered), coinciding with the failure of the manually re-triggered Person ElasticSearch Sync Job at that time.

Another node appears to have reset during the normal 02:30 BST run, although this was not reflected as a drop in the node count at that time.

In both instances, the number of searchable documents after recovery was a fraction of the correct total of ~2 million.
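
The two symptoms described above (a missing node and a document count far below ~2 million) could be checked automatically before and after a sync run. The following is a minimal sketch under stated assumptions, not the sync job's actual code: the function names, the 5% tolerance and the default values are hypothetical, and the inputs are assumed to be already-parsed JSON responses from Elasticsearch's `_cluster/health` and `_count` APIs (whose `status`, `number_of_nodes` and `count` fields are used here).

```python
from typing import Any, Mapping


def cluster_is_healthy(health: Mapping[str, Any], expected_nodes: int = 3) -> bool:
    """Return True if a parsed `_cluster/health` response reports green
    status and at least `expected_nodes` nodes (three, per this report)."""
    return (
        health.get("status") == "green"
        and health.get("number_of_nodes", 0) >= expected_nodes
    )


def index_looks_complete(count_response: Mapping[str, Any],
                         expected_total: int = 2_000_000,
                         tolerance: float = 0.05) -> bool:
    """Return True if a parsed `_count` response is no more than
    `tolerance` below the expected total (both values are assumptions)."""
    return count_response.get("count", 0) >= expected_total * (1 - tolerance)
```

A job wrapper might call `cluster_is_healthy` before starting and `index_looks_complete` after finishing, raising an alert when either returns False. Requiring green status is a deliberately strict choice; a team could reasonably accept yellow as well.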

The ElasticSearch failures are reflected in the TCS logs, as per:
BLUE:
2021-06-14 01:35:14.286 WARN 1 --- [ XNIO-2 task-6] s.b.a.e.ElasticsearchRestHealthIndicator : Elasticsearch health check failed
java.net.SocketTimeoutException: 30,000 milliseconds timeout on connection http-outgoing-14 [ACTIVE]
at org.elasticsearch.client.RestClient.extractAndWrapCause(RestClient.java:808)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:248)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:235)
at org.springframework.boot.actuate.elasticsearch.ElasticsearchRestHealthIndicator.doHealthCheck(ElasticsearchRestHealthIndicator.java:60)

GREEN:
2021-06-14 01:35:21.984 WARN 1 --- [ XNIO-2 task-17] s.b.a.e.ElasticsearchRestHealthIndicator : Elasticsearch health check failed
java.net.SocketTimeoutException: 30,000 milliseconds timeout on connection http-outgoing-14 [ACTIVE]
at org.elasticsearch.client.RestClient.extractAndWrapCause(RestClient.java:808)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:248)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:235)
at org.springframework.boot.actuate.elasticsearch.ElasticsearchRestHealthIndicator.doHealthCheck(ElasticsearchRestHealthIndicator.java:60)
at org.springframework.boot.actuate.health.AbstractHealthIndicator.health(AbstractHealthIndicator.java:82)
at org.springframework.boot.actuate.health.HealthIndicator.getHealth(HealthIndicator.java:37)
and
2021-06-14 06:01:21.985 WARN 1 --- [ XNIO-2 task-2] s.b.a.e.ElasticsearchRestHealthIndicator : Elasticsearch health check failed
java.net.SocketTimeoutException: 30,000 milliseconds timeout on connection http-outgoing-17 [ACTIVE]
at org.elasticsearch.client.RestClient.extractAndWrapCause(RestClient.java:808)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:248)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:235)
at org.springframework.boot.actuate.elasticsearch.ElasticsearchRestHealthIndicator.doHealthCheck(ElasticsearchRestHealthIndicator.java:60)
at org.springframework.boot.actuate.health.AbstractHealthIndicator.health(AbstractHealthIndicator.java:82)
at org.springframework.boot.actuate.health.HealthIndicator.getHealth(HealthIndicator.java:37)

No AWS CloudWatch logging is configured for the ElasticSearch cluster, which makes it difficult to identify the root cause of the failure any further.
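
The TCS log timestamps above are one hour earlier than the BST times in the timeline, which suggests (the report does not state it) that the logs are written in UTC. Under that assumption, a small helper such as the hypothetical `log_time_to_bst` below lines the log entries up with the timeline. A fixed UTC+1 offset is used for BST so that the sketch does not depend on a time zone database.

```python
from datetime import datetime, timedelta, timezone

# Assumption: TCS log timestamps are in UTC. BST is UTC+1.
BST = timezone(timedelta(hours=1), "BST")


def log_time_to_bst(log_timestamp: str) -> str:
    """Convert a 'YYYY-MM-DD HH:MM:SS.mmm' log timestamp (assumed UTC) to BST."""
    utc_time = datetime.strptime(log_timestamp, "%Y-%m-%d %H:%M:%S.%f")
    utc_time = utc_time.replace(tzinfo=timezone.utc)
    return utc_time.astimezone(BST).strftime("%Y-%m-%d %H:%M %Z")


print(log_time_to_bst("2021-06-14 01:35:14.286"))  # 2021-06-14 02:35 BST
print(log_time_to_bst("2021-06-14 06:01:21.985"))  # 2021-06-14 07:01 BST
```

Read this way, the 01:35 health check failures match the 02:35 BST Slack failure notification, and the 06:01 failure falls one minute after the 07:00 BST notification.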

Action Items

Action Items	Owner
Consider setting up error logging on the ElasticSearch cluster.
Consider running AutoTune for the ElasticSearch cluster in a time period covering the daily indexing sync jobs.
Consider automatically and temporarily upscaling the ElasticSearch cluster, for the period when the jobs run, to another instance type compatible with t3.medium.elasticsearch.
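
For the first action item, error logging on an AWS-hosted ElasticSearch domain is typically enabled by publishing its application logs to a CloudWatch log group. The sketch below only builds the request for boto3's `update_elasticsearch_domain_config` call; it is an illustration under assumptions, not the team's configuration. The helper name, the domain name and the log group ARN are hypothetical placeholders, and the log group would also need a CloudWatch Logs resource policy allowing the Elasticsearch service to write to it.

```python
from typing import Any, Dict


def build_log_publishing_request(domain_name: str, log_group_arn: str) -> Dict[str, Any]:
    """Build keyword arguments for update_elasticsearch_domain_config
    that enable publishing of application (error) logs only."""
    return {
        "DomainName": domain_name,
        "LogPublishingOptions": {
            "ES_APPLICATION_LOGS": {
                "CloudWatchLogsLogGroupArn": log_group_arn,
                "Enabled": True,
            }
        },
    }


# Hypothetical placeholder values; substitute the real domain and log group.
request = build_log_publishing_request(
    "example-domain",
    "arn:aws:logs:eu-west-2:123456789012:log-group:/example/es/application-logs",
)

# Applying the change needs AWS credentials and boto3, for example:
#   import boto3
#   boto3.client("es").update_elasticsearch_domain_config(**request)
```

Keeping the request construction separate from the AWS call lets the configuration be reviewed and tested without touching the live cluster.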

Lessons Learned

ElasticSearch cluster logs would make it easier to assess the root cause of a failure.
Default ElasticSearch cluster configuration may not be entirely suitable for the indexing workload of the sync job, though it could be appropriate for general (search-heavy) usage.

2021-06-14 Person ElasticSearch sync job failed affecting Person Search
