2021-06-14 Person ElasticSearch sync job failed affecting Person Search
Date | Jun 14, 2021 |
Authors | @Reuben Roberts |
Status | LiveDefect done. Investigation incomplete due to insufficient logging |
Summary | Also see: TIS21-1667 |
Impact | Users had an inaccurate list of People on Admins-UI |
Non-technical Description
The Person ElasticSearch Sync job failed to run successfully.
Trigger
The ElasticSearch cluster that is populated by the Person ElasticSearch sync job failed during the execution of the job.
Detection
Monitoring messages posted to Slack:
Resolution
The PersonElasticSearchSync job was rerun manually. The first rerun failed; the second succeeded.
Timeline
Jun 14, 2021: 02:29 BST - Person ElasticSearch Sync Job starts
Jun 14, 2021: 02:35 BST - Slack notification reporting the failure of this job
Jun 14, 2021: 06:51 BST - Person ElasticSearch Sync Job re-triggered manually
Jun 14, 2021: 07:00 BST - Slack notification reporting the failure of this job
Jun 14, 2021: 07:33 BST - Person ElasticSearch Sync Job re-triggered manually
Jun 14, 2021: 07:44 BST - Slack notification reporting the success of this job
Root Cause(s)
One of the three ElasticSearch nodes in the cluster dropped out at 07:00 BST (and then recovered), which coincided with the failure of the Person ElasticSearch Sync Job at that time.
Another node appeared to reset at the time of the normal 02:30 BST run, although this was not reflected as a drop in the node count at that time.
In both instances, the number of searchable documents after recovery was a fraction of the correct total of ~2 million.
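A shortfall like this could be caught with a simple count check after each run. The following is a minimal sketch, not part of the existing job: the 5% tolerance, the function names, and the idea of calling Elasticsearch's `_count` endpoint from a post-job step are all assumptions, and the base URL and index name would need to be supplied by the caller.

```python
import json
import urllib.request

# Approximate correct total reported in this incident (~2 million).
EXPECTED_DOCS = 2_000_000


def count_is_plausible(actual: int, expected: int = EXPECTED_DOCS,
                       tolerance: float = 0.05) -> bool:
    """Return True when `actual` is within `tolerance` (a fraction) of `expected`."""
    return abs(actual - expected) <= expected * tolerance


def fetch_doc_count(base_url: str, index: str, timeout: float = 30.0) -> int:
    """Read the document count for `index` from the Elasticsearch _count API."""
    url = f"{base_url}/{index}/_count"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return int(json.load(resp)["count"])
```

A post-job step could call `count_is_plausible(fetch_doc_count(base_url, index))` and raise an alert when it returns `False`, so that a partially populated index is reported even if the job itself exits cleanly.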
The ElasticSearch failures are reflected in the TCS logs, as per:
BLUE: 2021-06-14 01:35:14.286 WARN 1 --- [ XNIO-2 task-6] s.b.a.e.ElasticsearchRestHealthIndicator : Elasticsearch health check failed
java.net.SocketTimeoutException: 30,000 milliseconds timeout on connection http-outgoing-14 [ACTIVE]
at org.elasticsearch.client.RestClient.extractAndWrapCause(RestClient.java:808)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:248)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:235)
at org.springframework.boot.actuate.elasticsearch.ElasticsearchRestHealthIndicator.doHealthCheck(ElasticsearchRestHealthIndicator.java:60)
GREEN: 2021-06-14 01:35:21.984 WARN 1 --- [ XNIO-2 task-17] s.b.a.e.ElasticsearchRestHealthIndicator : Elasticsearch health check failed
java.net.SocketTimeoutException: 30,000 milliseconds timeout on connection http-outgoing-14 [ACTIVE]
at org.elasticsearch.client.RestClient.extractAndWrapCause(RestClient.java:808)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:248)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:235)
at org.springframework.boot.actuate.elasticsearch.ElasticsearchRestHealthIndicator.doHealthCheck(ElasticsearchRestHealthIndicator.java:60)
at org.springframework.boot.actuate.health.AbstractHealthIndicator.health(AbstractHealthIndicator.java:82)
at org.springframework.boot.actuate.health.HealthIndicator.getHealth(HealthIndicator.java:37)
and
2021-06-14 06:01:21.985 WARN 1 --- [ XNIO-2 task-2] s.b.a.e.ElasticsearchRestHealthIndicator : Elasticsearch health check failed
java.net.SocketTimeoutException: 30,000 milliseconds timeout on connection http-outgoing-17 [ACTIVE]
at org.elasticsearch.client.RestClient.extractAndWrapCause(RestClient.java:808)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:248)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:235)
at org.springframework.boot.actuate.elasticsearch.ElasticsearchRestHealthIndicator.doHealthCheck(ElasticsearchRestHealthIndicator.java:60)
at org.springframework.boot.actuate.health.AbstractHealthIndicator.health(AbstractHealthIndicator.java:82)
at org.springframework.boot.actuate.health.HealthIndicator.getHealth(HealthIndicator.java:37)
Unfortunately there is no AWS CloudWatch logging configured for the ElasticSearch cluster. In these circumstances, it is difficult to further identify the root cause of the failure.
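One way to close this gap is to publish the cluster's error logs to CloudWatch. The sketch below is illustrative only: it assumes the cluster is an Amazon Elasticsearch Service domain managed through boto3's `es` client, and the domain name and log group ARN are placeholders to be supplied by the caller, not values from this incident.

```python
from typing import Any, Dict


def build_error_log_config(domain_name: str, log_group_arn: str) -> Dict[str, Any]:
    """Build keyword arguments for boto3's `update_elasticsearch_domain_config`.

    ES_APPLICATION_LOGS is the log type that carries Elasticsearch error logs.
    """
    return {
        "DomainName": domain_name,
        "LogPublishingOptions": {
            "ES_APPLICATION_LOGS": {
                "CloudWatchLogsLogGroupArn": log_group_arn,
                "Enabled": True,
            }
        },
    }


def enable_error_logs(es_client: Any, domain_name: str, log_group_arn: str) -> Any:
    """Apply the config using a boto3 "es" client supplied by the caller."""
    return es_client.update_elasticsearch_domain_config(
        **build_error_log_config(domain_name, log_group_arn)
    )
```

With boto3 installed and credentials configured, this would be invoked as `enable_error_logs(boto3.client("es"), domain_name, log_group_arn)`. The target log group typically also needs a CloudWatch Logs resource policy that allows the Elasticsearch service to write to it.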
Action Items
Action Items | Owner | |
---|---|---|
Consider setting up error logging on the ElasticSearch cluster. | | |
Consider running Auto-Tune for the ElasticSearch cluster over a time period that covers the daily indexing sync jobs. | | |
Consider automatically and temporarily upscaling the ElasticSearch cluster, to another instance type compatible with t3.medium.elasticsearch, for the period when the sync jobs run. | | |
Lessons Learned
ElasticSearch cluster logs would make it easier to assess the root cause of a failure.
Default ElasticSearch cluster configuration may not be entirely suitable for the indexing workload of the sync job, though it could be appropriate for general (search-heavy) usage.
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213