2021-06-14 Person ElasticSearch sync job failed affecting Person Search

Date

Jun 14, 2021

Authors

@Reuben Roberts

Status

LiveDefect done. Investigation incomplete due to insufficient logging.

Summary

Also see: TIS21-1667

Impact

Users had an inaccurate list of People on Admins-UI

Non-technical Description

The Person ElasticSearch Sync job failed to run successfully.


Trigger

  • The ElasticSearch cluster that is populated by the Person ElasticSearch sync job failed during the execution of the job.


Detection

  • Monitoring messages posted to Slack.


Resolution

  • PersonElasticSearchSync job was rerun.


Timeline

  • Jun 14, 2021: 02:29 BST - Person ElasticSearch Sync Job starts

  • Jun 14, 2021: 02:35 BST - Slack notification reporting the failure of this job

  • Jun 14, 2021: 06:51 BST - Person ElasticSearch Sync Job re-triggered manually

  • Jun 14, 2021: 07:00 BST - Slack notification reporting the failure of this job

  • Jun 14, 2021: 07:33 BST - Person ElasticSearch Sync Job re-triggered manually

  • Jun 14, 2021: 07:44 BST - Slack notification reporting the success of this job
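
The TCS log timestamps quoted under Root Cause(s) (01:35 and 06:01) line up with the 02:35 and 07:00 BST failure notifications in this timeline if the logs are assumed to be written in UTC. That is an assumption, not something the logs state. A minimal Python sketch of the conversion, where the helper name log_time_to_bst is illustrative only:

```python
from datetime import datetime, timedelta, timezone

# BST is UTC+1; a fixed offset avoids depending on a system time zone database.
BST = timezone(timedelta(hours=1), "BST")


def log_time_to_bst(log_timestamp: str) -> datetime:
    """Convert a TCS log timestamp (ASSUMED to be in UTC) to BST."""
    parsed = datetime.strptime(log_timestamp, "%Y-%m-%d %H:%M:%S.%f")
    return parsed.replace(tzinfo=timezone.utc).astimezone(BST)


# Timestamps copied from the health-check warnings in this report.
for stamp in ("2021-06-14 01:35:14.286", "2021-06-14 06:01:21.985"):
    print(stamp, "->", log_time_to_bst(stamp).strftime("%H:%M:%S %Z"))
```

Under that assumption, both health-check failures fall within a minute or so of the corresponding Slack failure notifications.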

Root Cause(s)

  • One of the three ElasticSearch nodes in the cluster dropped out at 07:00 BST and then recovered, coinciding with the failure of the manually re-triggered Person ElasticSearch Sync Job at that time.

    Another node appeared to reset at around 02:30 BST, coinciding with the normal run of the job, although the node count did not show a drop at that time.

    In both instances, the number of searchable documents after recovery was only a fraction of the correct total of ~2 million.

  • The ElasticSearch failures are reflected in the TCS logs, as per:
    BLUE:
    2021-06-14 01:35:14.286 WARN 1 --- [ XNIO-2 task-6] s.b.a.e.ElasticsearchRestHealthIndicator : Elasticsearch health check failed

    java.net.SocketTimeoutException: 30,000 milliseconds timeout on connection http-outgoing-14 [ACTIVE]
    at org.elasticsearch.client.RestClient.extractAndWrapCause(RestClient.java:808)
    at org.elasticsearch.client.RestClient.performRequest(RestClient.java:248)
    at org.elasticsearch.client.RestClient.performRequest(RestClient.java:235)
    at org.springframework.boot.actuate.elasticsearch.ElasticsearchRestHealthIndicator.doHealthCheck(ElasticsearchRestHealthIndicator.java:60)

    GREEN:
    2021-06-14 01:35:21.984 WARN 1 --- [ XNIO-2 task-17] s.b.a.e.ElasticsearchRestHealthIndicator : Elasticsearch health check failed

    java.net.SocketTimeoutException: 30,000 milliseconds timeout on connection http-outgoing-14 [ACTIVE]
    at org.elasticsearch.client.RestClient.extractAndWrapCause(RestClient.java:808)
    at org.elasticsearch.client.RestClient.performRequest(RestClient.java:248)
    at org.elasticsearch.client.RestClient.performRequest(RestClient.java:235)
    at org.springframework.boot.actuate.elasticsearch.ElasticsearchRestHealthIndicator.doHealthCheck(ElasticsearchRestHealthIndicator.java:60)
    at org.springframework.boot.actuate.health.AbstractHealthIndicator.health(AbstractHealthIndicator.java:82)
    at org.springframework.boot.actuate.health.HealthIndicator.getHealth(HealthIndicator.java:37)
    and
    2021-06-14 06:01:21.985 WARN 1 --- [ XNIO-2 task-2] s.b.a.e.ElasticsearchRestHealthIndicator : Elasticsearch health check failed

    java.net.SocketTimeoutException: 30,000 milliseconds timeout on connection http-outgoing-17 [ACTIVE]
    at org.elasticsearch.client.RestClient.extractAndWrapCause(RestClient.java:808)
    at org.elasticsearch.client.RestClient.performRequest(RestClient.java:248)
    at org.elasticsearch.client.RestClient.performRequest(RestClient.java:235)
    at org.springframework.boot.actuate.elasticsearch.ElasticsearchRestHealthIndicator.doHealthCheck(ElasticsearchRestHealthIndicator.java:60)
    at org.springframework.boot.actuate.health.AbstractHealthIndicator.health(AbstractHealthIndicator.java:82)
    at org.springframework.boot.actuate.health.HealthIndicator.getHealth(HealthIndicator.java:37)

  • Unfortunately, no AWS CloudWatch logging is configured for the ElasticSearch cluster, which makes it difficult to narrow down the root cause of the failure any further.
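
The two symptoms described above (a node missing from the three-node cluster, and a searchable-document count far below ~2 million) could be checked automatically after each sync run. The following is a minimal Python sketch, not the job's actual behaviour. It assumes the caller has already fetched the parsed JSON body of GET _cluster/health and the count field of GET <index>/_count. The names sync_problems, EXPECTED_NODES, EXPECTED_DOCS and MIN_DOC_RATIO are hypothetical, and the 95% tolerance is an illustrative value rather than a measured one.

```python
EXPECTED_NODES = 3          # cluster size stated in this report
EXPECTED_DOCS = 2_000_000   # approximate correct total stated in this report
MIN_DOC_RATIO = 0.95        # hypothetical tolerance, chosen for illustration


def sync_problems(health: dict, doc_count: int) -> list:
    """Return a list of problems; an empty list means the index looks plausible.

    health    -- parsed JSON body of GET _cluster/health
    doc_count -- the "count" field of GET <index>/_count
    """
    problems = []
    if health.get("status") == "red":
        problems.append("cluster status is red")
    nodes = health.get("number_of_nodes", 0)
    if nodes < EXPECTED_NODES:
        problems.append(f"only {nodes} of {EXPECTED_NODES} nodes present")
    if doc_count < EXPECTED_DOCS * MIN_DOC_RATIO:
        problems.append(f"{doc_count} documents, expected about {EXPECTED_DOCS}")
    return problems


# Example: one node missing and roughly a quarter of the documents searchable.
print(sync_problems({"status": "yellow", "number_of_nodes": 2}, 500_000))
```

A check like this would not explain why a node dropped out, but it would flag a partially populated index even when the node count looks normal, as it did during the 02:30 BST run.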

Action Items

  • Consider setting up error logging on the ElasticSearch cluster. (Owner: none recorded)

  • Consider running AutoTune for the ElasticSearch cluster during a time window that covers the daily indexing sync jobs. (Owner: none recorded)

  • Consider automatically and temporarily upscaling the ElasticSearch cluster, for the period when the jobs run, to another instance type compatible with t3.medium.elasticsearch. (Owner: none recorded)

Lessons Learned

  • ElasticSearch cluster logs would make it easier to assess the root cause of a failure.

  • The default ElasticSearch cluster configuration may not be entirely suitable for the indexing workload of the sync job, though it could be appropriate for general (search-heavy) usage.
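
As an illustration of the logging action item, error logs for an AWS-hosted ElasticSearch domain can be published to CloudWatch through the domain's LogPublishingOptions. The sketch below is one possible approach in Python, under stated assumptions: the client passed in is a boto3 Elasticsearch Service client (boto3.client("es")), and the domain name and log group ARN are placeholders supplied by the caller, since neither appears in this report. The function names are hypothetical.

```python
def build_log_publishing_options(log_group_arn: str) -> dict:
    """Build the LogPublishingOptions payload that enables error logs."""
    return {
        "ES_APPLICATION_LOGS": {
            "CloudWatchLogsLogGroupArn": log_group_arn,
            "Enabled": True,
        }
    }


def enable_error_logs(es_client, domain_name: str, log_group_arn: str):
    """Apply the options to a domain.

    es_client is assumed to be a boto3 Elasticsearch Service client;
    domain_name and log_group_arn are placeholders for the real values.
    """
    return es_client.update_elasticsearch_domain_config(
        DomainName=domain_name,
        LogPublishingOptions=build_log_publishing_options(log_group_arn),
    )
```

The CloudWatch log group would most likely also need a resource policy that allows the Elasticsearch service to write to it, and changing the domain configuration may trigger a configuration update on the cluster, so it would be sensible to apply this outside the nightly sync window.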