2021-02-12 Person Search Sync Failed

Date

Feb 12, 2021

Authors

@Joseph (Pepe) Kelly @Reuben Roberts

Status

Resolved

Summary

The nightly Person sync job failed, leaving the Person search index incomplete.

Impact

The Person search page was missing some trainee data between 07:35 and 08:00.

Non-technical Description

The overnight sync procedure for TIS was unable to run. This meant only some trainees were being shown on the person search page.

We fixed this by increasing the resources for the affected component and re-running the sync.


Trigger

 


Detection

  • The issue was detected when errors were noted in the ‘Sync [Person sync job] started’ Slack notification, and the ‘Sync [Person sync job] finished’ notification failed to appear.

 


Resolution

  • Created a new production Elasticsearch cluster from the Terraform description, increasing instance_type from t3.small.elasticsearch to t3.medium.elasticsearch.

  • Manually triggered the Person sync job to rebuild the Person elasticsearch index.


Timeline

  • Feb 12, 2021 - 06:20 - Noted Person sync job errors on STAGE and PROD

  • Feb 12, 2021 - 06:29 - Re-running the job (the quickest fix) did not resolve the issue on STAGE

  • Feb 12, 2021 - 07:35 - Question raised by user on Teams

  • Feb 12, 2021 - 07:40-07:50 - Rebuilt the elasticsearch cluster infrastructure as noted in ‘Resolution’ above.

  • Feb 12, 2021 - 07:51-07:58 - Manually re-ran the Sync job; trainees became visible again while it ran.

  • Feb 12, 2021 - 08:00 - Confirmed issue resolved with users

 


Root Cause(s)

  • The nightly sync job failed.

  • There were too many requests to update the index.

  • The Person Elasticsearch cluster had been reconfigured on 11 Jan 2021 with t3.small.elasticsearch instances, which were too small to handle the volume of update requests.

  • The sync service hasn’t been built to respond to “back pressure” or retry failed ‘chunks’ of data.
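
To illustrate the last root cause, here is a minimal sketch of a chunked sync that backs off and retries when the index signals overload. It is not the TIS sync service's actual code: `BackPressureError`, `send_chunk`, `sync_in_chunks` and all parameter values are hypothetical names chosen for the example, and the real service may signal overload differently (for instance via an HTTP 429 response).

```python
import time
from typing import Callable, Iterable, List, Sequence


class BackPressureError(Exception):
    """Hypothetical signal that the index rejected a chunk because it is overloaded."""


def chunked(records: Sequence[dict], size: int) -> Iterable[Sequence[dict]]:
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def sync_in_chunks(
    records: Sequence[dict],
    send_chunk: Callable[[Sequence[dict]], None],  # assumed: raises BackPressureError on overload
    chunk_size: int = 500,
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Sequence[dict]]:
    """Send records chunk by chunk; return the chunks that still failed after all retries."""
    failed: List[Sequence[dict]] = []
    for chunk in chunked(records, chunk_size):
        for attempt in range(max_attempts):
            try:
                send_chunk(chunk)
                break  # chunk accepted, move on to the next one
            except BackPressureError:
                if attempt == max_attempts - 1:
                    # Out of attempts: record the chunk instead of aborting the whole sync.
                    failed.append(chunk)
                else:
                    # Exponential backoff: base_delay, 2x, 4x, ... before the next attempt.
                    sleep(base_delay * 2 ** attempt)
    return failed
```

With this shape, a temporarily overloaded index slows the sync down rather than failing it outright, and any chunks returned in the failed list can be reported or re-sent without rebuilding the whole index.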


Action Items

Action Items

Owner

n/a

 


Lessons Learned

  • Manually running the terraform script without going through the normal pull-request and approval process for the TIS-OPS project exposes the infrastructure to a greater risk of being accidentally misconfigured.