...

The overnight sync procedure for TIS was unable to run. This meant only some trainees were being shown on the person search page.

We increased the resources for the affected component.

...

Trigger

...

...

Detection

  • The issue was detected when errors were noted in the ‘Sync [Person sync job] started’ Slack notification, and the ‘Sync [Person sync job] finished’ notification failed to appear.

...

  • Created a new production elasticsearch cluster based on the terraform description, using instance_type = t3.medium.elasticsearch instead of the incorrectly configured instance_type = t3.small.elasticsearch.

  • Manually triggered the Person sync job to rebuild the Person elasticsearch index.

...

  • 06:20 - Noted Person sync job errors on STAGE and PROD

  • 06:29 - The quickest fix (simply re-running the job) did not resolve the issue on STAGE

  • 07:35 - Question raised by user on Teams

  • 07:40-07:50 - Rebuilt the elasticsearch cluster infrastructure as noted in ‘Resolution’ above.

  • 07:51-07:58 - Manually re-ran the Sync job; trainees became visible during this time.

  • 08:00 - Confirmed issue resolved with users

...

...

Root Cause(s)

  • The nightly sync job failed.

  • There were too many requests to update the index.

  • The Person elasticsearch cluster had been incorrectly reconfigured on 11 Jan 2021 to use t3.small.elasticsearch instances, which were too small.

  • The sync service hasn’t been built to respond to “back pressure” or retry failed ‘chunks’ of data.
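
To illustrate the last root cause, the following is a minimal sketch of what responding to back pressure and retrying failed chunks could look like. It is not the TIS sync service's code: the names sync_in_chunks, send_chunk and TooManyRequests, as well as the chunk size, attempt count and delays, are all assumptions made for illustration. The idea is that a rejected chunk is retried after an exponentially growing pause, and chunks that still fail are returned to the caller rather than aborting the whole job.

```python
import time
from typing import Callable, Iterable, List, Sequence


class TooManyRequests(Exception):
    """Hypothetical signal that the index rejected a chunk (for example an HTTP 429)."""


def chunked(items: Sequence[dict], size: int) -> Iterable[List[dict]]:
    """Yield consecutive slices of `items`, each holding at most `size` entries."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def sync_in_chunks(
    items: Sequence[dict],
    send_chunk: Callable[[List[dict]], None],  # hypothetical: writes one chunk to the index
    chunk_size: int = 500,                     # assumed value, not taken from the report
    max_attempts: int = 5,                     # assumed value
    base_delay: float = 1.0,                   # seconds before the first retry (assumed)
    sleep: Callable[[float], None] = time.sleep,
) -> List[List[dict]]:
    """Send items chunk by chunk; return the chunks that failed every attempt."""
    failed: List[List[dict]] = []
    for chunk in chunked(items, chunk_size):
        for attempt in range(max_attempts):
            try:
                send_chunk(chunk)
                break  # chunk accepted, move on to the next one
            except TooManyRequests:
                # Back off before retrying: base_delay, then 2x, 4x, ...
                # No pause after the final attempt, since no retry follows it.
                if attempt + 1 < max_attempts:
                    sleep(base_delay * 2 ** attempt)
        else:
            # The loop ended without a break: every attempt was rejected.
            failed.append(chunk)
    return failed
```

Returning the failed chunks, instead of raising on the first rejection, would let a job report partial progress and re-run only what is missing. Whether that fits the real sync service is a design decision for the team.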

...

Action Items

Action Items

Owner

n/a

...