Date

Authors

Joseph (Pepe) Kelly, Reuben Roberts

Status

Resolved

Summary

Person Search Sync Failed

Impact

Person search page was not showing some data between 07:35 and 08:00

...

The overnight sync procedure for TIS was unable to run. This meant only some trainees were being shown on the person search page.

...

Trigger

...

...

Detection

  • The issue was detected when errors were noted in the ‘Sync [Person sync job] started’ Slack notification in the #monitoring-prod channel, and the ‘Sync [Person sync job] finished’ notification failed to appear.
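The detection rule described above (a ‘started’ notification with no matching ‘finished’ notification) could be checked automatically. The following is only a minimal sketch: the event tuples, the function name `find_unfinished_jobs` and the one-hour `max_runtime` are hypothetical and are not taken from the TIS monitoring setup.

```python
from datetime import timedelta


def find_unfinished_jobs(events, now, max_runtime=timedelta(hours=1)):
    """Return names of jobs that started but did not finish within max_runtime.

    `events` is an iterable of (timestamp, job_name, status) tuples, where
    status is "started" or "finished". This shape is an assumption made for
    the example, not the real notification format.
    """
    open_jobs = {}
    # Process events in time order; the sort is stable for equal timestamps.
    for timestamp, job, status in sorted(events, key=lambda e: e[0]):
        if status == "started":
            open_jobs[job] = timestamp
        elif status == "finished":
            open_jobs.pop(job, None)
    # A job is overdue if it is still open and started too long ago.
    return sorted(
        job for job, started_at in open_jobs.items()
        if now - started_at > max_runtime
    )
```

A check like this would flag the Person sync job as soon as its ‘finished’ notification is overdue, rather than relying on someone noticing the gap in the channel.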

...

...

Resolution

  • Created a new production elasticsearch cluster based on the terraform description (instance_type = t3.medium.elasticsearch instead of the incorrectly set instance_type = t3.small.elasticsearch). The sync jobs run fine when manually triggered; however, this is a temporary solution.

  • Manually triggered the Person sync job to rebuild the Person elasticsearch index.

...

Timeline

  • - 06:20 - Noted Person sync job errors on STAGE and PROD

  • - 06:29 - Quickest fix (simply re-running the job) observed not to resolve the issue on STAGE

  • - 07:35 - Question raised by user on Teams

  • - 07:40-07:50 - Rebuilt the elasticsearch cluster infrastructure, similarly to NIMDTA, as noted in ‘Resolution’ above.

  • - 07:51-07:58 - Manually re-ran the Sync job; trainees became visible during this time.

  • - 08:00 - Confirmed issue resolved with users

...

...

Root Cause(s)

  • The Person elasticsearch cluster had been incorrectly reconfigured on 11 Jan 2021 to use t3.small.elasticsearch instances. This caused the sync job to fail.
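A misconfiguration like this could in principle be caught by a simple check on the terraform text before it is applied. The sketch below is an illustration under assumptions: the helper names are hypothetical, the expected value is the one given in ‘Resolution’, and the regular expression only handles plain `instance_type = ...` assignments, not the full terraform syntax.

```python
import re

# Expected value taken from the 'Resolution' section of this report.
EXPECTED_INSTANCE_TYPE = "t3.medium.elasticsearch"


def find_instance_types(terraform_text):
    """Return every value assigned to instance_type in the given text."""
    return re.findall(r'instance_type\s*=\s*"?([\w.]+)"?', terraform_text)


def unexpected_instance_types(terraform_text, expected=EXPECTED_INSTANCE_TYPE):
    """Return the instance types that differ from the expected one."""
    return [t for t in find_instance_types(terraform_text) if t != expected]


if __name__ == "__main__":
    sample = 'instance_type = "t3.small.elasticsearch"'
    # A non-empty result means the configuration should be reviewed.
    print(unexpected_instance_types(sample))
```

Running a check of this kind as part of the pull-request process would give a reviewer an explicit signal when the instance type changes.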

...

Action Items

Action Items

Owner

n/a

...

Lessons Learned

  • Manually running the terraform script without going through the normal pull-request and approval process for the TIS-OPS project exposes the infrastructure to a greater risk of being accidentally misconfigured.