Date |
|
Authors | |
Status | Resolved |
Summary | Person Search Sync Failed |
Impact | Person search page was not showing some data between 07:35 and 08:00 |
...
The overnight sync procedure for TIS was unable to run. This meant only some trainees were being shown on the person search page.
...
Trigger
...
...
Detection
The issue was detected in the #monitoring-prod Slack channel when errors were noted in the ‘Sync [Person sync job] started’ notification, and the ‘Sync [Person sync job] finished’ notification failed to appear.
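The detection signal described above (a ‘started’ notification with no matching ‘finished’ notification) could be checked automatically. The following is a minimal sketch only: it assumes the channel messages are available as a list of plain strings, and the function name `unfinished_jobs` is hypothetical rather than part of TIS.

```python
import re

# Pattern for the notification text quoted in this report, e.g.
# "Sync [Person sync job] started". The exact Slack payload is an assumption.
NOTIFICATION = re.compile(r"Sync \[(?P<job>[^\]]+)\] (?P<event>started|finished)")


def unfinished_jobs(messages):
    """Return names of jobs that posted 'started' with no later 'finished'."""
    open_jobs = []
    for text in messages:
        match = NOTIFICATION.search(text)
        if match is None:
            continue  # unrelated channel message
        job, event = match.group("job"), match.group("event")
        if event == "started":
            if job not in open_jobs:
                open_jobs.append(job)
        elif job in open_jobs:
            # A 'finished' message closes the earlier 'started' message.
            open_jobs.remove(job)
    return open_jobs
```

Any job name returned by this function corresponds to the situation seen in this incident and could be used to raise an alert.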
...
...
Resolution
Created a new production elasticsearch cluster based on the terraform description (instance_type = t3.medium.elasticsearch instead of the incorrectly set instance_type = t3.small.elasticsearch).
Manually triggered the Person sync job to rebuild the Person elasticsearch index. The sync jobs run fine when manually triggered; however, this is a temporary solution.
...
Timeline
- 06:20 - Noted Person sync job errors on STAGE and PROD
- 06:29 - Quickest fix (simply re-running the job) observed not to resolve the issue on STAGE
- 07:35 - Question raised by user on Teams
- 07:40-07:50 - Rebuilt the elasticsearch cluster infrastructure, similarly to NIMDTA, as noted in ‘Resolution’ above.
- 07:51-07:58 - Manually re-ran the Sync job; trainees became visible during this time.
- 08:00 - Confirmed with users that the issue was resolved
...
...
Root Cause(s)
The Person elasticsearch cluster had been incorrectly reconfigured on 11 Jan 2021 to use t3.small.elasticsearch instances. This caused the sync job to fail.
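A misconfiguration of this kind could in principle be caught before it is applied. The sketch below scans terraform text for instance_type settings and reports any value outside an allowed set. It is illustrative only: the allowed set is an assumption taken from the ‘Resolution’ section, and the function name `disallowed_instance_types` is hypothetical, not an existing TIS-OPS tool.

```python
import re

# Assumption: the only instance type intended for this cluster is the one
# named in the Resolution section of this report.
ALLOWED_INSTANCE_TYPES = {"t3.medium.elasticsearch"}

# Matches lines such as: instance_type = "t3.small.elasticsearch"
# (quotes optional, since the report shows the value unquoted).
INSTANCE_TYPE = re.compile(
    r'^\s*instance_type\s*=\s*"?([\w.]+)"?\s*$', re.MULTILINE
)


def disallowed_instance_types(terraform_text):
    """Return instance types found in the text that are not in the allowed set."""
    found = INSTANCE_TYPE.findall(terraform_text)
    return [value for value in found if value not in ALLOWED_INSTANCE_TYPES]
```

Run against the terraform description as part of review, a non-empty result would flag the change for a second look before it reaches production.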
...
Action Items
Action Items | Owner |
---|---|
n/a | |
...
Lessons Learned
Manually running the terraform script without going through the normal pull-request and approval process for the TIS-OPS project exposes the infrastructure to a greater risk of being accidentally misconfigured.