Date |
|
Authors | |
Status | Resolved / In Progress |
Summary | ESR sent loads of files and it looks like we haven’t captured everything from |
Impact |
...
Non-technical Description
Trigger
Detection
The issue was detected when errors were noted in the ‘Sync [Person sync job] started’ Slack notification, and the ‘Sync [Person sync job] finished’ notification failed to appear.
...
Resolution
Created a new production elasticsearch cluster based on the terraform description (instance_type = t3.medium.elasticsearch, up from instance_type = t3.small.elasticsearch).
Manually triggered the Person sync job to rebuild the Person elasticsearch index.
Timeline
- 06:20 - Noted Person sync job errors on STAGE and PROD
- 06:29 - Quickest fix (simply re-running the job) observed not to resolve the issue on STAGE
- 07:35 - Question raised by user on Teams
- 07:40-07:50 - Rebuilt the elasticsearch cluster infrastructure as noted in ‘Resolution’ above.
- 07:51-07:58 - Manually re-ran the Sync job; trainees becoming visible during this time.
- 08:00 - Confirmed issue resolved with users
...
Root Cause(s)
The nightly sync job failed.
There were too many requests to update the index.
The Person elasticsearch cluster had been reconfigured on 11 Jan 2021 with t3.small.elasticsearch instances that were too small.
The sync service hasn’t been built to respond to “back pressure” or retry failed ‘chunks’ of data.
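As an illustration of the last point, the sketch below shows one way a sync job could respond to back pressure: send records in chunks, back off exponentially when a chunk is rejected, and report any chunks that still fail so they can be re-run. It is a minimal sketch, not the sync service's actual code. The names `ChunkRejected`, `send_chunk` and `sync_with_retry`, the chunk size, and the delay values are all hypothetical.

```python
import time
from typing import Callable, Iterable, List, Sequence, TypeVar

T = TypeVar("T")


class ChunkRejected(Exception):
    """Hypothetical error: the index refused a chunk because it was overloaded."""


def chunked(items: Sequence[T], size: int) -> Iterable[Sequence[T]]:
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def sync_with_retry(
    items: Sequence[T],
    send_chunk: Callable[[Sequence[T]], None],  # hypothetical sender, e.g. a bulk index call
    chunk_size: int = 500,
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Sequence[T]]:
    """Send items chunk by chunk; return the chunks that never succeeded."""
    failed: List[Sequence[T]] = []
    for chunk in chunked(items, chunk_size):
        for attempt in range(max_attempts):
            try:
                send_chunk(chunk)
                break  # chunk accepted, move on to the next one
            except ChunkRejected:
                if attempt == max_attempts - 1:
                    # Out of attempts: record the chunk instead of aborting the job.
                    failed.append(chunk)
                else:
                    # Back off before retrying: base_delay, 2x, 4x, ...
                    sleep(base_delay * 2 ** attempt)
    return failed
```

A caller would pass its own sender, for example `failed = sync_with_retry(people, send_chunk=index_people)` where `index_people` is a hypothetical function that raises `ChunkRejected` when the cluster rejects the request. A non-empty result could then be reported in the ‘finished’ notification rather than the job stopping silently.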
ESR sent through a number of FULL FILES on 1 March that did not load, or did not fully load and reconcile, so applicants for St Helens & Knowsley Trust could not subsequently be sent against them. This delayed sending Applicant Files to ESR.
...
Trigger
An unexpectedly large number of FULL FILES (RMF) overloaded the TIS-ESR interface.
...
Detection
An issue regarding missing applicants was raised on Teams by Liam Lofthouse (NWM Data Lead).
...
Resolution
...
Timeline
- 17:28-17:38 - 7 RMF files received (EMD, KSS, EOE, MER, OXF, NWN, LDN)
- 17:45 - MongoDB down
- 18:09 - MongoDB manually restarted
- 18:10 - MongoDB up
- 19:10 - MongoDB down
- 19:20 - MongoDB up
- 21:36 - MongoDB down
- 22:31 - MongoDB up
- 15:19 - Live defect https://hee-tis.atlassian.net/browse/TIS21-1265 created
- 16:57 - WMD RMF received
- 18:30 - WMD RMF received
- 15:21 - Query raised on Teams about unreceived data
...
Root Cause(s)
...
Action Items
Action Items | Owner |
---|---|
n/a |
...
Lessons Learned
...