Date	03 Mar 2021
Authors	Joseph (Pepe) Kelly, Ashley Ransoo, Andy Dingley
Status	In Progress
Summary	ESR sent loads of files and it looks like we haven’t captured everything from
Impact

Non-technical Description

Trigger

Detection

The issue was detected when errors were noted in the ‘Sync [Person sync job] started’ Slack notification, and the ‘Sync [Person sync job] finished’ notification failed to appear.

...

Resolution

Created a new production Elasticsearch cluster based on the Terraform description, with instance_type raised from t3.small.elasticsearch to t3.medium.elasticsearch.
Manually triggered the Person sync job to rebuild the Person Elasticsearch index.

Timeline

12 Feb 2021 - 06:20 - Noted Person sync job errors on STAGE and PROD
12 Feb 2021 - 06:29 - Quickest fix (simply re-running the job) observed not to resolve the issue on STAGE
12 Feb 2021 - 07:35 - Question raised by user on Teams
12 Feb 2021 - 07:40-07:50 - Rebuilt the Elasticsearch cluster infrastructure as noted in ‘Resolution’ above.
12 Feb 2021 - 07:51-07:58 - Manually re-ran the Sync job; trainees becoming visible during this time.
12 Feb 2021 - 08:00 - Confirmed issue resolved with users

...

Root Cause(s)

The nightly sync job failed.
There were too many requests to update the index.
The Person Elasticsearch cluster had been reconfigured on 11 Jan 2021 with t3.small.elasticsearch instances that were too small.
The sync service hasn’t been built to respond to “back pressure” or retry failed ‘chunks’ of data.
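
The last root cause can be illustrated with a minimal sketch of chunked syncing that backs off and retries when the index pushes back. This is not the sync service's actual code: the names `sync_in_chunks`, `send_chunk` and `TooManyRequests` are hypothetical, and the chunk size, attempt limit and delays are assumed values chosen only for illustration.

```python
import time
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


class TooManyRequests(Exception):
    """Hypothetical error: the index rejected a bulk request (for example HTTP 429)."""


def chunked(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def sync_in_chunks(
    items: Sequence[T],
    send_chunk: Callable[[Sequence[T]], None],
    chunk_size: int = 500,      # assumed value
    max_attempts: int = 5,      # assumed value
    base_delay: float = 1.0,    # seconds, assumed value
    sleep: Callable[[float], None] = time.sleep,
) -> List[Sequence[T]]:
    """Send items chunk by chunk; return the chunks that still failed."""
    failed: List[Sequence[T]] = []
    for chunk in chunked(items, chunk_size):
        for attempt in range(max_attempts):
            try:
                send_chunk(chunk)
                break  # chunk accepted, move on to the next one
            except TooManyRequests:
                # Back pressure: wait longer after each rejection
                # (1s, 2s, 4s, ...) before retrying the same chunk.
                if attempt < max_attempts - 1:
                    sleep(base_delay * 2 ** attempt)
        else:
            # Every attempt was rejected: record the chunk instead of
            # aborting the whole job, so it can be retried or reported.
            failed.append(chunk)
    return failed
```

With this shape, a rejected chunk slows the job down rather than failing it, and any chunks returned by `sync_in_chunks` can be logged or re-queued instead of being silently lost.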

On 1 Mar 2021 ESR sent through a number of FULL FILES. These either did not load, or did not fully load and reconcile, so applicants for St Helens & Knowsley Trust could not subsequently be sent against them. This delayed sending Applicant Files to ESR.

...

Trigger

An unexpected number of FULL FILES (RMF) overloaded the TIS-ESR interface.

...

Detection

Liam Lofthouse (NWM Data Lead) raised an issue on Teams regarding missing applicants.

...

Resolution

...

Timeline

01 Mar 2021 - 17:28-17:38 - 7 RMF files received (EMD, KSS, EOE, MER, OXF, NWN, LDN)
01 Mar 2021 - 17:45 - MongoDB down
01 Mar 2021 - 18:09 - MongoDB manually restarted
01 Mar 2021 - 18:10 - MongoDB up
01 Mar 2021 - 19:10 - MongoDB down
01 Mar 2021 - 19:20 - MongoDB up
01 Mar 2021 - 21:36 - MongoDB down
01 Mar 2021 - 22:31 - MongoDB up
02 Mar 2021 - 15:19 - Live defect https://hee-tis.atlassian.net/browse/TIS21-1265 created
02 Mar 2021 - 16:57 - WMD RMF received
02 Mar 2021 - 18:30 - WMD RMF received
03 Mar 2021 - 15:21 - Query raised on Teams about unreceived data
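
As a rough aid for sizing the impact, the MongoDB down/up entries in the timeline above can be totalled. The sketch below assumes each "down" entry pairs with the next "up" entry and that the logged times are accurate to the minute; the helper name `outage_minutes` is hypothetical.

```python
from datetime import datetime

# (down, up) pairs copied from the timeline; all on 01 Mar 2021.
OUTAGES = [("17:45", "18:10"), ("19:10", "19:20"), ("21:36", "22:31")]


def outage_minutes(down: str, up: str) -> int:
    """Return the whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(up, fmt) - datetime.strptime(down, fmt)
    return int(delta.total_seconds() // 60)


durations = [outage_minutes(down, up) for down, up in OUTAGES]
total = sum(durations)
print(durations, total)  # [25, 10, 55] 90
```

On these assumptions MongoDB was unavailable for about 90 minutes in total across three outages on the evening of 01 Mar 2021.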

...

Root Cause(s)

...

Action Items

Action Items	Owner
n/a

...

Lessons Learned

...

Versions Compared

Old Version 1

New Version 2

