Date

Authors

Joseph (Pepe) Kelly, Ashley Ransoo, Andy Dingley

Status

Resolved / In Progress

Summary

ESR sent loads of files and it looks like we haven’t captured everything from

Impact

...

Non-technical Description

Trigger

Detection

  • The issue was detected when errors were noted in the ‘Sync [Person sync job] started’ Slack notification, and the ‘Sync [Person sync job] finished’ notification failed to appear.

...

Resolution

  • Created a new production elasticsearch cluster from the terraform description, with instance_type increased from t3.small.elasticsearch to t3.medium.elasticsearch.

  • Manually triggered the Person sync job to rebuild the Person elasticsearch index.

Timeline

  • - 06:20 - Noted Person sync job errors on STAGE and PROD

  • - 06:29 - Quickest fix (simply re-running the job) observed not to resolve the issue on STAGE

  • - 07:35 - Question raised by user on Teams

  • - 07:40-07:50 - Rebuilt the elasticsearch cluster infrastructure as noted in ‘Resolution’ above.

  • - 07:51-07:58 - Manually re-ran the Sync job; trainees becoming visible during this time.

  • - 08:00 - Confirmed issue resolved with users

...

Root Cause(s)

  • The nightly sync job failed.

  • There were too many requests to update the index.

  • The Person elasticsearch cluster had been reconfigured on 11 Jan 2021 with t3.small.elasticsearch instances that were too small.

  • The sync service hasn’t been built to respond to “back pressure” or retry failed ‘chunks’ of data.
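The last root cause above (no response to back pressure, no retry of failed chunks) can be illustrated with a minimal sketch. It is not the sync service's actual code: `send_chunk`, `ChunkRejected`, `sync_in_chunks` and all default values are hypothetical names and numbers chosen for illustration, and the sender is assumed to raise `ChunkRejected` when the index rejects a request because it is overloaded.

```python
import time
from typing import Callable, Iterable, List, Sequence


class ChunkRejected(Exception):
    """Hypothetical error: the index pushed back on a chunk (e.g. too many requests)."""


def chunked(items: Sequence[dict], size: int) -> Iterable[Sequence[dict]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def sync_in_chunks(
    items: Sequence[dict],
    send_chunk: Callable[[Sequence[dict]], None],  # assumed sender, not a real API
    chunk_size: int = 500,
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Sequence[dict]]:
    """Send items chunk by chunk, backing off and retrying rejected chunks.

    Returns the chunks that still failed after `max_attempts` tries, so the
    caller can report them or re-run them instead of losing them silently.
    """
    failed: List[Sequence[dict]] = []
    for chunk in chunked(items, chunk_size):
        for attempt in range(max_attempts):
            try:
                send_chunk(chunk)
                break  # chunk accepted, move on to the next one
            except ChunkRejected:
                # Back pressure: wait exponentially longer before retrying,
                # but do not sleep after the final attempt.
                if attempt < max_attempts - 1:
                    sleep(base_delay * 2 ** attempt)
        else:
            # Loop finished without a successful send.
            failed.append(chunk)
    return failed
```

With a sender that wraps the real index update call, a non-empty return value would be the signal to alert (for example in the Slack notification) rather than leaving the index partially rebuilt.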

On 1 March, ESR sent through a number of FULL FILES that either did not load or did not fully load and reconcile. As a result, applicants for St Helens & Knowsley Trust could not subsequently be sent against them, which delayed sending Applicant Files to ESR.

...

Trigger

An unexpectedly large number of FULL FILES (RMF) overloaded the TIS-ESR interface.

...

Detection

  • Liam Lofthouse (NWM Data Lead) raised an issue on Teams regarding missing applicants.

...

Resolution

...

Timeline

  • - 17:28-17:38 - 7 RMF files received (EMD, KSS, EOE, MER, OXF, NWN, LDN)

  • - 17:45 - MongoDB down

  • - 18:09 - MongoDB manually restarted

  • - 18:10 - MongoDB up

  • - 19:10 - MongoDB down

  • - 19:20 - MongoDB up

  • - 21:36 - MongoDB down

  • - 22:31 - MongoDB up

  • - 15:19 - Live defect https://hee-tis.atlassian.net/browse/TIS21-1265 created

  • - 16:57 - WMD RMF received

  • - 18:30 - WMD RMF received

  • - 15:21 - Query raised on Teams about unreceived data

...

Root Cause(s)

...

Action Items

Action Items

Owner

n/a

...

Lessons Learned

...