2020-10-03 ESR New world switchover incident

Date

Oct 3, 2020

Authors

@Andy Nash

Status

In progress

Summary

On Friday evening we went ahead with the ESR New world switchover.

Later in the evening it became clear that all was not well, owing to the snagging issues described in the timeline below.

Impact

ETLs compromised, Jenkins down, Metabase down, Sentry overloaded, Data exporter down…

 

Timeline

Each entry below records the time, the specifics of the problem(s) encountered, the resolution(s), and any mitigation(s).


Fri 14:46

Problem: Confirmed that we did get RMF files from ESR (yesterday) and that we likely didn't process them.
https://hee-nhs-tis.slack.com/archives/CBKRLAWMD/p1601646419014000

Fri 15:40

Problem: A Jenkins pipeline was running that would have got in the way of the evening ESR ETLs.

Resolution: Stopped the pipeline.

Fri 15:41

Problem: Jenkins went down. Initial investigation indicated that some recent commits had probably kicked off ESR builds, which spun up almost 150 Docker containers for Localstack. These ate up all the memory and killed Jenkins again. (Exporter? Rabbit containers? 57 containers were reduced to 30 some time ago, but 30 is still a lot. It looks like it got stuck yesterday.)

Resolution: Stopped the containers: https://hee-nhs-tis.slack.com/files/U4AQK274J/F01BS8SJWHK/untitled
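For future reference, a minimal sketch of that kind of bulk container stop, assuming the runaway containers can be identified by a "localstack" name filter (the filter and the use of the Docker SDK here are assumptions, not a record of what was actually run):

```python
# Sketch only: stop and remove every container whose name matches "localstack".
# The name filter is an assumption about how the runaway containers were named.
import docker

client = docker.from_env()

suspects = client.containers.list(all=True, filters={"name": "localstack"})
print(f"Found {len(suspects)} matching containers")

for container in suspects:
    container.stop(timeout=10)  # graceful stop, kill after 10 seconds
    container.remove()          # remove the stopped container
```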

 

 

Problem: Metabase was taken down as part of stopping the containers.

Resolution: Brought back up.

 

Fri 16:20

Problem: Metabase unavailable; possibly something to do with the recent password change?

 

 

 

Problem: A whole stack of Sentry-ESR errors/warnings (close to maxing out the Sentry allocation again).
https://hee-nhs-tis.slack.com/archives/C016ZMPHFJ6/p1601661310000100

 

 

Fri 19:19

Problem: Invalid ESR-provided files identified.

 

 

Fri 22:05

Problem: Data exporter very broken… (only test cases).
https://hee-nhs-tis.slack.com/archives/GHFSS7ANT/p1601672719106700

Resolution: Indexes were created manually in the AWS environment.

Mitigation: We might need to recreate all indexes in the AWS environment to avoid this scenario happening again.
https://hee-nhs-tis.slack.com/archives/GHFSS7ANT/p1601675226108500
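A minimal sketch of scripting that index creation so it can be replayed in any environment, assuming a MongoDB-backed store; the connection string, database, collection, and field names below are hypothetical placeholders rather than the real schema:

```python
# Sketch only: recreate the indexes via a script instead of by hand.
# Connection string, database, collection and field names are placeholders.
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")
db = client["exporter"]  # hypothetical database name

# create_index is a no-op if an identical index already exists,
# so the script can be re-run safely against any environment.
db["auditEvents"].create_index([("timestamp", ASCENDING)])
db["exportRecords"].create_index([("positionId", ASCENDING)])
```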

Fri 23:55; Sat 01:48, 01:53, 02:03, 02:13

Problem: AWS monitor service sent "non-operational" Slack notifications for the MongoDB replica set. MongoDB kept going down while processing notification messages, due to high CPU utilisation.

Resolution: Had to restart the EC2 instance (00:28).

Mitigation: Increase RAM? Create the cluster across separate instances rather than keeping it on one EC2 instance?
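A minimal health-check sketch along the lines the AWS monitor alerts suggest, assuming direct access to the replica set (the connection string is a placeholder):

```python
# Sketch only: report whether every replica-set member is PRIMARY or SECONDARY.
from pymongo import MongoClient
from pymongo.errors import PyMongoError

def replica_set_healthy(uri: str = "mongodb://localhost:27017") -> bool:
    try:
        client = MongoClient(uri, serverSelectionTimeoutMS=5000)
        status = client.admin.command("replSetGetStatus")
        # State 1 = PRIMARY, 2 = SECONDARY; anything else counts as unhealthy.
        return all(member["state"] in (1, 2) for member in status["members"])
    except PyMongoError:
        return False

if __name__ == "__main__":
    print("replica set healthy:", replica_set_healthy())
```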

Sat 02:41

Problem: The notification service Docker container had been removed by a Jenkins job because it was not running.
Reason: https://build.tis.nhs.uk/jenkins/job/docker-clean-prod/

Resolution: Re-ran the notification service build pipeline to deploy the container again in prod. John removed the schedule for this job.

Mitigation: This is the 3rd time this job has caught us out recently. Should we reconsider its usefulness?
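If the clean-up job is kept, one gentler option (a sketch under the assumption that its purpose is simply to reclaim space) is to prune only containers that have been stopped for some time, rather than removing anything that is not currently running:

```python
# Sketch only: prune containers that have been stopped for more than 24 hours,
# instead of deleting any container that happens not to be running right now.
import docker

client = docker.from_env()
result = client.containers.prune(filters={"until": "24h"})
print("Space reclaimed:", result.get("SpaceReclaimed", 0), "bytes")
```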

Sat 02:44

Problem: Notification generation was failing because of overly stringent validation on the host/lead field.

Resolution: Removed the need to supply host/lead in the notification and re-ran the loading process successfully.
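A hypothetical sketch of the shape of that validation change (not the actual notification service code): treat host/lead as optional instead of rejecting notifications that omit it.

```python
# Sketch only: host/lead becomes optional; names and fields are hypothetical.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Notification:
    recipient: str
    message: str
    host_lead: Optional[str] = None  # previously mandatory, now optional

def validate(notification: Notification) -> List[str]:
    errors = []
    if not notification.recipient:
        errors.append("recipient is required")
    if not notification.message:
        errors.append("message is required")
    # host_lead is intentionally no longer checked here
    return errors
```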

 

Root Cause(s)

  • Two main issues:

    • Snagging issues around data formatting and the ESR spec

    • Running the full data volume put some strain on the MongoDB instance backing the audit service

Trigger

  • The initial large volume of processing highlighted:

    • A large number of rejections

    • Initial strain on MongoDB

Resolution

  • Tweaked the code to match the format acceptable to ESR

  • A MongoDB restart was enough to continue processing

Detection

  • Detected via uptime monitors and Sentry

Action Items


  • Investigate moving MongoDB to a more performant platform

 

 

 

 

Lessons Learned

  • More of a reminder that the NHS098 is in a terrible state