2020-10-03 ESR New world switchover incident
Date | Oct 3, 2020 |
Authors | @Andy Nash (Unlicensed) |
Status | In progress |
Summary | On Friday evening we went ahead with the ESR New world switchover. Later in the evening it became clear that all was not well; related to the snagging issues described below. |
Impact | ETLs compromised, Jenkins down, Metabase down, Sentry overloaded, Data exporter down… |
Timeline
| Time | Specifics of problem(s) encountered | Resolution(s) | Mitigation(s) |
|---|---|---|---|
| Fri 14:46 | Confirmed that we did receive RMF files from ESR (yesterday) and likely did not process them. | | |
| 15:40 | Jenkins pipeline running - would have got in the way of the evening ESR ETLs. | Stopped the pipeline. | |
| 15:41 | Jenkins went down. | Stopped containers: https://hee-nhs-tis.slack.com/files/U4AQK274J/F01BS8SJWHK/untitled | |
| | Metabase was taken down as part of this. | Brought back up. | |
| 16:20 | Metabase unavailable - possibly related to the recent password change? | | |
| | Whole stack of ESR errors/warnings in Sentry (close to maxing out the Sentry allocation again). | | |
| 19:19 | Invalid ESR-provided files identified. | | |
| 22:05 | Data exporter very broken… (only test cases). | Indexes were created manually in the AWS environment. | Might need to recreate all indexes in the AWS environment to avoid this happening again (see the index sketch after this table). |
| 23:55 | AWS monitor service sent a non-operational Slack notification: the MongoDB replica set kept going down while processing notification messages, due to high CPU utilisation. | Had to restart the EC2 instance (00:28). | Increase RAM? |
| 02:41 | Notification service Docker container had been removed by a Jenkins job because it was not running. Reason: https://build.tis.nhs.uk/jenkins/job/docker-clean-prod/ | Re-ran the notification service build pipeline to deploy the container again in prod. John removed the schedule of this job. | Third time this has caught us out recently. |
| 02:44 | Notification generation failing because of too-stringent validation on the host/lead field. | Removed the need to supply host/lead in the notification and re-ran the loading process successfully (see the validation sketch after this table). | |
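
For the 22:05 mitigation above (recreating all expected indexes in the AWS environment up front, rather than adding them by hand mid-incident), a minimal sketch using pymongo is shown below. The connection string, database, collection and field names are illustrative assumptions, not the data exporter's actual schema.

```python
"""Hypothetical sketch: ensure the indexes the data exporter relies on exist.

The connection string, database, collection and field names below are
illustrative assumptions, not the exporter's real schema.
"""
from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")  # swap in the AWS connection string
db = client["exporter"]  # assumed database name

# (collection, field) pairs the exporter is assumed to query on
EXPECTED_INDEXES = [
    ("auditEvents", "timestamp"),
    ("positions", "positionId"),
]

for collection_name, field in EXPECTED_INDEXES:
    # create_index is idempotent: re-running against an existing index is a no-op
    index_name = db[collection_name].create_index([(field, ASCENDING)])
    print(f"ensured index {index_name} on {collection_name}")
```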
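For the 02:44 fix (relaxing the too-stringent host/lead validation), the sketch below shows the general shape of such a change, treating host/lead as optional rather than mandatory. The class and field names are hypothetical and not taken from the notification service's code.

```python
"""Hypothetical sketch of relaxing notification validation so that a missing
host/lead value no longer blocks notification generation.  Names are
illustrative only."""
from dataclasses import dataclass
from typing import Optional


@dataclass
class Notification:
    recipient: str
    message: str
    host_lead: Optional[str] = None  # previously required; now optional


def validate(notification: Notification) -> None:
    """Raise ValueError if a mandatory field is missing.

    host_lead is intentionally no longer checked, so generation does not
    fail when the source data does not supply it.
    """
    if not notification.recipient:
        raise ValueError("recipient is required")
    if not notification.message:
        raise ValueError("message is required")


# A notification without a host/lead now passes validation.
validate(Notification(recipient="trainee@example.com", message="Placement update"))
```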
Root Cause(s)
Two main issues:
Snagging issues around data formatting and the ESR spec
Running the full data volume led to some strain on MongoDB for the audit service
Trigger
The initial large volume of processing highlighted:
A large number of rejections
Initial strain on MongoDB
Resolution
Tweaked code to match the format ESR accepts (illustrative sketch below)
A MongoDB restart was enough for processing to continue
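
To illustrate the kind of tweak referred to above, here is a hypothetical sketch of normalising an outbound date field before it is written to an ESR file. The field and the target format (DDMMYYYY) are assumptions made for illustration, not taken from the ESR spec.

```python
"""Hypothetical sketch: normalise a date field into the layout ESR accepts.

The DDMMYYYY target format is an assumption for illustration, not the
actual ESR specification.
"""
from datetime import date


def format_esr_date(value: date) -> str:
    # Assumed target layout; substitute whatever the ESR spec actually requires.
    return value.strftime("%d%m%Y")


assert format_esr_date(date(2020, 10, 3)) == "03102020"
```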
Detection
Detected via uptime monitors and Sentry
Action Items
Action Items | Owner |
---|---|
Lessons Learned
More of a reminder that the NHS098 is in a terrible state
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213