2018-10-17 ESR ETL failed
Date | 2018-10-17 |
Authors | Ashley Ransoo, Jayanta Saha, Oladimeji Onalaja, Chris Mills |
Status | Issue resolved. Steps to stop it happening again are still to be decided. |
Summary | The ESR ETL that runs at 02.00 AM each morning failed. Every job that ran after it also errored. No Applicant export files (placement information for new trainees) were sent on 17/10. The TIS team could not run the esr-parametrized-etl successfully until after 17.30, once Chris had made a fix, so a downstream process was compromised for that day. |
Impact | Notifications (changes to placements) were not produced by the 2:00 AM job, and no APP files (applicant export files) were produced by the 3.30 PM job on 17/10. |
Jira reference
- TISNEW-2109
Impact
- The job that was meant to run at 02:00 produces the notifications (changes to placements that have taken place during the day, e.g. withdrawal of trainees, rotations etc.).
- The job that was supposed to run at 17:30 would have missed those notifications if we had not run them manually before then. However, the fix was not applied in time for the earlier 15:30 job, which failed to produce the Applicant Export Files for the day.
Details of the scheduled jobs are here: ESR Schedules.
Root Causes
- Docker container failure caused by old versions of Docker having issues with zombie containers. The Docker internal network was also being used for a large volume of network traffic.
Trigger
- .
Resolution
- Restarted docker service
- Restarted ESR
- Chris identified the correct ETLs to run manually for the Notifications Load (around 17.38), in time for the next automatic job at 18:00. This was too late for the APP files produced by the 15.30 job, which were missed.
- The ETL ran OK on subsequent jobs for the various load types, through to the 2:00 job on the morning after the initial problem.
Detection / Timeline
- 02.00 Wed: ESR ETL failed.
- 02.01 Wed: Notification of failure in the #esr_operations Slack channel.
- 09.20 Wed: Ashley alerted the whole team to the issue.
- 09.30 Wed: TIS team Sprint Day - Sprint Review began (whole TIS team involved).
- 13.00 Wed: Jay offered to re-run the ETL.
- 13.03 Wed: Ashley encouraged caution unless it was clear which job ran the notifications load.
- 14.26 Wed: Jay ran the tis-esr-parameterized ETL. The team didn't feel there was any way to know whether this was the correct fix until the 17.30 job ran. [Note: this is a big concern]
- 14.49 Wed: Ola asked for help; we restarted the Docker service and assumed this would fix the problem.
- 17.22 Wed: Jay asked for Chris' help (out of hours) after seeing errors in the Docker log ("Application startup failed...") in the ESR app, not the ESR ETL app.
- 17.37 Wed: Chris restarted the ESR application and fixed the problem (7 minutes late for the 17.30 job).
- 17.37 Wed: Chris ran the job manually.
Action Items
- Check that the ESR ETL documentation is up to date; the assumption is that it is not.
- If the ESR ETL fails it needs to be clear how to diagnose possible causes and what to do to correct it, before the 17.30 job runs.
- Document how and when to run tis-esr-etl-cron and tis-esr-parametrized, and what each parameter does, so that anyone with access can confidently re-run a failed ESR job in future. The documentation should correlate with the business requirements in ESR Schedules.
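As a starting point for the diagnosis step, the check that was done by eye during this incident (looking for "Application startup failed" in the Docker log of the ESR app) could be sketched roughly as below. This is only an illustration: the container name `esr` in the usage comment is hypothetical, and the marker string is the one error quoted in the timeline, not a complete list of failure signatures.

```python
import subprocess
from typing import List

# Error text seen in the ESR app's Docker log during this incident.
FAILURE_MARKER = "Application startup failed"


def find_startup_failures(log_text: str) -> List[str]:
    """Return the log lines that contain the startup-failure marker."""
    return [line for line in log_text.splitlines() if FAILURE_MARKER in line]


def fetch_container_log(container: str, tail: int = 200) -> str:
    """Read the last `tail` log lines of a container via the docker CLI."""
    result = subprocess.run(
        ["docker", "logs", "--tail", str(tail), container],
        capture_output=True,
        text=True,
        check=True,
    )
    # Containers may write to stdout or stderr, so return both streams.
    return result.stdout + result.stderr


# Example use on the Docker host ("esr" is a hypothetical container name):
#   failures = find_startup_failures(fetch_container_log("esr"))
#   if failures:
#       print("ESR app failed to start:", *failures, sep="\n")
```

Keeping the log scan separate from the `docker logs` call means the scan can be tried against a saved log file without access to the Docker host.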
Lessons Learned
- Don't be afraid to ask for help, especially when you know you are panicking (which everyone does, especially when they feel they have, or should have, the skills to fix the problem themselves, but can't seem to as quickly as they anticipate).
- Communicate (who might have some insight into the problem?).
- Don't assume things need to be fixed out of hours. They don't.
- #esr_operations channel to be monitored by the wider team
What went well
What went wrong
- The Dev that took the initial action was uncertain whether the action they took would be successful.
- They took the action without any way to check it was the correct action until it was too late.
- There was no communication with the ESR team to alert them to the problem.
- Developers had no idea how to debug simple Docker problems.
- There was confusion around which application does what: ESR versus ESR ETL.
Where we got lucky
- No one in the business has yet flagged the problem.
- We were in time to send the notifications made on 16/10. Had we missed them, more manual steps would have been required to bring the data on ESR back in sync.
- APP files missed on 17/10 would be picked up automatically by the following day's jobs, so no further manual intervention was needed.
Supporting information
- Slack conversation in: #esr_operations
The tis-esr-etl-cron job runs automatically at the following times daily: 02.00, 03.30, 17.00, 17.30 and 20.00.
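When deciding whether a manual re-run can finish before the next automatic run, it helps to know how long is left. The sketch below works that out from the tis-esr-etl-cron times listed above only; other job times mentioned elsewhere in this report are not included, and the function names are illustrative rather than part of any existing TIS tooling.

```python
from datetime import datetime, time, timedelta

# Daily run times of tis-esr-etl-cron, as listed above.
SCHEDULE = [time(2, 0), time(3, 30), time(17, 0), time(17, 30), time(20, 0)]


def next_scheduled_run(now: datetime) -> datetime:
    """Return the first scheduled run strictly after `now`."""
    for run_time in SCHEDULE:
        candidate = datetime.combine(now.date(), run_time)
        if candidate > now:
            return candidate
    # Past the last run of the day: roll over to the first run tomorrow.
    return datetime.combine(now.date() + timedelta(days=1), SCHEDULE[0])


def minutes_until_next_run(now: datetime) -> int:
    """Whole minutes remaining before the next scheduled run."""
    return int((next_scheduled_run(now) - now).total_seconds() // 60)
```

For example, at 14.26 on the day of the incident, `next_scheduled_run(datetime(2018, 10, 17, 14, 26))` gives the 17.00 run, leaving 154 minutes.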
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213