...
Date | |
Authors | Ashley Ransoo, Jayanta Saha, Oladimeji Onalaja, Chris Mills |
Status | Issue resolved. Steps to mitigate it happening again are still to be agreed. |
Summary | The ESR ETL that runs at 02.00 each morning failed. Any jobs that ran thereafter were also found to be erroring. No Applicant export files (placement information for new trainees) were sent on 17/10. The TIS team were unable to run the esr-parametrized-etl successfully until after 17.30, once Chris had made a fix, meaning a downstream process was compromised for that day. |
Impact | During ARCP reporting season, local offices are unable to run extracts needing to be sent to the GMC. Notifications (changes to placements) were not produced by the 02.00 job and no APP (applicant export) files were produced by the 15.30 job on 17/10. |
Table of Contents |
---|
Jira reference
...
- The job that was meant to run at 02:00 produces the notifications (changes to placements that have taken place during the day, e.g. withdrawal of trainees, rotations, etc.).
- The job that was supposed to run at 17:30 would have missed those notifications had we not run them manually before then. However, the fix was not applied in time for the earlier job at 15:30, which failed to produce the Applicant Export Files for the day.
Details of the scheduled jobs are here: ESR Schedules.
Root Causes
- Chris, do you know? ESR_ETL has not changed since the end of July and all the jobs have run successfully since then, so why was a restart required?
Trigger
- .
Resolution
- Chris made a fix and restarted the application. Details of the fix are still needed.
- Chris identified the correct ETLs to run and ran them manually for the Notifications Load, in time (around 17.38) for the next automatic job at 18:00. However, this was not in time for the secondary job, as the APP files produced by the 15.30 job were missed.
- The ETL ran OK on subsequent jobs for the various load types, through to the 02.00 job on the morning following the initial problem.
...
- 02.00 Wed: ESR ETL failed.
- 02.01 Wed: Notification of failure in the #esr_operations Slack channel.
- 09.20 Wed: Ashley alerted the whole team to the issue.
- 09.30 Wed: TIS team Sprint Day - Sprint Review began (whole TIS team involved).
- 13.00 Wed: Jay offered to re-run the ETL.
- 13.03 Wed: Ashley encouraged caution unless it was clear which job ran the notifications load.
- 14.26 Wed: Jay ran the tis-esr-parameterized ETL. The team didn't feel there was any way to know whether this was the correct fix until the 17.30 job ran. [Note: this is a big concern]
- 17.22 Wed: Jay asked for Chris' help (out of hours) after seeing errors in the docker log ("Application startup failed....").
- 17.37 Wed: Chris restarted the application and fixed the problem (7 minutes late for the 17.30 job).
...
- Check ESR ETL documentation is up to date - assumption is that it can't be.
- If the ESR ETL fails it needs to be clear how to diagnose possible causes and what to do to correct it, before the 17.30 job runs.
- Document how and when to run tis-esr-etl-cron and tis-esr-parametrized, and what each parameter does, so that anyone with access can confidently re-run a failed ESR job in the future. This documentation should correlate with the business requirements in ESR Schedules.
Lessons Learned
- Don't be afraid to ask for help, especially when you know you are panicking (which everyone does, especially when they feel they have, or should have, the skills to fix the problem themselves, but can't seem to as quickly as they anticipate).
- Communicate (who might have some insight into the problem?).
- The #esr_operations channel is to be monitored by the wider team.
What went well
- .
What went wrong
- The Dev that took the initial action was uncertain whether the action they took would be successful.
- They took the action without any way to check it was the correct action until it was too late.
- There was no communication with the ESR team to alert them to the problem.
...
- No one in the business has yet flagged the problem.
- We were in time to send the notifications for changes made on 16/10. Had we missed them, more manual steps would have been required to bring the data on ESR back in sync.
- APP files missed on 17/10 would be picked up automatically by the following day's jobs, so no further manual intervention was needed.
Supporting information
- Slack conversation in: #esr_operations
The tis-esr-etl-cron job runs automatically at the following times daily: 02.00, 03.30, 17.00, 17.30 and 20.00.
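One difficulty during the incident was knowing how long the team had before the next automatic run. The following is a minimal sketch of how that could be worked out from the daily times listed above. It is not part of the ESR ETL: the function names `next_run` and `minutes_until_next_run` are hypothetical, the times are assumed to be local time, and the date in the usage example is arbitrary.

```python
from datetime import datetime, time, timedelta

# Daily run times for tis-esr-etl-cron as listed in this log.
# Assumption: these are local times and the list is complete.
SCHEDULE = [time(2, 0), time(3, 30), time(17, 0), time(17, 30), time(20, 0)]


def next_run(now, schedule=SCHEDULE):
    """Return the first scheduled run strictly after `now` (hypothetical helper)."""
    for run_time in sorted(schedule):
        candidate = datetime.combine(now.date(), run_time)
        if candidate > now:
            return candidate
    # No runs left today: roll over to the earliest run tomorrow.
    tomorrow = now.date() + timedelta(days=1)
    return datetime.combine(tomorrow, min(schedule))


def minutes_until_next_run(now, schedule=SCHEDULE):
    """Whole minutes remaining before the next scheduled run (hypothetical helper)."""
    return int((next_run(now, schedule) - now).total_seconds() // 60)


if __name__ == "__main__":
    # Arbitrary date; 14.26 is the time of the manual re-run in the timeline.
    manual_rerun = datetime(2000, 1, 1, 14, 26)
    print(next_run(manual_rerun).strftime("%H.%M"))
    print(minutes_until_next_run(manual_rerun))
```

A check like this could sit alongside the failure alert in #esr_operations so that whoever picks up the problem knows how much time remains before the next job depends on a fix.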