Date | |
Authors | Ashley Ransoo, Jayanta Saha, Oladimeji Onalaja, Chris Mills |
Status | Issue resolved. Steps to mitigate a recurrence are still to be determined. |
Summary | The ESR ETL that runs at 02.00 each morning failed. The TIS team were unable to run it successfully until after 17.30, meaning a downstream process was compromised. |
Impact | During ARCP reporting season, local offices are unable to run the extracts that need to be sent to the GMC. |
Jira reference
- TISNEW-2109
Impact
- The job that was meant to run at 02.00 produces the notifications (changes to placements that have taken place during the day, e.g. withdrawal of trainees, rotations etc.).
- Because that job failed, the job that was supposed to run at 17.30 missed those notifications.
Details of the scheduled jobs are here: ESR Schedules.
Root Causes
- .
Trigger
- .
Resolution
- Chris identified the correct ETLs to run and ran them, but not quite in time for the secondary job.
- The ETL ran successfully the morning after the initial problem.
Detection / Timeline
- 02.00 Wed: ESR ETL failed.
- 02.01 Wed: Notification of failure in the #esr_operations Slack channel.
- 09.20 Wed: Ashley alerted the whole team to the issue.
- 09.30 Wed: TIS team Sprint Day - Sprint Review began (whole TIS team involved).
- 13.00 Wed: Jay offered to re-run the ETL.
- 13.03 Wed: Ashley encouraged caution unless it was clear which job ran the notifications load.
- 14.26 Wed: Jay ran the tis-esr-parameterized ETL. Team didn't feel there was any way to know whether this was the correct fix until the 17.30 job ran. [Note: this is a big concern]
- 17.22 Wed: Jay asked for Chris's help (out of hours).
- 17.37 Wed: Chris restarted the application and fixed the problem (7 minutes late for the 17.30 job).
Action Items
- Check whether the ESR ETL documentation is up to date. The working assumption is that it is not.
- If the ESR ETL fails it needs to be clear how to diagnose possible causes and what to do to correct it, before the 17.30 job runs.
Lessons Learned
- Don't be afraid to ask for help, especially when you know you are panicking. Everyone panics, particularly when they feel they have, or should have, the skills to fix the problem themselves but can't fix it as quickly as they anticipated.
- Communicate (who might have some insight into the problem?).
What went well
- .
What went wrong
- The developer who took the initial action was uncertain whether it would be successful.
- They took the action without any way to check it was the correct action until it was too late.
- There was no communication with the ESR team to alert them to the problem.
Where we got lucky
- No one in the business has yet flagged the problem.
Supporting information
- Slack conversation in: #esr_operations
The tis-esr-etl-cron job runs automatically at the following times daily: 02.00, 03.30, 17.00, 17.30 and 20.00.
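As a rough aid for the action item about fixing a failed ETL before the 17.30 job runs, the daily schedule above can be turned into a small helper that reports the next scheduled run and how much time is left before the 17.30 job. This is only a sketch: the function names `next_run` and `time_remaining` are hypothetical, the times are assumed to be server-local with no timezone handling, and the date used in the example is an arbitrary placeholder because the incident date is not recorded on this page.

```python
from datetime import datetime, time, timedelta

# Daily run times of tis-esr-etl-cron, copied from the schedule listed above.
SCHEDULE = [time(2, 0), time(3, 30), time(17, 0), time(17, 30), time(20, 0)]


def next_run(now, schedule=SCHEDULE):
    """Return the first scheduled run strictly after `now` (hypothetical helper)."""
    for run_time in sorted(schedule):
        candidate = datetime.combine(now.date(), run_time)
        if candidate > now:
            return candidate
    # Past the last run of the day: roll over to tomorrow's first run.
    return datetime.combine(now.date() + timedelta(days=1), min(schedule))


def time_remaining(now, deadline=time(17, 30)):
    """Time left today before `deadline`; negative once the deadline has passed."""
    return datetime.combine(now.date(), deadline) - now


# Placeholder date; only the clock time (14.26, the manual re-run) comes from the timeline.
rerun_at = datetime(2019, 1, 2, 14, 26)
print(next_run(rerun_at))        # 2019-01-02 17:00:00
print(time_remaining(rerun_at))  # 3:04:00
```

Applied to the timeline, the 14.26 re-run left just over three hours before the 17.30 job, while the 17.37 restart gives a negative result of seven minutes, matching the "7 minutes late" noted above.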