Date:
Authors: Ashley Ransoo, Jayanta Saha, Oladimeji Onalaja, Chris Mills
Status: Results of issue resolved. Steps to mitigate it happening again...?
Summary: The ESR ETL that runs at 02.00 each morning failed. The TIS team were unable to run it successfully until after 17.30, meaning a downstream process was compromised.
Impact: During ARCP reporting season, local offices are unable to run extracts that need to be sent to the GMC.

Jira reference

TISNEW-2109

Impact

  • The job that was meant to run at 02.00 produces the notifications (changes to placements that have taken place during the day, e.g. withdrawal of trainees, rotations etc.).
  • Because that job failed, the job that ran at 17.30 did not pick up those notifications.

Details of the scheduled jobs are here: ESR Schedules.

Root Causes

  • .

Trigger

  • .

Resolution

  • Chris identified the correct ETLs to run and ran them, but not quite in time for the secondary job.
  • The ETL ran successfully the morning after the initial problem.

Detection / Timeline

  • 02.00 Wed: ESR ETL failed.
  • 02.01 Wed: Notification of failure in the #esr_operations Slack channel.
  • 09.20 Wed: Ashley alerted the whole team to the issue.
  • 09.30 Wed: TIS team Sprint Day - Sprint Review began (whole TIS team involved).
  • 13.00 Wed: Jay offered to re-run the ETL.
  • 13.03 Wed: Ashley encouraged caution unless it was clear which job ran the notifications load.
  • 14.26 Wed: Jay ran the tis-esr-parameterized ETL. The team felt there was no way to know whether this was the correct fix until the 17.30 job ran. [Note: this is a big concern]
  • 17.22 Wed: Jay asked for Chris's help (out of hours).
  • 17.37 Wed: Chris restarted the application and fixed the problem (7 minutes late for the 17.30 job).
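The gaps in the timeline above are easier to see as elapsed minutes. The following is a minimal sketch of that arithmetic only; the calendar date is a placeholder (the page does not record one), and the helper names `at` and `minutes` are hypothetical, introduced purely for illustration.

```python
from datetime import datetime, timedelta

# Placeholder date: the page does not record the real one.
# Only the differences between times matter here.
PLACEHOLDER_DATE = "2000-01-05"


def at(hhmm: str) -> datetime:
    """Parse a timeline time written as HH.MM on the placeholder date."""
    return datetime.strptime(f"{PLACEHOLDER_DATE} {hhmm}", "%Y-%m-%d %H.%M")


def minutes(delta: timedelta) -> int:
    """Convert a time difference to whole minutes."""
    return int(delta.total_seconds() // 60)


# Times copied from the timeline.
failure_notified = at("02.01")  # Slack notification of the failure
team_alerted = at("09.20")      # whole team alerted
first_rerun = at("14.26")       # first manual re-run
deadline = at("17.30")          # downstream job
fixed = at("17.37")             # problem fixed

time_to_alert = minutes(team_alerted - failure_notified)
time_to_first_rerun = minutes(first_rerun - failure_notified)
window_before_deadline = minutes(deadline - failure_notified)
minutes_late = minutes(fixed - deadline)

print(f"Notification to team alert:    {time_to_alert} min")          # 439
print(f"Notification to first re-run:  {time_to_first_rerun} min")    # 745
print(f"Notification to 17.30 job:     {window_before_deadline} min") # 929
print(f"Fix relative to 17.30 job:     {minutes_late} min late")      # 7
```

On these figures, roughly fifteen and a half hours were available between the failure notification and the 17.30 job, and the first re-run happened more than twelve hours into that window.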

Action Items

  • Check whether the ESR ETL documentation is up to date. The working assumption is that it is not.
  • Document how to diagnose the possible causes of an ESR ETL failure and how to correct each one, so that a failure can be fixed before the 17.30 job runs.

Lessons Learned

  • Don't be afraid to ask for help, especially when you know you are panicking. Everyone panics, particularly when they feel they have, or should have, the skills to fix the problem themselves but cannot fix it as quickly as they expected.
  • Communicate: ask who might have some insight into the problem.

What went well

  • .

What went wrong

  • The developer who took the initial action was uncertain whether it would be successful.
  • There was no way to check that the action was the correct one until it was too late.
  • There was no communication with the ESR team to alert them to the problem.

Where we got lucky

  • No one in the business has yet flagged the problem.

Supporting information

  • Slack conversation in: #esr_operations
  • The tis-esr-etl-cron job runs automatically every day at 02.00, 03.30, 17.00, 17.30 and 20.00.
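As a rough aid for reasoning about deadlines during an incident, the daily schedule listed above can be expressed as data and queried. This is only a sketch: the function names `next_run` and `runs_between` are hypothetical, and the page does not state what each individual run does, so the code makes no assumption about that.

```python
from datetime import time
from typing import List, Optional

# Daily run times of tis-esr-etl-cron, as listed above.
SCHEDULE: List[time] = [
    time(2, 0),
    time(3, 30),
    time(17, 0),
    time(17, 30),
    time(20, 0),
]


def next_run(now: time) -> Optional[time]:
    """Return the first scheduled run strictly after `now` today, or None if none remain."""
    for run in SCHEDULE:
        if run > now:
            return run
    return None


def runs_between(start: time, end: time) -> List[time]:
    """Return scheduled runs after `start` (exclusive) and up to `end` (inclusive)."""
    return [run for run in SCHEDULE if start < run <= end]


# After the 02.00 failure, the next scheduled run was 03.30.
print(next_run(time(2, 0)))

# Scheduled runs that fell between the failure (02.00) and the fix (17.37).
print(runs_between(time(2, 0), time(17, 37)))

# After the fix at 17.37, the next scheduled run was 20.00.
print(next_run(time(17, 37)))
```

A check like this could back a runbook step that states, at the moment a failure is detected, how long remains before the next run that depends on it.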
