...
Details of the scheduled jobs are here: ESR Schedules.
Root Causes
- Open question for Chris Mills: ESR_ETL has not changed since the end of July and all the jobs have run successfully since, so why was a restart required?
- Docker container failure, due to old versions of Docker having issues with zombie containers.
- Use of the Docker internal network for a large volume of network traffic.
Trigger
- .
Resolution
- Chris made a fix: restarted the Docker service.
- Restarted ESR
- Chris identified the correct ETLs to run manually for the Notifications Load (around 17.38), in time for the next automatic job at 18.00. This was not quite in time overall, as the APP files produced by the 15.30 job were missed.
- The ETL ran OK on subsequent jobs for the various load types, through to the 02.00 job on the morning following the initial problem.
...
- 02.00 Wed: ESR ETL failed.
- 02.01 Wed: Notification of failure in the #esr_operations Slack channel.
- 09.20 Wed: Ashley alerted the whole team to the issue.
- 09.30 Wed: TIS team Sprint Day - Sprint Review began (whole TIS team involved).
- 13.00 Wed: Jay offered to re-run the ETL.
- 13.03 Wed: Ashley encouraged caution unless it was clear which job ran the notifications load.
- 14.26 Wed: Jay ran the tis-esr-parameterized ETL. The team felt there was no way to know whether this was the correct fix until the 17.30 job ran. [Note: this is a big concern]
- 14.49 Wed: Ola asked for help; we restarted the Docker service and assumed this would be fine.
- 17.22 Wed: Jay asked for Chris' help (out of hours) after seeing errors in the Docker log ("Application startup failed....") in the ESR app, not the ESR ETL app.
- 17.37 Wed: Chris restarted the ESR application and fixed the problem (7 minutes late for the 17.30 job).
- 17.37 Wed: Chris ran the job manually.
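The timeline turns on how much time was left before the next scheduled job. As a rough illustration only, the sketch below works out the next job start and the minutes remaining from a given moment. The job times are just the ones mentioned in this write-up (02.00, 15.30, 17.30, 18.00); the authoritative list is in ESR Schedules. The function name `next_job` and the date used in the example are hypothetical.

```python
from datetime import datetime, time, timedelta

# Assumption: only the job times mentioned in this write-up.
# The authoritative list lives in ESR Schedules.
JOB_TIMES = [time(2, 0), time(15, 30), time(17, 30), time(18, 0)]


def next_job(now, job_times=JOB_TIMES):
    """Return (start, minutes_remaining) for the first job at or after `now`."""
    candidates = []
    for t in job_times:
        start = datetime.combine(now.date(), t)
        if start < now:
            # Today's run has already passed, so consider tomorrow's run.
            start += timedelta(days=1)
        candidates.append(start)
    start = min(candidates)
    minutes_remaining = int((start - now).total_seconds() // 60)
    return start, minutes_remaining


# Example with an arbitrary date: a fix landing at 17.37 has missed the
# 17.30 job, and the next job in this list is the 18.00 one.
start, minutes = next_job(datetime(2024, 1, 1, 17, 37))
print(start.strftime("%H.%M"), minutes)
```

A check like this could be run before attempting a fix, to decide whether there is time to diagnose properly or whether the next job should simply be re-run afterwards.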
Action Items
- Check ESR ETL documentation is up to date - assumption is that it can't be.
- If the ESR ETL fails it needs to be clear how to diagnose possible causes and what to do to correct it, before the 17.30 job runs.
- Documentation on how and when to run tis-esr-etl-cron and tis-esr-parametrized, and what each parameter does, so that anyone with access can confidently re-run a failed ESR job in the future. The documentation should correlate with the business requirements in ESR Schedules.
...
- Don't be afraid to ask for help, especially when you know you are panicking (which everyone does, especially when they feel they have, or should have, the skills to fix the problem themselves, but can't seem to as quickly as they anticipate).
- Communicate (who might have some insight into the problem?).
- Don't assume things need to be fixed out of hours. They don't.
- #esr_operations channel to be monitored by the wider team
...
- The Dev that took the initial action was uncertain whether the action they took would be successful.
- They took the action without any way to check it was the correct action until it was too late.
- There was no communication with the ESR team to alert them to the problem.
- Developers had no idea how to debug simple Docker problems.
- Confusion around which application does what: ESR and ESR ETL.
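As a starting point for the Docker debugging gap noted above, here is a minimal sketch of a first-pass triage helper. It is not the team's procedure. It only wraps three standard Docker CLI commands (`docker ps`, `docker inspect`, `docker logs`). The function names `triage_commands` and `run_triage` are hypothetical, and the real container names for ESR and ESR ETL are not given in this write-up, so they must be supplied by the caller.

```python
import subprocess


def triage_commands(container, tail=50):
    """Build the read-only Docker commands for a first look at a container."""
    return [
        # List containers that have exited (possible crashed or zombie ones).
        ["docker", "ps", "--all", "--filter", "status=exited"],
        # Report the current state of the container in question.
        ["docker", "inspect", "--format", "{{.State.Status}}", container],
        # Show the most recent log lines, where startup errors would appear.
        ["docker", "logs", "--tail", str(tail), container],
    ]


def run_triage(container, tail=50):
    """Run each triage command and print whatever it wrote."""
    for cmd in triage_commands(container, tail):
        print("$ " + " ".join(cmd))
        result = subprocess.run(cmd, capture_output=True, text=True)
        # `docker logs` can write to stderr as well as stdout, so show both.
        print(result.stdout)
        print(result.stderr)
```

Calling `run_triage("<container name>")` on the host would print the three outputs in order. All three commands are read-only, so they are safe to run before deciding whether a restart is needed.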
Where we got lucky
- No one in the business has yet flagged the problem.
- We were in time for the notifications made on 16/10 to be sent. Had we missed them, more manual steps would have been required to bring the data on ESR back in sync.
- The APP files missed on 17/10 would be picked up automatically by the following day's jobs, so no further manual intervention was needed.
...