2021-07-23 TIS-NDW ETL failure
Date | Jul 23, 2021 |
Authors | @Joseph (Pepe) Kelly @Jayanta Saha |
Status | Done |
Summary | ETL failed after an initial attempt at moving it into our new AWS ECS environment |
Impact | Downstream NDW and Hicom processes failed - data in NDW was 1 day out of date (this has now been fixed) |
Non-technical Description
The TIS team moved our ETL into a new AWS environment. Although it ran, it started later than needed and took longer to complete. As a result, the downstream NDW processes that begin at 03.45 had no fresh data to work with, and the Hicom pull into Leave Manager could not pull live data.
Trigger
Moving the ETL from Jenkins to ECS meant the job did not run within its required time constraints (it did not complete by the required time).
Detection
No notifications in the #monitoring-ndw channel. Job notifications appeared in the #monitoring-prod channel.
Notification in the #monitoring-prod channel confirmed the ETL ran at the wrong time and took too long to complete.
Data lead and NDW team member enquiries in the shared #tis-ndw-etl Slack channel.
Resolution
Revert to a manual reuse of the Jenkins ETL in order that downstream processes can be completed later today.
Re-work the new ECS ETLs to start after the TIS Sync jobs on which they depend have finished (i.e. after 02.40), and complete before the NDW and Hicom processes that depend on the ETLs begin (i.e. before 03.45).
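The timing constraint described above can be expressed as a simple check. The following is a minimal sketch, not the team's actual tooling: the function and constant names are hypothetical, and the only values taken from this report are the 02:40 and 03:45 boundaries.

```python
from datetime import datetime, time, timedelta

# Boundaries taken from this report; the names are illustrative only.
SYNC_JOBS_FINISH = time(2, 40)  # TIS Sync jobs have finished by this time
DOWNSTREAM_START = time(3, 45)  # NDW and Hicom processes begin at this time


def fits_window(start: time, duration: timedelta) -> bool:
    """Return True if a run starting at `start` and lasting `duration`
    begins no earlier than the sync jobs finish and ends no later than
    the downstream processes start."""
    day = datetime(2021, 7, 23)  # any date works; only the time of day matters
    begins = datetime.combine(day, start)
    ends = begins + duration
    return (
        begins >= datetime.combine(day, SYNC_JOBS_FINISH)
        and ends <= datetime.combine(day, DOWNSTREAM_START)
    )


# The failed run (c.04:00 - 05:30) falls outside the window.
print(fits_window(time(4, 0), timedelta(minutes=90)))   # False
# A 02:45 start with a hypothetical 45-minute duration fits.
print(fits_window(time(2, 45), timedelta(minutes=45)))  # True
```

A check like this could be run against the planned start time and a measured preprod duration before changing a schedule.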
Timeline
Jul 23, 2021: 03:15 - No notification in the #monitoring-ndw channel at the expected time (c. 03:15 - 03:35)
Jul 23, 2021: 04:00 - Notification in the #monitoring-prod channel of ETL success but in the wrong time slot (c. 04:00 - 05:30)
Jul 23, 2021: 10:12 - Users in the #tis-ndw-etl channel highlighted old data in the NDW
Jul 23, 2021: 10:17 - Confirmed to users we were aware and looking into the issue
Jul 23, 2021: 10:20-11:30 - Discussions in TIS team to identify exactly what had happened/been missed and what to do to resolve it. This resulted in an agreement to reschedule the AWS ECS job to start shortly after 02:40 (02:45 perhaps), and optimise the performance of the job to ensure it completed before 03:45
Jul 23, 2021: 11:04 - Pepe started the job from Jenkins instead of AWS ECS but this failed and was restarted at 11:55.
Jul 23, 2021: 12:00 - Sarah Krause confirms that the London loads keep trying until they can grab new data, so were able to process data when the new ETL completed (c.05:30).
Jul 23, 2021: 12:39 - Pepe confirms the successful re-running of the TIS NDW ETL, and informs NDW they can re-run their downstream job
Root Cause(s)
Focused on getting the ETLs to work in ECS.
Verification in preprod did not cover all aspects: Notifications, constraints on execution time.
Misalignment of time zones for different
Lack of focus on the scheduling of the ETLs and the speed with which they needed to be processed in order to not impact dependencies.
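The time zone misalignment can be illustrated with a short sketch. It assumes the schedule was being evaluated in UTC/GMT while the dependent jobs follow UK local time. That assumption is consistent with the 04:00 notification in the timeline, but the report does not describe the scheduler configuration itself. The helper name is hypothetical.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # standard library from Python 3.9

LONDON = ZoneInfo("Europe/London")


def london_time(year: int, month: int, day: int, hour: int, minute: int = 0) -> datetime:
    """Interpret the given wall-clock time as UTC and return it in UK local time."""
    utc = datetime(year, month, day, hour, minute, tzinfo=timezone.utc)
    return utc.astimezone(LONDON)


# A job scheduled for 03:00 UTC runs an hour later by the UK clock in summer (BST)...
summer = london_time(2021, 7, 23, 3)
print(summer.strftime("%H:%M %Z"))  # 04:00 BST

# ...but at the same clock time in winter (GMT).
winter = london_time(2021, 1, 23, 3)
print(winter.strftime("%H:%M %Z"))  # 03:00 GMT
```

Under this assumption, a schedule that looks correct in winter drifts by an hour relative to jobs scheduled in local time once BST begins.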
Lessons Learned
Be careful when modifying ETLs:
Refer to the image to the right when modifying in future (and keep that image up to date).
Keep the overall TIS Scheduled Jobs & Timings page up to date.
Action Items
Action Items | Owner | Status |
---|---|---|
Adjusted timing and resource allocation for TIS NDW ETLs, to run after 02:45 and before 03:45 each morning. | @Joseph (Pepe) Kelly | Done. Starts at 3AM BST (rather than 3AM GMT) and configuration has been adjusted to bring duration down. |
Update the ETL timings page. Use wider list of ETLs, i.e. image with swim-lanes above. | @Ashley Ransoo | Page updated. Diagram with swimlanes/showing full dependency graph still needs to be included. |
Add/Update Slack Notifications https://hee-tis.atlassian.net/browse/TIS21-1884?atlOrigin=eyJpIjoiMmNlMTlkNjlhODZiNDg5Mzg2Yzg3MDEyZDg3MzExZjYiLCJwIjoiamlyYS1zbGFjay1pbnQifQ | | Backlog. |
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213