Date | |
Authors | |
Status | In Progress |
Summary | ETL failed after an initial attempt at moving it into our new AWS ECS environment |
Impact | Downstream NDW and Hicom processes failed - data in NDW is now 1 day out of date |
Non-technical Description
The TIS team moved our ETL from Jenkins to the new AWS ECS environment. Although it ran, it started later than needed and took longer to complete. As a result, the downstream NDW processes that begin at 03:45 had no fresh data to work with, and the Hicom pull into Leave Manager could not pull live data.
Trigger
Moving the ETL from Jenkins to ECS without starting it at the right time, and without optimising its run time so that, had it started at the right time, it would have completed in time for the downstream NDW and Hicom processes to use current data.
Detection
No alerts in the #monitoring-ndw channel (alerts had been moved to the #monitoring-prod channel).
Data lead and NDW team member enquiries in shared Slack channel
Resolution
Revert to manually re-running the Jenkins ETL so that downstream processes can be completed later today.
Re-work the new ECS ETLs to start after the TIS Sync jobs on which they depend have finished (i.e. after 02:40), and to complete before the NDW and Hicom processes that depend on the ETLs begin (i.e. before 03:45).
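The timing constraint above can be expressed as a simple check. The following is a minimal sketch only, not part of the ETL code: the function names are hypothetical, the 02:40 and 03:45 boundaries are the ones quoted in this report, and the placeholder date is arbitrary because only the time of day matters.

```python
from datetime import datetime, timedelta

# Boundaries taken from this report; the calendar date is an arbitrary placeholder.
SYNC_JOBS_FINISH = datetime(2000, 1, 1, 2, 40)  # TIS Sync jobs have finished by about 02:40
DOWNSTREAM_START = datetime(2000, 1, 1, 3, 45)  # NDW and Hicom processes begin at 03:45


def fits_window(start: datetime, expected_duration: timedelta) -> bool:
    """Return True if a run starts after the sync jobs and ends before downstream begins."""
    end = start + expected_duration
    return start >= SYNC_JOBS_FINISH and end <= DOWNSTREAM_START


def latest_safe_duration(start: datetime) -> timedelta:
    """Longest run time that still finishes before the downstream processes start."""
    return max(DOWNSTREAM_START - start, timedelta(0))


# The run on the day of the incident: roughly 04:00 to 05:30 according to the alert.
incident_start = datetime(2000, 1, 1, 4, 0)
print(fits_window(incident_start, timedelta(minutes=90)))

# A 02:45 start, as discussed by the team, with the same approximate run time.
proposed_start = datetime(2000, 1, 1, 2, 45)
print(fits_window(proposed_start, timedelta(minutes=90)))
print(latest_safe_duration(proposed_start))
```

Under these assumptions, a 02:45 start leaves one hour for the job, so rescheduling alone would not be enough if the run still took around 90 minutes; the run time would also need to come down.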
Timeline
: 03:15 - No alert in the #monitoring-ndw channel at the expected time (c. 03:15 - 03:35)
: 04:00 - An alert in the #monitoring-prod channel of ETL success, but in the wrong time slot (c. 04:00 - 05:30)
: 10:12 - Users in the #tis-ndw-etl channel highlighted old data in the NDW
: 10:17 - Confirmed to users we were aware and looking into the issue
: 10:20-11:30 - Discussions in TIS team to identify exactly what had happened/been missed and what to do to resolve it. This resulted in an agreement to reschedule the AWS ECS job to start shortly after 02:40 (02:45 perhaps), and optimise the performance of the job to ensure it completed before 03:45
: 11:55 - Pepe restarted the job from Jenkins instead of AWS ECS
: 12:00 - Sarah Krause confirms that the London loads keep trying until they can grab new data, so were able to process data when the new ETL completed (c.05:30).
: 12:39 - Pepe confirms the successful re-running of the TIS NDW ETL, and informs NDW they can re-run their downstream job
Root Cause(s)
We focused on getting the ETLs to work in ECS, and did not pay enough attention to when the ETLs were scheduled or how quickly they needed to complete in order not to impact the processes that depend on them.
Lessons Learned
Be careful when modifying ETLs. Modify the ETL on Stage first. Consider carefully whether it would be best to run the modification overnight on Stage, and promote it to Prod the next day only once it has been confirmed to execute successfully on Stage. Refer to the image to the right when modifying in future (and keep that image up to date). Keep the overall ETL Timings (re-draft - for discussion) page up to date.