2021-03-10 The NDW ETL failed and didn't recover for PROD & STAGE
Date | Mar 10, 2021 |
Authors | @Philip Wilsdon (Unlicensed) |
Status | Done |
Summary | |
Impact | |
Non-technical Description
There was an interruption that stopped TIS data being sent to Hicom via the NDW.
Trigger
Connection termination, probably due to NDW maintenance
Detection
Alerting in our monitoring channel
Resolution
Reran the ETLs (resolved some downstream impact)
Intrepid Leave Manager
Timeline
Mar 10, 2021 - 04:41 & 05:02 - Alerts in the NDW monitoring channel
Mar 10, 2021 - 08:30 - Notification to users on Slack
Mar 10, 2021 - 09:06 - NDW ETL successfully re-run on stage and prod
Mar 10, 2021 - 11:45 - NDW team confirm downstream
Mar 10, 2021 - ~midday - Agreed with Hicom that waiting for tomorrow would be less disruptive than re-running jobs
Mar 11, 2021 - 10:50 - Confirmed that Leave Manager looks correct; standard checks are displaying normal results.
Root Cause(s)
The Ansible job timed out.
The ETL kept retrying a failed chunk, and every retry failed as well:
All connections in the pool to NDW had been closed and were not re-created.
Action Items
Action Items | Owner |
---|---|
Ansible retries (Park this for now?) | @John Simmons (Deactivated) |
@Marcello Fabbri (Unlicensed) @Edward Barclay | |
@Edward Barclay @Marcello Fabbri (Unlicensed) | |
@Andy Dingley |
Lessons Learned
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213