Date | |
Authors | |
Status | Done |
Summary | ETL failed after modifying the configuration |
Impact | Data in NDW was 1 day out of date for a short period out-of-hours |
Non-technical Description
As part of moving the HEE ETLs, the location we moved the ETLs to was slightly mis-configured. The ETL attempted to run with this mis-configuration and failed.
...
Trigger
An additional availability zone (AZ) was added to the configuration, which meant the job could, and did, run in a location that didn’t have the required access.
...
Notifications in the #monitoring-ndw channel.
Notification in the #monitoring-prod channel confirmed the ETL ran at the wrong time and took too long to complete.
Data lead and NDW team member enquiries in the shared #tis-ndw-etl Slack channel.
...
TIS Data Manager confirmed no downstream processes were affected.
...
Resolution
Re-ran the ETL in the fully configured part of the network.
Long-term:
Altered configuration to run on the same subnet as the database.
Removed unused/partially functional subnet
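One way to catch this kind of drift before a change is applied is to compare the subnets a job is allowed to run in against the subnets known to have database access. The following is a minimal sketch only, not the check the team used: the `Subnet` type, the `find_unapproved_subnets` helper and all IDs and zone names are hypothetical, and in practice the values would need to be read from the IaC definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subnet:
    """Hypothetical, simplified view of a subnet from an IaC definition."""
    subnet_id: str
    availability_zone: str


def find_unapproved_subnets(job_subnets, approved_subnet_ids):
    """Return the job subnets that are not in the approved set."""
    return [s for s in job_subnets if s.subnet_id not in approved_subnet_ids]


# Illustrative values only; these IDs and zones are not taken from the incident.
database_subnet = Subnet("subnet-db", "zone-a")
etl_subnets = [
    Subnet("subnet-db", "zone-a"),
    Subnet("subnet-extra", "zone-b"),  # an extra entry of the kind described above
]

unapproved = find_unapproved_subnets(etl_subnets, {database_subnet.subnet_id})
for subnet in unapproved:
    print(f"Not approved: {subnet.subnet_id} ({subnet.availability_zone})")
```

A check like this could run as part of review or deployment, so that a job configured for a location without the required access is flagged before it is scheduled.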
...
Timeline
: 02:30 - Failure message in the #monitoring-ndw channel
...
w/c :- Additional unused infrastructure decommissioned
Root Cause(s)
Additional infrastructure created outside of normal processes.
Change applied with assumptions about correctness of IaC definitions.
Lessons Learned
Just because it looks right, it doesn’t mean it is.
Action Items
Action Items | Owner | Status |
---|---|---|
Update the ETL timings page. Use wider list of ETLs, i.e. image with swim-lanes above. | | Page updated. Diagram with swimlanes/showing full dependency graph still needs to be updated. |