2021-07-23 NIMDTA TIS-NDW ETL failure

Date

Jul 23, 2021

Authors

@Joseph (Pepe) Kelly

Status

Done

Summary

The NIMDTA TIS-NDW ETL failed after its configuration was modified.

Impact

Data in NDW was 1 day out of date for a short period, out of hours.

Non-technical Description

As part of moving the HEE ETLs, the location we moved the ETLs to was slightly mis-configured. The ETL attempted to run with this mis-configuration and failed.


Trigger

An additional availability zone (AZ) was added to the configuration, which meant the job could, and did, run in a location that didn’t have the required access.


Detection

Notifications in the #monitoring-ndw channel.

TIS Data Manager confirmed no downstream processes were affected.


Resolution

Re-ran the job in a fully configured part of the network.

Long-term:

  1. Altered the configuration so the job runs on the same subnet as the database.

  2. Removed the unused, partially functional subnet.
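The kind of check that would flag this mis-configuration before a job runs can be sketched as below. This is a minimal illustration, not the actual IaC or tooling used: the `Subnet` type, the `subnets_without_db_access` helper, and the subnet and AZ names are all hypothetical, and the `can_reach_database` flag stands in for whatever routing or security-group lookup would be needed in practice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subnet:
    # All fields are hypothetical; real values would come from the IaC definitions.
    subnet_id: str
    availability_zone: str
    can_reach_database: bool


def subnets_without_db_access(job_subnets):
    """Return the IDs of subnets the job may run in that cannot reach the database."""
    return [s.subnet_id for s in job_subnets if not s.can_reach_database]


# Assumed shape of the configuration at the time of the incident:
# a second AZ was added whose subnet lacked the required access.
before = [
    Subnet("subnet-a", "az-1", can_reach_database=True),
    Subnet("subnet-b", "az-2", can_reach_database=False),
]

# Assumed shape after the long-term fix: only the database's subnet remains.
after = [Subnet("subnet-a", "az-1", can_reach_database=True)]

print(subnets_without_db_access(before))
print(subnets_without_db_access(after))
```

Any non-empty result means the job could be scheduled somewhere it cannot reach its source database, which is the situation described in the Trigger section.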


Timeline

Jul 23, 2021: 02:30 - Failure message in the #monitoring-ndw channel

Jul 23, 2021: 03:00 to 03:20 - Configuration modified and job rerun

Jul 23, 2021 - Configuration (IaC) definition modified to use the subnet that the source database runs on.

w/c Jul 26, 2021 - Additional unused infrastructure decommissioned

 

Root Cause(s)

Additional infrastructure was created outside of normal processes.

The change was applied on the assumption that the IaC definitions were correct.

Lessons Learned

  • Just because it looks right doesn’t mean it is.

Action Items

Action Items

Owner

Status

Update the ETL timings page. Use wider list of ETLs, i.e. image with swim-lanes above.

@Ashley Ransoo

Page updated. The swim-lane diagram showing the full dependency graph still needs to be updated.