Date

Authors

Joseph (Pepe) Kelly

Status

Done

Summary

ETL failed after modifying the configuration

Impact

Data in NDW was 1 day out of date for a short period out-of-hours.

Non-technical Description

As part of moving the HEE ETLs, the location we moved the ETLs to was slightly mis-configured. The ETL attempted to run with this mis-configuration and failed.

...

Trigger

An additional availability zone (AZ) was added to the configuration, which meant the job could, and did, run in a location that didn’t have the required access.

...

Notifications in the #monitoring-ndw channel.

Notification in the #monitoring-prod channel confirmed the ETL ran at the wrong time and took too long to complete.

Data lead and NDW team member enquiries in the shared #tis-ndw-etl Slack channel.

...

TIS Data Manager confirmed no downstream processes were affected.

...

Resolution

Re-ran the ETL in a fully configured part of the network.

Long-term:

  1. Altered configuration to run on the same subnet as the database.

  2. Removed the unused/partially functional subnet.
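
The long-term fix amounts to making sure the job can only be scheduled into subnets that can reach the database. A minimal sketch of a pre-deployment check along those lines is shown here. The `Subnet` record, the `can_reach_database` flag, the subnet and zone names, and the `find_unreachable_subnets` helper are all hypothetical and for illustration only; they are not taken from the actual IaC definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subnet:
    """Hypothetical inventory record for one subnet."""
    subnet_id: str
    availability_zone: str
    can_reach_database: bool


def find_unreachable_subnets(job_subnets, known_subnets):
    """Return the IDs of job subnets that are unknown or cannot reach the database."""
    by_id = {s.subnet_id: s for s in known_subnets}
    problems = []
    for subnet_id in job_subnets:
        subnet = by_id.get(subnet_id)
        # Treat an unknown subnet as a problem too: it was never verified.
        if subnet is None or not subnet.can_reach_database:
            problems.append(subnet_id)
    return problems


# Assumed example inventory: one fully configured subnet, one partially configured.
KNOWN = [
    Subnet("subnet-a", "zone-1", True),
    Subnet("subnet-b", "zone-2", False),
]

if __name__ == "__main__":
    # The job configuration lists both subnets; only the second should be flagged.
    print(find_unreachable_subnets(["subnet-a", "subnet-b"], KNOWN))
```

A check like this could run before a configuration change is applied, so that a subnet without the required access is flagged before the scheduler can place the job there.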

...

Timeline

: 02:30 - Failure message in the #monitoring-ndw channel

...

w/c :- Additional unused infrastructure decommissioned

Root Cause(s)

Additional infrastructure created outside of normal processes.

Change applied with assumptions about correctness of IaC definitions.

Lessons Learned

  • Just because it looks right, it doesn’t mean it is.

Action Items

Action Items | Owner | Status

Update the ETL timings page. Use wider list of ETLs, i.e. image with swim-lanes above. | Ashley Ransoo | Page updated. Diagram with swimlanes/showing full dependency graph still needs to be updated.