
Date

Authors

Joseph (Pepe) Kelly

Status

Documenting

Summary

ETL failed after modifying the configuration

Impact

Data in NDW was 1 day out of date for a short period out-of-hours.

Non-technical Description

As part of moving the HEE ETLs, the configuration identified an additional availability zone (AZ) that this ETL could run in. The ETL attempted to run in this additional AZ, but the AZ was not fully configured and the job failed.


Trigger

An additional AZ was added to the configuration, which meant the job could run, and did run, in a location that did not have the required access.


Detection

Notifications in the #monitoring-ndw channel.

Notification in the #monitoring-prod channel confirmed the ETL ran at the wrong time and took too long to complete.

Data lead and NDW team member enquiries in the shared #tis-ndw-etl Slack channel.


Resolution

Altered configuration to run on the same subnet as the database.
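
The kind of pre-run check that could flag this class of misconfiguration might look like the sketch below. It is a minimal illustration under stated assumptions, not the team's actual tooling: the `find_unreachable_subnets` helper, the subnet identifiers and the idea of keeping a set of subnets known to have database access are all hypothetical.

```python
from typing import Iterable, List, Set


def find_unreachable_subnets(
    job_subnets: Iterable[str], subnets_with_db_access: Set[str]
) -> List[str]:
    """Return the configured subnets that lack database access, in order."""
    return [s for s in job_subnets if s not in subnets_with_db_access]


# Hypothetical identifiers, for illustration only.
subnets_with_db_access = {"subnet-a"}
job_subnets = ["subnet-a", "subnet-b"]  # "subnet-b" stands in for the extra AZ

unreachable = find_unreachable_subnets(job_subnets, subnets_with_db_access)
if unreachable:
    print(f"Job may be scheduled where it cannot reach the database: {unreachable}")
```

With the job restricted to the database's own subnet, as in the resolution, the helper returns an empty list.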


Timeline

: 02:30 - Failure message in the #monitoring-ndw channel

: 03:00 to 03:20 - Configuration modified and job rerun

: - Configuration (IaC) definition modified to use the subnet that the source database runs on.

w/c :- Additional unused infrastructure decommissioned

Root Cause(s)

Additional infrastructure created outside of normal processes.

Change applied with assumptions about correctness of IaC definitions.

Lessons Learned

Action Items

Action item: Update the ETL timings page. Use a wider list of ETLs, i.e. the image with swim-lanes above.

Owner: Ashley Ransoo

Status: Page updated. The diagram with swim-lanes/showing the full dependency graph still needs to be updated.
