2024-01-26 NDW ETL failed part way through

Date

Jan 8, 2024

Authors

@Joseph (Pepe) Kelly

Status

Documenting

Summary

The ETL that feeds reporting failed and meant that partial information was picked up and made available in the NDW

Impact

Reports would have been missing Post Funding and based on a mix of data as of 25th & 26th Jan

Non-technical Description

Data storage (ETL)  that feeds information for use failed which meant that not all data were processed or completed as a result users were getting incomplete data. There were some changes related to Post Funding which had been released to the pre production environment for testing. The “Continuous Deployment” approach meant that the job that creates information in the Data Warehouse was updated before the source of the new information had been created. This meant that it was unable to copy any post funding information across.

A user reported the issue in a report that uses the Post Funding information. The data was refreshed, updating the report.

We have now released the latest versions of the source and job, meaning that as Post Funding information is added, it will be available in the NDW and available for new or updated reports.


Trigger

  • Automated deployment to prod


Detection

  • Slack


Resolution

  • Re-ran ETL with previous version

  • Set scheduled run to use previous version


Timeline

All times in GMT unless indicated

  • Jan 25, 2024 - Change to the preprod environment that required a change in the ETL

  • Jan 26, 2024 01:07 - Matching change to the ETL is made

  • Jan 26, 2024 01:25 - Pull Request merged

  • Jan 26, 2024 02:00 & 02:30 - The updated ETL is run for NHS E & NIMDTA respectively

  • Jan 26, 2024 05:39 - The previous versions are run on production

  • Jan 29, 2024 - Changes released to prod environments and updated the scheduled event to use latest task definitions

Root Cause(s)

  • We got a slack message that the ETL was running for NHS E & NIMDTA and there was no completion message

  • The ETL was retrying the step that creates Post Funding

  • The step was failing because the SQL included a field which didn’t exist

  • The ETL relied on some database changes which hadn’t been released to production sites yet

  • The ETL workflow automatically deploys unless it is cancelled in a 5 minute window

  • The workflow can’t have an approval step while the source is private and part of the current subscription

 

 


Action Items

Action Items

Owner

 

Action Items

Owner

 

Add max retries for all steps to reduce the scope of the failure

 

This does have value and we may look into this in the future but dependent on whether we want to stick with spring batch

Spike: What are the options for adding workflow approvals? explore options for enabling GHA workflow approvals:

  1. Using an Enterprise subscription

  2. Make repository public

@Joseph (Pepe) Kelly

 


Lessons Learned

  •