2021-03-10 The NDW ETL failed and didn't recover for PROD & STAGE

Date

Mar 10, 2021

Authors

@Philip Wilsdon (Unlicensed)

Status

Done

Summary

https://hee-tis.atlassian.net/browse/TIS21-1300

Impact

  • An increase in the number of helpdesk calls, as people could not see trainees in Intrepid Leave Manager.

  • Some Tableau reports would have been inaccurate in the morning.

Non-technical Description

There was an interruption that stopped TIS data being sent to Hicom via the NDW.


Trigger

  • Connection termination, probably due to NDW maintenance

 


Detection

  • Alerting in our monitoring channel

 


Resolution

  • Reran the ETLs (resolved some downstream impact)

  • Intrepid Leave Manager: agreed with HICOM to wait until the next day rather than re-run jobs


Timeline

  • Mar 10, 2021 - 04:41 & 05:02 - Alerts in the NDW monitoring channel

  • Mar 10, 2021 - 08:30 - Notification to users on Slack

  • Mar 10, 2021 - 09:06 - NDW successfully re-run on stage and prod

  • Mar 10, 2021 - 11:45 - NDW team confirm downstream

  • Mar 10, 2021 - ~midday - Agreed with HICOM that waiting until tomorrow would be less disruptive than re-running jobs

  • Mar 11, 2021 - 10:50 - Confirmed that Leave Manager looks correct; standard checks are displaying normal results.

 


Root Cause(s)

  • The Ansible job timed out.

  • The ETL kept retrying a failed chunk, and every retry failed as well:

  • All connections in the pool to NDW got closed and didn’t get re-created.
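
The failure mode above (retries reusing connections that were already closed) can be illustrated with a minimal sketch. This is not the ETL's actual code: `connect`, `process_chunk`, `ConnectionClosedError` and `run_chunk_with_reconnect` are hypothetical names used only for illustration. The idea is that each retry obtains a fresh connection instead of reusing one that the server may have terminated.

```python
import time


class ConnectionClosedError(Exception):
    """Hypothetical error raised when the server has dropped the connection."""


def run_chunk_with_reconnect(connect, process_chunk, chunk,
                             max_attempts=3, delay_seconds=0.0):
    """Process one chunk, opening a new connection for every attempt.

    `connect` returns an object with a close() method and
    `process_chunk(conn, chunk)` does the work. Both are assumptions
    standing in for whatever the real ETL uses.
    """
    last_error = None
    for _attempt in range(max_attempts):
        conn = connect()  # never reuse a connection from a failed attempt
        try:
            return process_chunk(conn, chunk)
        except ConnectionClosedError as err:
            last_error = err
            time.sleep(delay_seconds)  # brief pause before reconnecting
        finally:
            conn.close()  # release the connection whether or not it worked
    raise RuntimeError(
        f"chunk {chunk!r} failed after {max_attempts} attempts"
    ) from last_error
```

With a connection pool, the equivalent would be to validate or discard pooled connections before reuse. Whether the pool used by the ETL supports that is not stated in this report.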


Action Items

Action Items

Owner

Ansible retries (Park this for now?)

@John Simmons (Deactivated)

https://hee-tis.atlassian.net/browse/TIS21-1316

@Marcello Fabbri (Unlicensed) @Edward Barclay

https://hee-tis.atlassian.net/browse/TIS21-1318

@Edward Barclay @Marcello Fabbri (Unlicensed)

https://hee-tis.atlassian.net/browse/TIS21-1319

@Andy Dingley


Lessons Learned

  •