Date

Authors

Philip Wilsdon (Unlicensed)

Status

Documenting

Summary

https://hee-tis.atlassian.net/browse/TIS21-1300

Impact

  • An increase in the number of helpdesk calls, as people could not see trainees in Intrepid Leave Manager.

  • Some Tableau reports would have been inaccurate in the morning.

Non-technical Description

There was an interruption that stopped TIS data being sent to Hicom via the NDW.


Trigger

  • Connection termination, probably due to NDW maintenance


Detection

  • Alerting in our monitoring channel


Resolution

  • Reran the ETLs (resolved some downstream impact)

  • Intrepid Leave Manager


Timeline

  • 04:41 & 05:02 - Alerts in the NDW monitoring channel

  • 08:30 - Notification to users on Slack

  • 09:06 - NDW ETLs successfully re-run on stage and prod

  • 11:45 - NDW team confirm downstream

  • ~midday - Agreed with Hicom that waiting until tomorrow would be less disruptive than re-running jobs

  • 10:50 - Confirmed that Leave Manager looks correct; standard checks are displaying normal results.


Root Cause(s)

  • The Ansible job timed out.

  • The ETL kept retrying a failed chunk, and every retry failed as well:

  • All connections in the pool to NDW got closed and didn’t get re-created.
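
The failure mode above (retries reusing connections that had already been closed) can be illustrated with a minimal sketch. It is not the TIS ETL code: `run_with_reconnect`, `ConnectionClosedError`, and the fake connection used in the demo are hypothetical names introduced only for illustration. The idea shown is that each attempt obtains a fresh connection rather than reusing one that may be dead.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class ConnectionClosedError(Exception):
    """Hypothetical error: the remote end has closed the connection."""


def run_with_reconnect(
    connect: Callable[[], object],
    work: Callable[[object], T],
    max_attempts: int = 3,
    backoff_seconds: float = 0.0,
) -> T:
    """Run work(conn), opening a fresh connection for every attempt."""
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        # Never reuse a connection that may already have been closed.
        conn = connect()
        try:
            return work(conn)
        except ConnectionClosedError as exc:
            last_error = exc
            if attempt < max_attempts:
                # Simple linear backoff before the next attempt.
                time.sleep(backoff_seconds * attempt)
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error


if __name__ == "__main__":
    opened = []

    def fake_connect() -> int:
        # Stand-in for opening a real database connection.
        opened.append(len(opened) + 1)
        return opened[-1]

    def load_chunk(conn: int) -> str:
        # Simulate the first two connections being closed by the server.
        if conn < 3:
            raise ConnectionClosedError("closed by server")
        return f"chunk loaded on connection {conn}"

    print(run_with_reconnect(fake_connect, load_chunk))
```

In a real pool the same effect is usually achieved by validating connections before handing them out and evicting dead ones. Whether that option is available depends on the pooling library in use, which this page does not name.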


Action Items


Lessons Learned
