Date

Authors

Philip Wilsdon (Unlicensed)

Status

Documenting

Summary

https://hee-tis.atlassian.net/browse/TIS21-1300

Impact

  • An increase in the number of helpdesk calls, as people could not see trainees in Intrepid Leave Manager.

  • Some Tableau reports would have been inaccurate in the morning.

Non-technical Description

There was an interruption that stopped TIS data being sent to Hicom via the NDW.


Trigger


Detection

  • Alerting in our monitoring channel


Resolution

  • Reran the ETLs (resolved some downstream impact)

  • Intrepid Leave Manager


Timeline

  • 04:41 & 05:02 - Alerts in the NDW monitoring channel

  • 09:06 - NDW successfully re-run on stage and prod

  • 10:50 - Confirmed that Leave Manager looks correct; standard checks are displaying normal results

  • ~midday - Agreed with Hicom that waiting until tomorrow would be less disruptive than re-running jobs


Root Cause(s)

  • The Ansible job timed out

  • The ETL kept retrying a failed chunk

  • All connections to NDW got closed
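The second root cause (a failed chunk retried indefinitely) is the kind of failure a bounded retry with backoff is meant to contain. The following is a minimal sketch of that idea only, not the actual ETL code: `run_chunk_with_retries`, `ChunkFailed`, `load_chunk` and the default limits are all hypothetical names and values chosen for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class ChunkFailed(Exception):
    """Raised when a chunk still fails after the allowed number of attempts."""


def run_chunk_with_retries(
    load_chunk: Callable[[], T],
    max_attempts: int = 3,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run load_chunk, retrying a bounded number of times with exponential backoff.

    All names and defaults here are assumptions for illustration; a real ETL
    would catch narrower exception types than Exception.
    """
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            return load_chunk()
        except Exception as exc:
            last_error = exc
            if attempt < max_attempts:
                # Wait 1x, 2x, 4x ... the base delay between attempts.
                sleep(base_delay_s * 2 ** (attempt - 1))
    # Give up loudly so the job fails fast instead of running into a timeout.
    raise ChunkFailed(f"chunk failed after {max_attempts} attempts") from last_error
```

Failing after a fixed number of attempts would let the job exit with a clear error rather than retrying until the surrounding Ansible job times out.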


Action Items

  • Ansible retries

  • Improve alerting from NDW ETLs (probably using Sentry): https://hee-tis.atlassian.net/browse/TIS21-1316 - Owners: Marcello Fabbri (Unlicensed), Edward Barclay

  • Configure persistent logs from NDW ETLs: https://hee-tis.atlassian.net/browse/TIS21-1317 - Owners: Edward Barclay, Marcello Fabbri (Unlicensed)

  • Improve connection pools: validation, eviction etc.?
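To make the connection pool item concrete, here is a minimal sketch of what "validation" and "eviction" could mean: test an idle connection before handing it out, and discard connections that have been idle too long. It uses Python's built-in `sqlite3` purely as a stand-in, since the NDW database driver and the ETL's real pooling library are not stated here; `ValidatingPool` and its parameters are hypothetical.

```python
import sqlite3
import time
from collections import deque


class ValidatingPool:
    """Illustrative pool: validates on borrow, evicts idle connections by age."""

    def __init__(self, connect, max_idle_s=300.0, clock=time.monotonic):
        self._connect = connect        # factory that opens a new connection
        self._max_idle_s = max_idle_s  # assumed idle limit before eviction
        self._clock = clock
        self._idle = deque()           # (connection, time it was returned)

    @staticmethod
    def _is_valid(conn):
        # Validation: a cheap query that fails if the connection is closed.
        try:
            conn.execute("SELECT 1")
            return True
        except sqlite3.Error:
            return False

    @staticmethod
    def _discard(conn):
        try:
            conn.close()
        except sqlite3.Error:
            pass

    def acquire(self):
        while self._idle:
            conn, returned_at = self._idle.popleft()
            too_old = self._clock() - returned_at > self._max_idle_s
            if not too_old and self._is_valid(conn):
                return conn
            # Eviction: drop stale or broken connections instead of reusing them.
            self._discard(conn)
        return self._connect()

    def release(self, conn):
        self._idle.append((conn, self._clock()))


pool = ValidatingPool(lambda: sqlite3.connect(":memory:"))
conn = pool.acquire()
conn.close()            # simulate the server closing the connection
pool.release(conn)
fresh = pool.acquire()  # the closed connection is discarded and replaced
```

Most production pooling libraries offer equivalent settings, so in practice this would more likely be a configuration change than new code.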


Lessons Learned
