Date

Authors

Philip Wilsdon (Unlicensed)

Status

Documenting

Summary

https://hee-tis.atlassian.net/browse/TIS21-1300

Impact

  • An increase in the number of helpdesk calls as people could not see trainees in Intrepid Leave Manager.  

Non-technical Description

There was an interruption that stopped TIS data being sent to Hicom via the NDW.


Trigger


Detection

  • Alerting in our monitoring channel


Resolution


Timeline

  • 04:41 & 05:02 - Alerts in the NDW monitoring channel

  • 09:06 - NDW successfully re-run on stage and prod

  • ~midday - Agreed with Hicom that waiting until tomorrow would be less disruptive than re-running jobs


Root Cause(s)

  • The Ansible job timed out

  • The ETL kept retrying a failed chunk

  • All connections to the NDW got closed
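
The second root cause, an ETL that keeps retrying a failed chunk, is commonly avoided by capping the number of attempts and surfacing the failure once the cap is reached. The following is a minimal Python sketch under that assumption, not the actual TIS ETL code; `process_chunk_with_retry`, `ChunkFailedError`, and all parameters are hypothetical names chosen for illustration.

```python
import time


class ChunkFailedError(Exception):
    """Raised when a chunk still fails after the allowed number of attempts."""


def process_chunk_with_retry(process, chunk, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call process(chunk), retrying at most max_attempts times with exponential backoff."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return process(chunk)
        except Exception as error:  # assumption: a real ETL would catch narrower errors
            last_error = error
            if attempt < max_attempts:
                # Wait 1x, 2x, 4x ... the base delay before the next attempt.
                sleep(base_delay * 2 ** (attempt - 1))
    # Give up and surface the failure instead of retrying forever.
    raise ChunkFailedError(f"chunk failed after {max_attempts} attempts") from last_error
```

Raising a dedicated error once the attempts are exhausted lets the job fail fast and trigger an alert, rather than holding the run open until an outer timeout fires.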


Action Items

  • Improve alerting from NDW ETLs (probably using Sentry). Owners: Marcello Fabbri (Unlicensed), Edward Barclay

  • Configure persistent logs from NDW ETLs. Owners: Edward Barclay, Marcello Fabbri (Unlicensed)

  • Improve connection pools: validation, eviction etc.
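
As a rough illustration of what connection validation and eviction mean in the last action item, here is a minimal Python sketch of a pool that checks a connection before handing it out and discards connections that have been idle too long. It is an assumption-laden toy, not the pool used by the NDW ETLs; `ValidatingPool`, `connect`, `is_valid`, and `max_idle_seconds` are hypothetical names.

```python
import time


class ValidatingPool:
    """Tiny illustrative pool: validates on borrow and evicts idle connections."""

    def __init__(self, connect, is_valid, max_idle_seconds=300.0, clock=time.monotonic):
        self._connect = connect        # factory that creates a new connection
        self._is_valid = is_valid      # cheap health check, e.g. running "SELECT 1"
        self._max_idle = max_idle_seconds
        self._clock = clock
        self._idle = []                # list of (connection, time_returned)

    def borrow(self):
        """Return a validated connection, discarding stale or broken ones."""
        while self._idle:
            conn, returned_at = self._idle.pop()
            too_old = self._clock() - returned_at > self._max_idle
            if not too_old and self._is_valid(conn):
                return conn
            # Evicted: idle too long or failed validation
            # (a real pool would also close the connection here).
        return self._connect()

    def give_back(self, conn):
        """Return a connection to the pool and record when it became idle."""
        self._idle.append((conn, self._clock()))
```

With validation on borrow, a connection closed by the remote side is replaced transparently rather than causing the next query to fail.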


Lessons Learned
