Date

Authors

Philip Wilsdon (Unlicensed)

Status

Documenting

Summary

https://hee-tis.atlassian.net/browse/TIS21-1300

Impact

  • An increase in the number of helpdesk calls, as people could not see trainees in Intrepid Leave Manager. Some Tableau reports would have been inaccurate in the morning.

Non-technical Description

...

  • Alerting in our monitoring channel

...

Resolution

  • Reran the ETLs (resolved some downstream impact)

  • Intrepid Leave Manager

...

Timeline

  • - 04:41 & 05:02 - Alerts in the NDW monitoring channel

  • - 09:06 - NDW successfully re-run on stage and prod

  • - ~midday - Agreed with HICOM that waiting until tomorrow would be less disruptive than re-running jobs

  • - 10:50 - Confirmed that Leave Manager looks correct; standard checks are displaying normal results.

...

Root Cause(s)

  • The Ansible job timed out

  • The ETL kept retrying a failed chunk

  • All connections to NDW got closed
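
Two of the root causes above (a failed chunk being retried repeatedly, and closed connections) are often handled by bounding the number of retries and checking the connection before each attempt. The following is a minimal sketch in Python, for illustration only: it is not the NDW ETL's actual code, and every name in it (`load_chunk_with_retry`, `ChunkFailed`, `is_connection_alive`, `reconnect`) is hypothetical.

```python
import time


class ChunkFailed(Exception):
    """Raised when a chunk still fails after the allowed number of attempts."""


def load_chunk_with_retry(load, chunk, is_connection_alive, reconnect,
                          max_attempts=3, backoff_seconds=1.0, sleep=time.sleep):
    """Try to load one chunk, giving up after max_attempts.

    All callables are supplied by the caller (hypothetical hooks):
      load(chunk)            -- writes the chunk to the target database
      is_connection_alive()  -- returns True if the connection is usable
      reconnect()            -- replaces a closed or broken connection
    Returns the attempt number on which the load succeeded.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            # Validate the connection first, so a closed connection is
            # replaced instead of causing the same failure again.
            if not is_connection_alive():
                reconnect()
            load(chunk)
            return attempt
        except Exception as exc:  # a real job would catch narrower errors
            last_error = exc
            if attempt < max_attempts:
                # Linear backoff between attempts.
                sleep(backoff_seconds * attempt)
    # Fail loudly rather than retrying forever, so the job ends with a
    # clear error instead of running into an external timeout.
    raise ChunkFailed(
        f"chunk failed after {max_attempts} attempts"
    ) from last_error
```

With a bounded retry like this, a persistently failing chunk surfaces as a single clear error that alerting can pick up, rather than as a job that keeps retrying until something else times out.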

...

Action Items

Action Items | Owner

Improve Alerting from NDW ETLs (probably using Sentry) https://hee-tis.atlassian.net/browse/TIS21-1316 | Marcello Fabbri (Unlicensed), Edward Barclay

Configure persistent logs from NDW ETLs https://hee-tis.atlassian.net/browse/TIS21-1317 | Edward Barclay, Marcello Fabbri (Unlicensed)

Improve Connection Pools: validation, eviction etc.? |

...

Lessons Learned