Date

Authors

Philip Wilsdon (Unlicensed)

Status

Done

Summary

https://hee-tis.atlassian.net/browse/TIS21-1300

Impact

  • An increase in the number of helpdesk calls, as people could not see trainees in Intrepid Leave Manager. Some Tableau reports would have been inaccurate in the morning.

...

There was an interruption that stopped TIS data being sent to Hicom via the NDW.

...

Trigger

  • Connection termination, probably due to NDW maintenance

...

Detection

  • Alerting in our monitoring channel

...

  • 04:41 & 05:02 - Alerts in the NDW monitoring channel

  • 08:30 - Notification to users on Slack

  • 09:06 - NDW successfully re-run on stage and prod

  • 11:45 - NDW team confirm downstream

  • ~midday - Agreed with Hicom that waiting for tomorrow would cause less disruption than re-running jobs

  • 10:50 - Confirmed that Leave Manager looks correct; standard checks are displaying normal results.

...

  • The Ansible job timed out.

  • The ETL kept retrying a failed chunk, and every retry failed as well:

  • All connections in the pool to NDW got closed and didn’t get re-created.
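
The failure mode in the points above (every pooled connection closed by the remote end, so each retry reuses a dead connection) can be illustrated with a minimal sketch. It is not the actual ETL code: `FakeConnection`, `ValidatingPool` and `write_chunk` are hypothetical names invented for this example, and the "validation" is a simple flag check where a real pool would more likely run a cheap test query. The idea shown is validate-on-borrow: a connection is checked before it is handed out, dead ones are evicted, and a fresh one is created when none survive.

```python
import itertools


class StaleConnectionError(Exception):
    """Raised by FakeConnection when it is used after being closed."""


class FakeConnection:
    """Hypothetical stand-in for a real database connection."""

    _ids = itertools.count(1)

    def __init__(self):
        self.id = next(self._ids)
        self.closed = False

    def is_valid(self):
        # Assumption: a real pool would run a cheap validation query here.
        return not self.closed

    def execute(self, statement):
        if self.closed:
            raise StaleConnectionError(f"connection {self.id} is closed")
        return f"ok: {statement}"


class ValidatingPool:
    """Pool that tests each connection on borrow and replaces dead ones."""

    def __init__(self, factory, size):
        self._factory = factory
        self._idle = [factory() for _ in range(size)]

    def borrow(self):
        while self._idle:
            conn = self._idle.pop()
            if conn.is_valid():
                return conn
            # Dead connection: evict it by not returning it to the idle list.
        # No valid idle connection left, so create a fresh one.
        return self._factory()

    def give_back(self, conn):
        # Only healthy connections go back into the pool.
        if conn.is_valid():
            self._idle.append(conn)


def write_chunk(pool, chunk, max_attempts=3):
    """Write one chunk, borrowing a validated connection for each attempt."""
    last_error = RuntimeError("max_attempts must be at least 1")
    for _ in range(max_attempts):
        conn = pool.borrow()
        try:
            result = conn.execute(f"INSERT chunk {chunk}")
            pool.give_back(conn)
            return result
        except StaleConnectionError as err:
            # Connection died mid-use; the next attempt borrows again.
            last_error = err
    raise last_error
```

If every idle connection in a `ValidatingPool` is marked closed to simulate the remote termination, `write_chunk(pool, 7)` still succeeds in this sketch, because `borrow` discards the dead connections and creates a new one rather than handing out a stale one on each retry.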

...

Action Items

Improve Alerting from NDW ETLs (probably using Sentry)

Action Items

Owner

Ansible retries (Park this for now?)

John Simmons (Deactivated)

https://hee-tis.atlassian.net/browse/TIS21-1316

Marcello Fabbri (Unlicensed) Edward Barclay

Configure persistent logs from NDW ETLs https://hee-tis.atlassian.net/browse/TIS21-13171318

Edward Barclay Marcello Fabbri (Unlicensed)

Improve Connection Pools: validation, eviction etc.? https://hee-tis.atlassian.net/browse/TIS21-1319

Andy Dingley

...

Lessons Learned