...

There was an interruption that stopped TIS data being sent to Hicom via the NDW.

...

Trigger

  • Connection termination, probably due to NDW maintenance

...

Detection

  • Alerting in our monitoring channel

...

  • 04:41 & 05:02 - Alerts in the NDW monitoring channel

  • 08:30 - Notification to users on Slack

  • 09:06 - NDW successfully re-run on stage and prod

  • 10:50 - Confirmed that Leave Manager looks correct; standard checks are displaying normal results

  • 11:45 - NDW team confirm downstream

  • ~midday - Agreed with HICOM that waiting until tomorrow would be less disruptive than re-running jobs

...

  • The Ansible job timed out.

  • The ETL kept retrying a failed chunk, but every retry failed as well:

  • All connections in the pool to NDW were closed and were not re-created.
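
The failure mode above (pooled connections closed by the remote side and never replaced) is what connection validation and eviction are meant to prevent. The following is a minimal sketch only, not the ETL's actual pool: the class `ValidatingPool`, the helper `open_connection` and the use of `sqlite3` as a stand-in for the NDW connection are all assumptions made so that the example runs on its own.

```python
import queue
import sqlite3


class ValidatingPool:
    """Hypothetical tiny pool that validates connections when they are borrowed."""

    def __init__(self, connect, size=2, validation_query="SELECT 1"):
        self._connect = connect  # factory that opens a new connection
        self._validation_query = validation_query
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(connect())

    def _is_alive(self, conn):
        # A cheap query tells us whether the connection is still usable.
        try:
            conn.execute(self._validation_query)
            return True
        except Exception:
            return False

    def borrow(self):
        conn = self._idle.get()
        if not self._is_alive(conn):
            # Evict the dead connection and open a replacement.
            try:
                conn.close()
            except Exception:
                pass
            conn = self._connect()
        return conn

    def give_back(self, conn):
        self._idle.put(conn)


def open_connection():
    # Stand-in for the real NDW connection (assumption for illustration).
    return sqlite3.connect(":memory:")


pool = ValidatingPool(open_connection, size=2)

# Simulate the incident: every pooled connection is closed behind the pool's back.
for _ in range(2):
    c = pool.borrow()
    c.close()
    pool.give_back(c)

# Without validation this borrow would hand out a closed connection.
conn = pool.borrow()
result = conn.execute("SELECT 41 + 1").fetchone()[0]
pool.give_back(conn)
```

With validation on borrow, a retry of the failed chunk would receive a fresh connection instead of failing again on a closed one. Production pool libraries usually offer equivalent settings (test on borrow, idle eviction), which would normally be preferable to a hand-written pool.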

...

Action Items

Action Items

Owner

Ansible retries Improve Alerting from NDW ETLs (probably using Sentry) (Park this for now?)

John Simmons (Deactivated)

https://hee-tis.atlassian.net/browse/TIS21-1316

Marcello Fabbri (Unlicensed) Edward Barclay

Configure persistent logs from NDW ETLs https://hee-tis.atlassian.net/browse/TIS21-13171318

Edward Barclay Marcello Fabbri (Unlicensed)

Improve Connection Pools: validation, eviction etc.? https://hee-tis.atlassian.net/browse/TIS21-1319

Andy Dingley

...

Lessons Learned