Date |
|
Authors | |
Status | Documenting / Done |
Summary | |
Impact |
|
Table of Contents |
---|
Non-technical Description
There was an interruption that stopped TIS data being sent to Hicom via the NDW.
...
Trigger
Connection termination, probably due to NDW maintenance
...
Detection
Alerting in our monitoring channel
...
Resolution
Reran the ETLs (resolved some downstream impact)
Intrepid Leave Manager
...
Timeline
- 04:41 & 05:02 - Alerts in the NDW monitoring channel
- 08:30 - Notification to users on Slack
- 09:06 - NDW ETLs successfully re-run on stage and prod
- 11:45 - NDW team confirm downstream
- ~midday - Agreed with Hicom that waiting until tomorrow would be less disruptive than re-running jobs
- 10:50 - Confirmed that Leave Manager looks correct; standard checks are displaying normal results.
...
Root Cause(s)
The Ansible job timed out.
The ETL kept retrying a failed chunk. Retries all failed as well:
All connections in the pool to NDW got closed and didn’t get re-created.
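The pool failure described above (closed connections never being replaced) is commonly addressed by validating connections when they are borrowed and evicting dead or long-idle ones. The following is a minimal Python sketch of that idea only; it is not the NDW ETL's actual code, whose language and pooling library are not stated here. `ValidatingPool`, `FakeConnection`, `connect` and `is_alive` are hypothetical names introduced for illustration.

```python
import queue
import time


class ValidatingPool:
    """Illustrative pool: validates on borrow, evicts dead or idle connections."""

    def __init__(self, connect, is_alive, size=2, max_idle_seconds=300.0,
                 clock=time.monotonic):
        self._connect = connect          # hypothetical factory returning a new connection
        self._is_alive = is_alive        # hypothetical health check (e.g. a cheap test query)
        self._max_idle = max_idle_seconds
        self._clock = clock
        self._idle = queue.LifoQueue(maxsize=size)

    def acquire(self):
        # Work through idle connections until a healthy one is found.
        while True:
            try:
                conn, returned_at = self._idle.get_nowait()
            except queue.Empty:
                # Nothing usable is left: create a fresh connection
                # rather than handing out (or retrying on) a dead one.
                return self._connect()
            too_old = self._clock() - returned_at > self._max_idle
            if not too_old and self._is_alive(conn):
                return conn
            self._close_quietly(conn)    # evict the stale or broken connection

    def release(self, conn):
        try:
            self._idle.put_nowait((conn, self._clock()))
        except queue.Full:
            self._close_quietly(conn)    # pool is full, so discard the extra connection

    @staticmethod
    def _close_quietly(conn):
        try:
            conn.close()
        except Exception:
            pass                         # the connection may already be closed remotely


class FakeConnection:
    """Stand-in for a real database connection, used only for the demo."""

    def __init__(self):
        self.open = True

    def close(self):
        self.open = False


if __name__ == "__main__":
    pool = ValidatingPool(connect=FakeConnection, is_alive=lambda c: c.open)
    first = pool.acquire()
    pool.release(first)
    first.open = False                   # simulate the remote side closing the connection
    second = pool.acquire()
    print(second is not first and second.open)
```

With validation on borrow, a retry after a dropped connection obtains a newly created connection instead of repeatedly failing on one that the remote side has already closed.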
...
Action Items
Action Items | Owner |
---|---|
Improve Alerting from NDW ETLs (probably using Sentry) | |
Ansible retries (Park this for now?) | |
Configure persistent logs from NDW ETLs | Marcello Fabbri (Unlicensed), Edward Barclay |
Improve Connection Pools: validation, eviction etc. https://hee-tis.atlassian.net/browse/TIS21-1319 | |
...