Date |
|
Authors | |
Status | Documenting |
Summary | |
Impact |
|
Non-technical Description
There was an interruption that stopped TIS data being sent to Hicom via the NDW.
Trigger
Detection
Alerting in our monitoring channel
Resolution
Reran the ETLs (resolved some downstream impact)
Intrepid Leave Manager
Timeline
- 04:41 & 05:02 - Alerts in the NDW monitoring channel
- 09:06 - NDW successfully re-run on stage and prod
- 10:50 - Confirmed that Leave Manager looks correct; standard checks show normal results
- ~midday - Agreed with Hicom that waiting until tomorrow would cause less disruption than re-running jobs
Root Cause(s)
The Ansible job timed out
The ETL kept retrying a failed chunk
All connections to NDW got closed
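One way to stop an ETL from retrying a failed chunk indefinitely is to cap the number of attempts and back off between them. The following is a minimal sketch only, not the NDW ETL's actual code: the names `run_with_bounded_retries` and `ChunkFailed`, the attempt limit, and the delays are all hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ChunkFailed(Exception):
    """Raised when a chunk still fails after the final attempt (hypothetical name)."""


def run_with_bounded_retries(
    task: Callable[[], T],
    max_attempts: int = 3,      # assumed limit, tune per job
    base_delay: float = 1.0,    # assumed initial delay in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run task, retrying on any exception up to max_attempts times.

    The delay doubles after each failed attempt (1s, 2s, 4s, ...).
    After the last attempt the failure is raised instead of retried,
    so the job ends with a visible error rather than looping.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception as exc:
            if attempt == max_attempts:
                raise ChunkFailed(f"gave up after {attempt} attempts") from exc
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Failing loudly after a fixed number of attempts lets the job finish (and alert) before an outer timeout, such as the Ansible one, is hit.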
Action Items
Action Items | Owner |
---|---|
Ansible retries | |
Improve Alerting from NDW ETLs (probably using Sentry) https://hee-tis.atlassian.net/browse/TIS21-1316 | |
Configure persistent logs from NDW ETLs https://hee-tis.atlassian.net/browse/TIS21-1317 | |
Improve Connection Pools: validation, eviction etc.? | |
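To illustrate what "validation" and "eviction" could mean for the connection pool item, here is a minimal sketch of validate-on-borrow: each idle connection is checked with a cheap query before being handed out, and dead connections are dropped and replaced. It uses Python's built-in `sqlite3` purely so the example is runnable; the class `ValidatingPool` and its parameters are hypothetical, and the real ETLs would more likely configure the equivalent options in whichever pool library they already use.

```python
import sqlite3
from collections import deque
from typing import Callable


class ValidatingPool:
    """Tiny illustrative pool (hypothetical): validates connections on borrow."""

    def __init__(
        self,
        connect: Callable[[], sqlite3.Connection],
        size: int = 2,
        validation_query: str = "SELECT 1",
    ) -> None:
        self._connect = connect
        self._validation_query = validation_query
        self._idle = deque(connect() for _ in range(size))

    def _is_alive(self, conn: sqlite3.Connection) -> bool:
        # A closed or broken connection raises a sqlite3.Error subclass here.
        try:
            conn.execute(self._validation_query)
            return True
        except sqlite3.Error:
            return False

    def acquire(self) -> sqlite3.Connection:
        # Evict dead idle connections until a live one is found.
        while self._idle:
            conn = self._idle.popleft()
            if self._is_alive(conn):
                return conn
        # Nothing usable left in the pool: open a fresh connection.
        return self._connect()

    def release(self, conn: sqlite3.Connection) -> None:
        self._idle.append(conn)
```

With this pattern, an event that closes every pooled connection costs one failed validation query per connection rather than a failed chunk.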