Date |
|
Authors | |
Status | Documenting |
Summary | |
Impact |
|
Non-technical Description
...
Alerting in our monitoring channel
...
Resolution
Reran the ETLs (resolved some downstream impact)
Intrepid Leave Manager
...
Timeline
- 04:41 & 05:02 - Alerts in the NDW monitoring channel
- 09:06 - NDW successfully re-run on stage and prod
- 10:50 - Confirmed that Leave Manager looks correct; standard checks are displaying normal results.
- ~midday - Agreed with HICOM that waiting until tomorrow would be less disruptive than re-running jobs
...
Root Cause(s)
The Ansible job timed out
The ETL kept retrying a failed chunk
All connections to NDW were closed
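As an illustration of the second root cause, the sketch below shows one way to cap retries for a failing chunk so that a persistent failure surfaces as an error instead of looping. It is not taken from the NDW ETL code: `run_chunk_with_retries`, `ChunkFailed`, `process_chunk` and the default limits are all hypothetical names and values chosen for the example.

```python
import time


class ChunkFailed(Exception):
    """Raised when a chunk still fails after the final attempt (hypothetical)."""


def run_chunk_with_retries(process_chunk, chunk, max_attempts=3,
                           base_delay=1.0, sleep=time.sleep):
    """Call process_chunk(chunk), retrying a bounded number of times.

    The wait between attempts doubles each time (base_delay, 2x, 4x, ...).
    After max_attempts failures the last error is re-raised as ChunkFailed,
    so the job stops and can be alerted on rather than retrying forever.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return process_chunk(chunk)
        except Exception as exc:
            if attempt == max_attempts:
                raise ChunkFailed(
                    f"chunk {chunk!r} failed after {attempt} attempts"
                ) from exc
            sleep(base_delay * 2 ** (attempt - 1))
```

A caller would wrap each chunk, for example `run_chunk_with_retries(load_chunk, chunk_id)`, where `load_chunk` stands in for whatever function the ETL uses to process one chunk.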
...
Action Items
Action Items | Owner |
---|---|
Improve Alerting from NDW ETLs (probably using Sentry) https://hee-tis.atlassian.net/browse/TIS21-1316 | |
Configure persistent logs from NDW ETLs https://hee-tis.atlassian.net/browse/TIS21-1317 | |
Improve Connection Pools: validation, eviction, etc.? | |
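To make the connection pool item concrete, here is a minimal sketch of the two ideas it names: validating a connection before handing it out, and evicting connections that have sat idle too long. It is an assumption-laden illustration rather than the pool the ETLs use. `ValidatingPool`, `connect`, `is_valid`, `close` and `max_idle_seconds` are hypothetical; in practice a pooling library's own validation and eviction settings would normally be configured instead.

```python
import time
from collections import deque


class ValidatingPool:
    """Tiny single-threaded pool sketch with validate-on-borrow and idle eviction."""

    def __init__(self, connect, is_valid, close=lambda conn: None,
                 max_idle_seconds=300.0, clock=time.monotonic):
        self._connect = connect          # creates a new connection
        self._is_valid = is_valid        # cheap health check, e.g. a ping query
        self._close = close              # releases an evicted connection
        self._max_idle = max_idle_seconds
        self._clock = clock
        self._idle = deque()             # (connection, time it was returned)

    def acquire(self):
        # Reuse the most recently returned connection that is still usable.
        while self._idle:
            conn, returned_at = self._idle.pop()
            too_old = self._clock() - returned_at > self._max_idle
            if not too_old and self._is_valid(conn):
                return conn
            self._close(conn)            # evict stale or broken connections
        return self._connect()

    def release(self, conn):
        self._idle.append((conn, self._clock()))
```

With this shape, a connection closed by the remote side is discarded at `acquire` time and replaced by a fresh one, instead of being handed to a job that then fails.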
...