Date | |
Authors | |
Status | Documenting |
Summary | |
Impact | |
Non-technical Description
There was an interruption that stopped TIS data being sent to Hicom via the NDW.
Trigger
Detection
Alerting in our monitoring channel
Resolution
Timeline
- 04:41 & 05:02 - Alerts in the NDW monitoring channel
- 09:06 - NDW successfully re-run on stage and prod
- ~midday - Agreed with HICOM that waiting until tomorrow would be less disruptive than re-running jobs
Root Cause(s)
The Ansible job timed out
The ETL kept retrying a failed chunk
All connections to the NDW got closed
Action Items
Action Items | Owner |
---|---|
Improve Alerting from NDW ETLs (probably using Sentry) | |
Configure persistent logs from NDW ETLs | |
Improve Connection Pools: validation, eviction etc. | |
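
As a starting point for the connection pool action item, the sketch below shows what "validation" and "eviction" could look like in practice: each idle connection is checked with a cheap query before it is handed out, and connections that are dead or have been idle too long are dropped and replaced. This is only an illustration. The report does not say which language or pooling library the NDW ETLs use, so the `ValidatingPool` class, its parameters and the use of `sqlite3` as a stand-in database are all assumptions, and a real fix would more likely use the equivalent settings of whatever pool the ETLs already have.

```python
import sqlite3
import time
from collections import deque


class ValidatingPool:
    """Hypothetical minimal pool with validate-on-borrow and idle eviction."""

    def __init__(self, connect, max_idle_seconds=300.0, clock=time.monotonic):
        self._connect = connect            # factory that opens a new connection
        self._max_idle = max_idle_seconds  # idle age after which we evict
        self._clock = clock                # injectable so eviction can be tested
        self._idle = deque()               # (connection, time it was released)

    def _is_valid(self, conn):
        # Validation query: a closed or broken connection raises here.
        try:
            conn.execute("SELECT 1")
            return True
        except sqlite3.Error:
            return False

    def acquire(self):
        while self._idle:
            conn, released_at = self._idle.popleft()
            too_old = self._clock() - released_at > self._max_idle
            if too_old or not self._is_valid(conn):
                # Evict stale or dead connections instead of handing them out.
                try:
                    conn.close()
                except sqlite3.Error:
                    pass
                continue
            return conn
        # Nothing usable in the pool, so open a fresh connection.
        return self._connect()

    def release(self, conn):
        self._idle.append((conn, self._clock()))
```

With a pool along these lines, a connection closed on the remote side is detected when it is borrowed, so a retried chunk gets a fresh connection rather than failing repeatedly on a dead one.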