...
Non-technical Description
An interruption stopped TIS data from being sent to Hicom via the NDW.
...
Trigger
...
Detection
Alerting in our monitoring channel
...
- 04:41 & 05:02 - Alerts in the NDW monitoring channel
- 09:06 - NDW successfully re-run on stage and prod
- ~midday - Agreed with HICOM that waiting until tomorrow would be less disruptive than re-running the jobs
...
Root Cause(s)
The Ansible job timed out
The ETL kept retrying a failed chunk
All connections to the NDW got closed
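
One way to stop an ETL from retrying a failed chunk indefinitely is to cap the number of attempts and back off between them, so the failure surfaces quickly instead of running into the job timeout. The following is a minimal sketch only, not the NDW ETL code: `run_with_bounded_retries`, `ChunkFailed`, `max_attempts` and `base_delay_s` are hypothetical names chosen for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ChunkFailed(Exception):
    """Raised when a chunk still fails after the final allowed attempt."""


def run_with_bounded_retries(
    task: Callable[[], T],
    max_attempts: int = 3,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `task`, retrying at most `max_attempts` times in total.

    The delay doubles after each failed attempt (1s, 2s, 4s, ...).
    All names and defaults here are illustrative assumptions.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception as exc:
            if attempt == max_attempts:
                # Give up and surface the failure instead of looping forever.
                raise ChunkFailed(
                    f"gave up after {max_attempts} attempts"
                ) from exc
            sleep(base_delay_s * 2 ** (attempt - 1))

    raise AssertionError("unreachable")
```

A caller would wrap the per-chunk load in `run_with_bounded_retries` and let `ChunkFailed` propagate, so the failure reaches alerting rather than being retried silently.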
...
Action Items
Action Items | Owner |
---|---|
Lessons Learned
...
Improve Alerting from NDW ETLs (probably using Sentry) | |
Configure persistent logs from NDW ETLs | |
Improve Connection Pools: validation, eviction etc. | |
...