Date |
|
Authors | Andy Nash, Joseph (Pepe) Kelly, John Simmons |
Status | LiveDefect: In progress
Summary | On Friday morning we received a monitoring alert that the NDW ETL (Prod) had failed overnight. Following an initial RCA, we discovered issues with the Blue server (one half of the load-balanced application, to which roughly 50% of traffic is directed). |
Impact | NDW was not updated in Prod; roughly half our users could not access TIS; bulk upload and (current) Reval were not functioning.
Timeline
??.?? Friday 13 November 2020 | |
??.?? Friday 13 November 2020 | |
??.?? Friday 13 November 2020 | |
??.?? Friday 13 November 2020 |
Root Causes
A major version update to one of our core infrastructure tools caused a failure in a dependent tool, with a resulting domino effect.
Trigger
.
Resolution
.
Detection
NDW ETL (Prod) failure alert on Slack
Actions
.
Lessons Learned (Good and Bad)
.