Date

13 November 2020

Authors

Andy Nash, Joseph (Pepe) Kelly, John Simmons

Status

Live defect: in progress.

Summary

On Friday morning we received a monitoring alert that the NDW ETL (Prod) had failed overnight. An initial root cause analysis (RCA) uncovered issues with the Blue server, one half of the load-balanced application, which receives roughly 50% of traffic.

Impact

  • NDW not updated in Prod.

  • Roughly half of our users could not access TIS.

  • Bulk upload not functioning.

  • (Current) Reval not functioning.

Timeline

??.?? Friday 13 November 2020

??.?? Friday 13 November 2020

??.?? Friday 13 November 2020

??.?? Friday 13 November 2020

Root Causes

  • A major version upgrade of one of our core infrastructure tools caused a failure in a dependent tool, which in turn triggered a domino effect across other services.

Trigger

  • .

Resolution

  • .

Detection

  • NDW ETL (Prod) failure alert on Slack

Actions

  • .

Lessons Learned (Good and Bad)

  • .