Date | Friday 13 November 2020 |
Authors | Andy Nash, Joseph (Pepe) Kelly, John Simmons |
Status | Live defect, in progress |
Summary | On Friday morning we received a monitoring alert that the NDW ETL (Prod) had failed overnight. An initial RCA uncovered issues with the Blue server (one half of the load-balanced application, to which roughly 50% of traffic is directed). |
Impact | NDW not updated in Prod; roughly half of our users could not access TIS; bulk upload not functioning; (current) Reval not functioning |
Non-technical summary
We were tidying up some underlying infrastructure maintenance work (following on from the recent major incident).
In the course of this, we upgraded a component instead of updating it, and the unplanned upgrade had a knock-on impact, essentially rendering TIS inaccessible on one of the two load-balanced servers.
Correcting this took a fair amount of time, so while we were effecting that change we temporarily routed all traffic to the functioning server. That server had some limitations compared with normal access to TIS: Reval was unavailable and Bulk Uploads did not work.
Timeline
Thursday 12 November 2020 | John Simmons fixed the apt repository settings on the blue and green servers |
Thursday 12 November 2020 | John Simmons accidentally ran the wrong command while refreshing the package lists (apt-get upgrade instead of apt-get update), upgrading every installed package rather than just updating the package index (see the apt sketch after this timeline) |
Friday 13 November 2020 | When new Docker containers tried to start, they failed (the NDW and Reval ETLs) |
Friday 13 November 2020 | Docker was restarted on green to try to get the services working, but this caused all services on green to fail to start back up (during this time the blue server was working fine) |
Friday 13 November 2020 | Found that Docker had been upgraded to the latest version (a full major version jump), but docker-compose, which is installed separately, had not been updated to match; docker-compose was then updated to the latest version as well (see the version-check sketch after this timeline) |
Friday 13 November 2020 | Joseph (Pepe) Kelly removed the green server from the load balancer to stop people accessing it and to keep traffic flowing normally to blue |
Friday 13 November 2020 | Due to some of the changes in the new version of Docker, we had to delete all of the containers and the associated Docker networks |
Friday 13 November 2020 | We applied a fix to change the default IP address ranges Docker can use for cross-container traffic (see the Docker network sketch after this timeline) |
Friday 13 November 2020 | We then had to restart all containers to bring the services back, and then let green back into the load balancer to restore full traffic flow |
Friday 13 November 2020 | Planned to fix blue out of hours |
Friday 13 November 2020 @ 10:54am | TCS was deployed into Prod, and this killed the blue server |
Friday 13 November 2020 | Removed the blue server from the load balancer to allow users to continue to use TIS |
Friday 13 November 2020 @ 3:00pm | Performed the same operation on the blue server as had been applied to green, and made sure all services were working well |
Friday 13 November 2020 @ 4:00pm | TIS returned to fully working status |
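
The apt mix-up is easy to trip over, so here is a minimal sketch of the difference between the two commands (the packages involved are not listed in this report):

# Intended command: refresh the local package index only; installs nothing.
sudo apt-get update

# Command actually run: installs the newest available version of every
# package already on the system, which is how Docker jumped a major version.
sudo apt-get upgrade

# A safer habit for routine maintenance: simulate first and review the plan.
sudo apt-get upgrade --dry-run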
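
A quick way to spot the version mismatch the team found, plus one possible guard against a repeat. The package names in the hold command are assumptions (docker-ce on Docker's own repository; stock Debian/Ubuntu repositories use docker.io instead):

# Compare the engine and compose versions on each server.
docker version --format '{{.Server.Version}}'
docker-compose version --short

# Pin the engine packages so a routine apt-get upgrade cannot move them.
sudo apt-mark hold docker-ce docker-ce-cli containerd.io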
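
The default-address-pools setting in /etc/docker/daemon.json is the standard way to change the IP ranges Docker assigns to container networks. A sketch of the fix as described; the CIDR range shown is illustrative, as the actual range chosen for Prod is not recorded here, and the final step assumes compose-managed services:

# Write the new address pools into the Docker daemon configuration.
sudo tee /etc/docker/daemon.json > /dev/null <<'EOF'
{
  "default-address-pools": [
    { "base": "10.200.0.0/16", "size": 24 }
  ]
}
EOF

# Remove the old containers and networks, restart the daemon to pick up
# the new settings, then recreate the services.
docker rm -f $(docker ps -aq)
docker network prune -f
sudo systemctl restart docker
docker-compose up -d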
Root Causes
A major version upgrade of one of our core infrastructure tools (Docker) caused a failure in a dependent tool (docker-compose), with a resulting domino effect.
Trigger
Accidentally upgrading rather than updating part of the underlying TIS infrastructure (apt-get upgrade run instead of apt-get update).
Resolution
Upgrading all dependent services on each of the two load-balanced servers in turn, forcing all traffic to the functioning server while the other was being fixed.
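
The report does not name the load balancer in use, so purely as a hypothetical illustration: if it were HAProxy, a server can be drained and restored at runtime via the admin socket, without a reload (the backend and server names below are invented):

# Take green out of rotation while it is worked on...
echo "disable server tis_backend/green" | sudo socat stdio /run/haproxy/admin.sock
# ...and put it back once services are verified healthy.
echo "enable server tis_backend/green" | sudo socat stdio /run/haproxy/admin.sock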
Detection
NDW ETL (Prod) failure alert on Slack
Actions
Lessons Learned (Good and Bad)