2020-11-13 NDW and Reval ETL failures
Date | Nov 13, 2020 |
Authors | @Andy Nash (Unlicensed) @Joseph (Pepe) Kelly @John Simmons (Deactivated) |
Status | Complete |
Summary | On Friday morning we received a monitoring alert that the NDW ETL (Prod) had failed overnight. Following an initial RCA, we discovered issues with the Blue server (one half of the load-balanced application, to which roughly 50% of traffic is directed). |
Impact | NDW not updated in Prod, roughly half our users could not access TIS, bulk upload not functioning, (current) Reval not functioning until the ETLs could be run. |
Non-technical summary
While working on a ticket related to the Azure to Amazon migration, some of the servers needed an update so that they could perform a software upgrade when needed: there were some out-of-date references to locations on the internet that no longer worked or existed, and this was stopping the update process from working.
In the course of this, we fixed those errors, but by mistake we upgraded all of the software on the virtual server instead of just refreshing its list of available software (including security and bug fixes). The unplanned upgrade had a knock-on impact on our ability to run components of TIS. This was only noticed the next day when the ETLs tried to run.
Correcting this took a fair bit of time, and initially the failure spread to more components as we started to rectify the situation, essentially rendering TIS inaccessible on one of the two load-balanced servers. So while we were making that change, we temporarily routed everyone to the functioning server. That server had some limitations compared to normal access to TIS: Bulk Upload wasn't available, and Reval was unavailable because the ETLs had not run overnight. After the first machine was fixed, both the NDW ETL and the Reval ETL were run successfully.
A later, much shorter incident occurred when we tried to release new functionality of the TIS application. Later that day we scheduled the fix for the other virtual server, so as to get back to a fully working system as soon as possible without interfering with the users of TIS going about their normal daily business.
Timeline
Thursday 12 November 2020 | @John Simmons (Deactivated) Fixed the apt repository settings on the blue and green servers |
Thursday 12 November 2020 | @John Simmons (Deactivated) Accidentally ran the wrong command when refreshing the servers' package lists (apt-get upgrade instead of apt-get update) |
Friday 13 November 2020 | When new docker containers tried to start, they failed (NDW ETL and Reval ETLs) |
Friday 13 November 2020 | @Joseph (Pepe) Kelly Docker was restarted on the green server to try to get the services working, but this caused all services on green to fail to start back up (during this time the blue server was working fine) |
Friday 13 November 2020 | Found that Docker had been upgraded to the latest version (a full major version) but Docker Compose, which is versioned independently, hadn't been updated to match (it was then updated to the latest version as well) |
Friday 13 November 2020 | @Joseph (Pepe) Kelly Removed the green server from the load balancer to stop people accessing it and keep traffic flowing normally to blue |
Friday 13 November 2020 | Due to some of the changes in the new version of Docker, we had to delete all of the containers and the associated Docker networks. |
Friday 13 November 2020 | We applied a fix to change the default IP address ranges Docker can use for cross-container traffic |
Friday 13 November 2020 | Restarted all containers to get the services back, then let green back into the load balancer to enable full traffic flow. All services restored. |
Friday 13 November 2020 | Planned to fix blue out of hours |
Friday 13 November 2020 @10:54am | TCS was deployed into prod, and this killed the blue server (TCS only) |
Friday 13 November 2020 | Removed the blue server from the load balancer to allow users to continue to use TIS |
Friday 13 November 2020 @ 3:00pm | Performed the same operation on the blue server as had been applied to green, and made sure all services were working well |
Friday 13 November 2020 @4:00pm | TIS returned to fully working status |
Root Causes
An accidental major version upgrade of one of our core infrastructure tools (Docker) caused a failure in a dependent tool (Docker Compose). The containers that were already running carried on, but no new containers could launch, i.e. ETLs or newly deployed software versions.
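A quick way to spot this kind of drift in future is to compare the two versions directly. A minimal sketch, assuming the standalone docker-compose binary that was in use at the time:

```
# Print the Docker engine version (the component that was accidentally upgraded)
docker version --format '{{.Server.Version}}'

# Print the Docker Compose version, which is installed and versioned separately
docker-compose version --short
```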
Trigger
Accidentally upgrading rather than updating the servers that the TIS infrastructure runs on.
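For clarity, the difference between the two apt-get commands (standard Debian/Ubuntu behaviour, shown here only as an illustration):

```
sudo apt-get update    # refreshes the package lists only; installs nothing
sudo apt-get upgrade   # installs the newest available version of every installed package
```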
Resolution
Remove the server to be worked on from the load balancer so that all inbound TIS traffic is diverted to the working server.
Stop all Docker containers.
Restart the server so all of the upgrades/updates apply correctly.
Remove the old containers.
Remove the Docker networks associated with each of those containers.
Apply network fixes to move Docker's default address range to one that doesn't overlap with the new AWS infrastructure (see the sketch after this list).
Restart all of the containers. The two servers had to be fixed in series because, while this process was happening (approximately 1 hour per server), the other server was keeping TIS up for everyone else.
Check all services are up and working.
Allow the load balancer to send traffic to the fixed server.
Rinse and repeat with the second server.
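A condensed sketch of the per-server recovery described above. The address pool shown (172.28.0.0/16) is illustrative only, not the range actually chosen, and the compose projects/paths are omitted:

```
# Stop and remove the existing containers and their now-broken networks
docker stop $(docker ps -q)
docker rm $(docker ps -aq)
docker network prune -f

# Move Docker's default address pools away from ranges used by the new AWS infrastructure
# (illustrative range; the real value was chosen to avoid any overlap)
sudo tee /etc/docker/daemon.json <<'EOF'
{
  "default-address-pools": [
    { "base": "172.28.0.0/16", "size": 24 }
  ]
}
EOF
sudo systemctl restart docker

# Recreate the containers (e.g. docker-compose up -d for each stack) and verify
docker ps
```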
Detection
NDW ETL (Prod) failure alert on Slack
Reval / GMC sync ETL failure alerts on Slack
Actions
Keep everything more up to date to reduce the impact of major upgrades in future
Ensure one person is not a single point of failure: require code reviews for infrastructure changes
Specific changes to the architecture to improve resilience:
Use of ‘serverless’ technology: ECS, RDS, DocumentDB
Leverage the AWS fault-tolerant infrastructure
Decommission old Reval as soon as it's possible to
Check STAGE matches the PROD upgrade (completed)
Lessons Learned (Good and Bad)
Good. The load-balanced infrastructure works and we were able to keep TIS mostly working while we performed the fixes.
Bad. More care needs to be taken with the commands issued to production servers.
Repeatable playbooks, applied to non-production servers first.
Bad. Highlighted where we could do with more redundancy in the architecture.
Load balancer health checks aren’t exhaustive
More granular health checks are required (this will come with ECS; see the sketch below)
Good. This has been applied to the TIS “2.0” Infrastructure roadmap.
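As an illustration of what a more granular load balancer health check could look like once we are on AWS/ECS (a sketch only: the target group, health-check path and thresholds are assumptions, not the actual TIS configuration):

```
# Point the ALB target group at an application-level health endpoint (illustrative values)
aws elbv2 modify-target-group \
  --target-group-arn <target-group-arn> \
  --health-check-path /health \
  --health-check-interval-seconds 30 \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 3
```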
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213