Date |
Authors |
Status | Working on it
Summary |
Impact |
Non-technical summary
Timeline
- 09:48 AM | |
- 10:25 AM | Created ticket and incident page https://hee-tis.atlassian.net/browse/TISNEW-5728 |
Root Causes
An accidental major version upgrade of one of our core infrastructure tools caused a failure in a dependent tool. Containers that were already running were unaffected, but no new containers could be launched, i.e. ETLs or newly deployed software versions.
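A quick way to confirm this failure mode is to check that existing containers are still running while a new container fails to start. A minimal sketch using the Python Docker SDK (assuming the `docker` package is installed and the daemon socket is reachable; the test image is illustrative):

```python
import docker

client = docker.from_env()

# Containers that were already running keep serving traffic.
for container in client.containers.list():
    print(f"still running: {container.name} ({container.status})")

# Launching a new container exposes the broken dependency.
try:
    client.containers.run("hello-world", remove=True)
    print("new containers can launch")
except docker.errors.DockerException as exc:
    print(f"new containers cannot launch: {exc}")
```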
Trigger
Accidentally upgrading, rather than updating, the servers that the TIS infrastructure runs on.
Resolution
Remove the server to be worked on from the load balancer so that all inbound TIS traffic is diverted to the working server.
Stop all Docker containers running on that server.
Restart the server so that all of the upgrades/updates apply correctly.
Remove the old containers.
Remove the Docker networks associated with each of those containers.
Apply network fixes to move the Docker network range to a non-overlapping range not in use by the new AWS infrastructure (an overlap check is sketched after this list).
Restart all of the containers in sequence. This had to be done in series because, while this process was happening (approximately one hour per server), the other server in our infrastructure was keeping TIS up for everyone else.
Check that all services are up and working.
Allow the load balancer to send traffic to the fixed server.
Rinse and repeat with the second server.
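The network fix above depended on knowing which Docker network subnets overlapped with the new AWS infrastructure. A minimal sketch of that check using the Python Docker SDK and the standard library `ipaddress` module; the VPC CIDR shown is a placeholder, not the real range:

```python
import ipaddress

import docker

# Placeholder CIDR for the new AWS infrastructure; substitute the real VPC range.
AWS_VPC = ipaddress.ip_network("10.0.0.0/16")

client = docker.from_env()

# Flag any Docker network whose subnet overlaps the AWS range and needs to move.
for network in client.networks.list():
    ipam_config = (network.attrs.get("IPAM") or {}).get("Config") or []
    for config in ipam_config:
        subnet = ipaddress.ip_network(config["Subnet"])
        if subnet.overlaps(AWS_VPC):
            print(f"{network.name}: {subnet} overlaps {AWS_VPC}")
```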
Detection
NDW ETL (Prod) failure alert on Slack
Reval / GMC sync ETL failure alerts on Slack
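Both failures were detected through ETL alerts posted to Slack. A minimal sketch of that style of alert, assuming a Slack incoming webhook (the webhook URL below is a placeholder):

```python
import json
import urllib.request

# Placeholder webhook URL; the real incoming-webhook URL lives in the ETL's configuration.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"


def alert_failure(job_name: str, error: str) -> None:
    """Post an ETL failure message to the team's Slack channel."""
    payload = {"text": f":rotating_light: {job_name} failed: {error}"}
    request = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request, timeout=10)


if __name__ == "__main__":
    alert_failure("NDW ETL (Prod)", "could not start container")
```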
Actions
[insert actions to take to mitigate this happening in future]
e.g.
Keep everything more up to date to reduce the impact of major upgrades in future.
Ensure one person is not a single point of failure: require code reviews for infrastructure changes.
Make specific changes to the architecture (list them) to improve resilience:
Use of ‘serverless’ technology: ECS, RDS, DocumentDB
Check that STAGE matches PROD before applying upgrades (a comparison sketch follows this list).
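One of the proposed actions is to confirm that stage matches production before an upgrade is rolled out. A minimal sketch of that comparison, assuming both Docker daemons are reachable over SSH from the Python SDK (the host URLs are placeholders):

```python
import docker

# Placeholder daemon addresses; in practice these come from the deployment inventory.
STAGE_URL = "ssh://user@stage.example.com"
PROD_URL = "ssh://user@prod.example.com"


def engine_version(base_url: str) -> str:
    """Return the Docker Engine version reported by the daemon at base_url."""
    client = docker.DockerClient(base_url=base_url)
    return client.version()["Version"]


stage, prod = engine_version(STAGE_URL), engine_version(PROD_URL)
print(f"stage={stage} prod={prod} {'match' if stage == prod else 'MISMATCH'}")
```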
Lessons Learned (Good and Bad)
Good. The load-balanced infrastructure works and we were able to keep TIS mostly working while we performed the fixes.
Bad. More care needs to be taken with the commands issued to the production server.
Repeatable playbooks, applied to non-production servers first.
Bad. This highlighted where we could do with more redundancy in the architecture.
Load balancer health checks aren't exhaustive; could they be extended to match Uptime Robot or similar monitoring? (A sketch follows this list.)
Good. This has been applied to the TIS “2.0” Infrastructure roadmap.
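On the health check question above: the load balancer check could be extended to probe each service the way Uptime Robot does, rather than only confirming the host answers. A minimal sketch, with illustrative service names and URLs:

```python
import urllib.request

# Illustrative endpoints; the real list would mirror what Uptime Robot monitors.
SERVICES = {
    "tis-frontend": "http://localhost:8080/",
    "reval": "http://localhost:8091/api/status",
}


def healthy(url: str) -> bool:
    """Return True if the endpoint answers with HTTP 2xx within 5 seconds."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return 200 <= response.status < 300
    except OSError:  # URLError, timeouts and connection errors are all OSErrors
        return False


# A load balancer health check endpoint could aggregate these results
# instead of only checking that the host responds on a single port.
print({name: healthy(url) for name, url in SERVICES.items()})
```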