
Date

Authors

Status

Working on it

Summary

Impact

Non-technical summary

Timeline

- 09:48 AM

- 10:25 AM

Created ticket and incident page https://hee-tis.atlassian.net/browse/TISNEW-5728

Root Causes

  • An accidental major version update to one of our core infrastructure tools caused a failure in a dependent tool. Containers that were already running carried on unaffected, but no new containers could launch, i.e. ETLs or newly deployed software versions.
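
A minimal smoke test for the failure mode described above (our sketch, not an existing TIS script): running containers can look healthy while the daemon is unable to start new ones, which is what broke the ETLs and deployments.

    # Sketch only: verify that the Docker daemon can still launch new containers.
    import subprocess

    def can_launch_new_container() -> bool:
        # hello-world exits immediately; --rm removes the container again afterwards.
        result = subprocess.run(
            ["docker", "run", "--rm", "hello-world"],
            capture_output=True, text=True,
        )
        return result.returncode == 0

    if __name__ == "__main__":
        if can_launch_new_container():
            print("Docker can start new containers.")
        else:
            print("Docker cannot start new containers - investigate before running ETLs or deploying.")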

Trigger

  • Accidentally upgrading, rather than just updating, the servers that the TIS infrastructure runs on.
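
A hedged pre-flight check for this trigger, assuming apt-based hosts (the page does not state the distribution): simulate the upgrade first and refuse to proceed if any of the Docker packages would change, so a routine package refresh cannot silently become a major version jump.

    # Sketch only: dry-run an upgrade and flag changes to watched packages.
    import subprocess
    import sys

    WATCHED = {"docker-ce", "docker-ce-cli", "containerd.io"}  # assumed package names

    def main() -> int:
        # --simulate makes apt print what it *would* do without applying anything.
        result = subprocess.run(
            ["apt-get", "--simulate", "upgrade"],
            capture_output=True, text=True, check=True,
        )
        risky = [line for line in result.stdout.splitlines()
                 if line.startswith("Inst ") and line.split()[1] in WATCHED]
        if risky:
            print("Refusing to continue - review these pending upgrades first:")
            print("\n".join(risky))
            return 1
        print("No watched packages would change; safe to proceed.")
        return 0

    if __name__ == "__main__":
        sys.exit(main())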

Resolution

  • Remove the server to be worked on from the load balancer so that all inbound TIS traffic is diverted to the working server.

  • Stop all running Docker containers.

  • Restart the server so that all of the upgrades/updates apply correctly.

  • Remove the old containers.

  • Remove the Docker networks associated with each of those containers.

  • Apply network fixes to move the Docker network range to one that does not overlap with the addressing used by the new AWS infrastructure.

  • Restart all of the containers in sequence. This had to be done in series because, while the process was happening (approximately 1 hour per server), the other server in our infrastructure was in charge of keeping TIS up for everyone else.

  • Check all services are up and working.

  • Allow the load balancer to send traffic to the fixed server.

  • Rinse and repeat with the second server. (A scripted sketch of the per-server sequence follows this list.)
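
The container/network portion of the sequence above, sketched as a repeatable script. The network name, the subnet and the use of docker-compose are illustrative assumptions; the real values belong in the TIS infrastructure repository, and removing the server from the load balancer and rebooting it remain manual steps outside this script.

    # Sketch only: automate the container/network part of the recovery sequence.
    import subprocess

    def run(*cmd: str) -> str:
        print("+", " ".join(cmd))
        return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

    def recover_server() -> None:
        # Stop every running container (the server should already be out of the load balancer).
        running = run("docker", "ps", "-q").split()
        if running:
            run("docker", "stop", *running)
        # Remove the old containers and the Docker networks associated with them.
        run("docker", "container", "prune", "-f")
        run("docker", "network", "prune", "-f")
        # Recreate the application network on a range that does not overlap the
        # new AWS addressing (the name and subnet here are assumptions).
        run("docker", "network", "create", "--subnet", "172.30.0.0/16", "tis_default")
        # Bring the stack back up, assuming services are defined with docker-compose.
        run("docker-compose", "up", "-d")

    if __name__ == "__main__":
        recover_server()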

Detection

  • NDW ETL (Prod) failure alert on Slack

  • Reval / GMC sync ETLs failure alert on Slack
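
For context, the alerts above come from the ETLs posting to Slack on failure. A minimal sketch of that kind of alert, assuming a standard Slack incoming webhook (the URL is a placeholder, not the real integration):

    # Sketch only: post an ETL failure message to a Slack incoming webhook.
    import requests

    SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

    def alert_etl_failure(etl_name: str, detail: str) -> None:
        payload = {"text": f":red_circle: {etl_name} failed: {detail}"}
        response = requests.post(SLACK_WEBHOOK_URL, json=payload, timeout=10)
        response.raise_for_status()

    if __name__ == "__main__":
        alert_etl_failure("NDW ETL (Prod)", "could not start container")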

Actions

  • [insert actions to take to mitigate this happening in future]

  • e.g.

  • keep everything more up to date to avoid major impacts from upgrades in future

  • ensure one person is not a single point of failure - require code reviews for infrastructure changes

  • specific changes to the architecture (list them) to improve resilience:

    • Use of ‘serverless’ technology: ECS, RDS, DocumentDB

  • check that STAGE matches PROD when upgrading
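
A hedged sketch of the "STAGE matches PROD" check: compare the Docker engine version on both environments before upgrading, so drift is caught on stage rather than in production. The host names are hypothetical.

    # Sketch only: detect version drift between stage and prod Docker hosts.
    import subprocess

    HOSTS = {"stage": "tis-stage-blue", "prod": "tis-prod-blue"}  # hypothetical host names

    def docker_version(host: str) -> str:
        out = subprocess.run(
            ["ssh", host, "docker", "version", "--format", "{{.Server.Version}}"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()

    if __name__ == "__main__":
        versions = {env: docker_version(host) for env, host in HOSTS.items()}
        print(versions)
        if len(set(versions.values())) != 1:
            raise SystemExit("Environments have drifted - upgrade STAGE first and re-check.")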

Lessons Learned (Good and Bad)

  • Good. The load-balanced infrastructure works and we were able to keep TIS mostly working while we performed the fixes.

  • Bad. More care needs to be taken with the commands being issued to the production servers.

    • Repeatable playbooks, applied to non-production servers first.

  • Bad. The incident highlighted where we could do with more redundancy in the architecture.

    • Load balancer health checks aren’t exhaustive… could they be extended to match Uptime Robot or similar monitoring? (A sketch of a deeper check follows this list.)

  • Good. This has been applied to the TIS “2.0” Infrastructure roadmap.
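
A sketch of the deeper health check suggested above, closer to what Uptime Robot does than a load balancer port check: hit each user-facing endpoint and fail loudly if any of them misbehave. The URL list is illustrative, not the real TIS monitoring configuration.

    # Sketch only: check a list of endpoints end-to-end rather than just a port.
    import requests

    ENDPOINTS = [
        "https://tis.example.nhs.uk/api/status",   # placeholder URLs
        "https://tis.example.nhs.uk/reval/health",
    ]

    def check_all():
        failures = []
        for url in ENDPOINTS:
            try:
                resp = requests.get(url, timeout=10)
                if resp.status_code != 200:
                    failures.append(f"{url} -> HTTP {resp.status_code}")
            except requests.RequestException as exc:
                failures.append(f"{url} -> {exc}")
        return failures

    if __name__ == "__main__":
        problems = check_all()
        if problems:
            raise SystemExit("Health check failures:\n" + "\n".join(problems))
        print("All endpoints healthy.")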
