Date

Authors

Status

Working on it

Summary

Impact

...

  • 09:48 AM

  • 10:25 AM - Created ticket and incident page https://hee-tis.atlassian.net/browse/TISNEW-5728

2020-11-18 Reval Legacy/Old GMC Sync

Root Causes

  • An accidental major version update to one of our core infrastructure tools caused a failure in a dependent tool. The containers that were already running were unaffected, but no new containers could launch, i.e. ETLs or newly deployed software versions.

Trigger

  • Accidentally upgrading, rather than updating, the servers that the TIS infrastructure runs on (a sketch of one way to guard against this follows).
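
One possible guard against this (a sketch only, not something currently in place) is to hold core packages at their installed version so that a routine package update cannot pull in a major upgrade by accident. This assumes Debian/Ubuntu servers managed with apt, where "update" only refreshes package lists and "upgrade" actually installs new versions; the package names below are also assumptions, as the page does not say which tool was upgraded.

    """Sketch: hold core packages so a routine update cannot become a major upgrade.

    Assumptions: Debian/Ubuntu servers managed with apt, and Docker packages as an
    example of "core infrastructure tools" -- the incident page does not name the
    tool that was actually upgraded.
    """
    import subprocess

    CORE_PACKAGES = ["docker-ce", "docker-ce-cli", "containerd.io"]  # example names only

    def hold(packages: list[str]) -> None:
        """Pin packages at their installed version until deliberately unheld."""
        subprocess.run(["sudo", "apt-mark", "hold", *packages], check=True)

    def unhold(packages: list[str]) -> None:
        """Release the pin for a planned, reviewed upgrade."""
        subprocess.run(["sudo", "apt-mark", "unhold", *packages], check=True)

    if __name__ == "__main__":
        hold(CORE_PACKAGES)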

Resolution

  • Remove the server to be worked on from the load balancer so that all inbound TIS traffic is diverted to the working server.

  • Stop all Docker containers.

  • Restart the server so that all of the upgrades/updates apply correctly.

  • Remove the old containers.

  • Remove the Docker networks associated with each of those containers.

  • Apply network fixes to move the network range to a non-overlapping range that was not in use by the new AWS infrastructure.

  • Restart all of the containers in sequence. This had to be done in series because, while this process was happening (approximately 1 hour per server), the other server in our infrastructure was keeping TIS up for everyone else.

  • Check all services are up and working.

  • Allow the load balancer to send traffic to the fixed server.

  • Rinse and repeat with the second server (the per-server steps are sketched below).
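
For illustration only, the per-server sequence above could be captured in a repeatable script along these lines. The Docker CLI calls are standard commands; the load balancer drain/restore, the reboot, and the container redeploy steps are left as comments because they depend on the specific load balancer and deployment tooling, which are not described on this page.

    #!/usr/bin/env python3
    """Illustrative per-server recovery script for the sequence described above.

    Assumptions (not from the incident page): the Docker CLI is available on the
    server, and the load balancer and redeploy steps are handled by other tooling,
    so they appear here only as comments.
    """
    import subprocess

    def sh(*cmd: str) -> str:
        """Run a command, echo it, and return its stdout."""
        print("$", " ".join(cmd))
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        return result.stdout

    def repair_server() -> None:
        # 0. Drain this server in the load balancer first, so the other server
        #    takes all inbound TIS traffic (depends on the load balancer in use).

        # 1. Stop every running container.
        running = sh("docker", "ps", "-q").split()
        if running:
            sh("docker", "stop", *running)

        # 2. Reboot so the pending upgrades/updates apply cleanly, then re-run
        #    the remaining steps (the reboot itself is left out of this sketch).

        # 3. Remove the old containers and the Docker networks tied to them.
        sh("docker", "container", "prune", "-f")
        sh("docker", "network", "prune", "-f")

        # 4. Network fix: Docker's default address pools can be moved to a range
        #    that does not overlap the new AWS infrastructure via
        #    "default-address-pools" in /etc/docker/daemon.json plus a daemon
        #    restart (comment only -- the exact range is environment-specific).

        # 5. Restart the containers in sequence using the normal deployment
        #    scripts, then confirm everything is up before re-enabling traffic.
        print(sh("docker", "ps"))

    if __name__ == "__main__":
        repair_server()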

Detection

  • NDW ETL (Prod) failure alert on Slack

  • Reval / GMC sync ETLs failure alert on Slack (this kind of alert is sketched below)
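
Both alerts arrived via Slack. As a rough sketch of that kind of alert (not the actual TIS alerting code), an ETL job can post its failure to a Slack incoming webhook; the webhook URL and message wording below are placeholders.

    """Rough sketch of an ETL failure alert posted to Slack.

    The webhook URL is a placeholder: Slack incoming webhooks are configured per
    workspace/channel, and this is not the real TIS alerting code.
    """
    import requests

    SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

    def alert_failure(job_name: str, error: str) -> None:
        """Post a short failure notification to the monitoring channel."""
        payload = {"text": f":rotating_light: {job_name} failed: {error}"}
        response = requests.post(SLACK_WEBHOOK_URL, json=payload, timeout=10)
        response.raise_for_status()

    if __name__ == "__main__":
        try:
            raise RuntimeError("no new containers could be launched")  # simulated failure
        except RuntimeError as exc:
            alert_failure("NDW ETL (Prod)", str(exc))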

Actions

[insert actions to take to mitigate this happening in future]

e.g.

  • Keep everything more up to date to avoid the major impact of upgrades in future (see the sketch after this list).

  • Ensure one person is not a single point of failure - require code reviews for infrastructure changes.

  • Specific changes to the architecture (list them) to improve resilience:

    • Use of ‘serverless’ technology: ECS, RDS, DocumentDB
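
In support of the "keep everything more up to date" action, one possible aid (a sketch assuming Debian/Ubuntu servers with apt; not existing tooling) is a small report of pending package upgrades, so updates can be applied little and often on a schedule rather than arriving as one surprise major jump.

    """Sketch: report packages with pending upgrades on a Debian/Ubuntu host.

    Assumes apt is the package manager (an assumption, not stated on this page).
    Running this regularly makes it easier to apply small updates often instead
    of being surprised by a large, breaking upgrade.
    """
    import subprocess

    def upgradable_packages() -> list[str]:
        """Return the names of packages that have an upgrade available."""
        result = subprocess.run(
            ["apt", "list", "--upgradable"],
            capture_output=True, text=True, check=True,
        )
        packages = []
        for line in result.stdout.splitlines():
            # Lines look like: "docker-ce/focal 5:24.0.0-1 amd64 [upgradable from: ...]"
            if "upgradable" in line:
                packages.append(line.split("/", 1)[0])
        return packages

    if __name__ == "__main__":
        pending = upgradable_packages()
        if pending:
            print(f"{len(pending)} package(s) have pending upgrades: {', '.join(pending)}")
        else:
            print("Everything is up to date.")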

...

Trigger

  • User reported in Teams Support Channel

Resolution

Detection

  • User reported in Teams Support Channel

Actions

Lessons Learned (Good and Bad)

  • Good. The load-balanced infrastructure works and we were able to keep TIS mostly working while we performed the fixes.

  • Bad. More care needs to be taken with the commands being issued to the production server.

    • Repeatable playbooks, applied to non-production servers first.

  • Bad. This highlighted where we could do with more redundancy in the architecture.

    • Load balancer health checks aren’t exhaustive… could they be extended to match Uptime Robot or similar monitoring? (A sketch follows below.)

  • Good. This has been applied to the TIS “2.0” Infrastructure roadmap.
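
As a rough illustration of extending the health checks towards what Uptime Robot covers, an external probe could check each public-facing service endpoint and report anything unhealthy. The endpoint URLs below are placeholders, not the real TIS addresses.

    """Sketch: probe several service endpoints, Uptime-Robot style.

    The endpoint URLs are placeholders; the real TIS service addresses and any
    extra load balancer health check rules would need to be substituted.
    """
    import requests

    # Placeholder endpoints -- not the real TIS URLs.
    ENDPOINTS = {
        "frontend": "https://tis.example.org/",
        "api": "https://tis.example.org/api/status",
        "reval": "https://tis.example.org/reval/status",
    }

    def check_all(timeout: float = 5.0) -> dict[str, bool]:
        """Return a map of service name -> healthy (HTTP 2xx within the timeout)."""
        results = {}
        for name, url in ENDPOINTS.items():
            try:
                response = requests.get(url, timeout=timeout)
                results[name] = response.ok
            except requests.RequestException:
                results[name] = False
        return results

    if __name__ == "__main__":
        for name, healthy in check_all().items():
            print(f"{name}: {'UP' if healthy else 'DOWN'}")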