Date

Authors

Andy Nash, Joseph (Pepe) Kelly, John Simmons

Status

Live Defect: Complete.

Summary

On Friday morning we received a monitoring alert that the NDW ETL (Prod) had failed overnight. Following an initial RCA, we discovered issues with the Blue server (one half of the load-balanced application, to which roughly 50% of traffic is directed).

Impact

The NDW was not updated in Prod, roughly half of our users could not access TIS, bulk upload was not functioning, and (current) Reval was not functioning until the ETLs could be run.

...

  • NDW ETL (Prod) failure alert on Slack

  • Reval / GMC sync ETLs failure alert on Slack (see the alert sketch below)
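
These alerts arrive in Slack. As a rough illustration only (the webhook URL, job name, and message format below are placeholders, not the actual monitoring code), a minimal Python sketch of posting such an alert via a Slack incoming webhook:

    # Sketch: post an ETL failure alert to a Slack channel via an incoming webhook.
    # The webhook URL and job name are placeholders, not the real configuration.
    import requests

    SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

    def alert_etl_failure(job_name: str, detail: str) -> None:
        """Post a short failure message to the monitoring Slack channel."""
        payload = {"text": f":rotating_light: {job_name} failed: {detail}"}
        response = requests.post(SLACK_WEBHOOK_URL, json=payload, timeout=10)
        response.raise_for_status()

    if __name__ == "__main__":
        alert_etl_failure("NDW ETL (Prod)", "overnight run did not complete")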

Actions

  • [insert actions to take to mitigate this happening in future]

  • e.g.

  • keep software versions more up to date to reduce the impact of major upgrades in future

  • ensure one person is not a single point of failure: require code reviews for infrastructure changes

  • specific changes to the architecture (list them) to improve resilience:

    • Use of ‘serverless’ technology: ECS, RDS, DocumentDB

    • leverage the AWS fault-tolerant infrastructure

    • decommission the old Reval as soon as possible

  • check that STAGE matches the PROD upgrade (completed; a comparison sketch follows below)
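
As an illustration of the last action above (the hostnames and the /version endpoint are assumptions, not the real services), a small Python sketch that checks whether STAGE and PROD report the same application version:

    # Sketch: confirm STAGE and PROD report the same application version.
    # The hostnames and the /version endpoint are assumptions for illustration only.
    import sys
    import requests

    ENVIRONMENTS = {
        "STAGE": "https://stage.example.com/version",
        "PROD": "https://prod.example.com/version",
    }

    def fetch_version(url: str) -> str:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        return response.text.strip()

    def main() -> int:
        versions = {env: fetch_version(url) for env, url in ENVIRONMENTS.items()}
        print(versions)
        # Exit non-zero if the environments have drifted apart.
        return 0 if len(set(versions.values())) == 1 else 1

    if __name__ == "__main__":
        sys.exit(main())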

Lessons Learned (Good and Bad)

  • Good. The load-balanced infrastructure works and we were able to keep TIS mostly working while we performed the fixes.

  • Bad. More care needs to be taken with commands issued to the production server.

    • Repeatable playbooks, applied to non-production servers first.

  • Bad. This highlighted where we could do with more redundancy in the architecture.

    • Load balancer health checks aren’t exhaustive… could they be extended to match Uptime Robot or similar monitoring? (See the health-check sketch after this list.)

    • more granular health checks are required (this will come with ECS)

  • Good. This has been applied to the TIS “2.0” Infrastructure roadmap.
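
A minimal sketch of the kind of more granular health check discussed above, assuming a Flask service; the dependency names, hosts, and ports are illustrative placeholders rather than the actual TIS configuration. The load balancer would poll /health and take an instance out of service when a downstream dependency is unreachable:

    # Sketch: a /health endpoint that also verifies downstream dependencies.
    # Dependency hosts and ports are placeholders, not the real TIS configuration.
    import socket

    from flask import Flask, jsonify

    app = Flask(__name__)

    DEPENDENCIES = {
        "database": ("db.internal.example", 5432),
        "message-broker": ("mq.internal.example", 5672),
    }

    def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
        """Return True if a TCP connection to host:port succeeds within the timeout."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    @app.route("/health")
    def health():
        results = {name: tcp_reachable(host, port) for name, (host, port) in DEPENDENCIES.items()}
        healthy = all(results.values())
        return jsonify({"healthy": healthy, "dependencies": results}), (200 if healthy else 503)

    if __name__ == "__main__":
        app.run(port=8080)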