Date |
Authors | Andy Nash (Unlicensed), Joseph (Pepe) Kelly, John Simmons (Deactivated) |
Status | LiveDefect: Complete.
Summary | On Friday morning we received a monitoring alert that the NDW ETL (Prod) had failed overnight. Following an initial RCA, we discovered issues with the Blue server (one half of the load-balanced application, to which roughly 50% of traffic is directed).
Impact | NDW not updated in Prod, roughly half our users could not access TIS, bulk upload not functioning, (current) Reval not functioning until the ETLs could be run.
...
NDW ETL (Prod) failure alert on Slack
Reval / GMC sync ETLs failure alert on Slack
Actions
[insert actions to take to mitigate this happening in future]
e.g.
keep systems and dependencies up to date to reduce the impact of major upgrades in future
ensure one person is not a single point of failure: require code reviews for infrastructure changes
specific changes to the architecture to improve resilience (an illustrative sketch follows this list):
Use of ‘serverless’ technology: ECS, RDS, DocumentDB
leverage the AWS fault-tolerant infrastructure
decommission old Reval as soon as possible
check that STAGE matches the PROD upgrade (completed)
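To illustrate the 'serverless' and fault-tolerant items above, here is a minimal sketch (using boto3) of running the application as an ECS Fargate service behind a load balancer across two availability zones. All identifiers (cluster, task definition, subnets, target group ARN) are hypothetical placeholders, not our actual infrastructure.

```python
# Minimal sketch: run the application as an ECS (Fargate) service spread
# across two availability zones behind an ALB target group, so losing one
# server/AZ does not take out half the traffic. All identifiers below
# (cluster, task definition, subnets, ARNs) are hypothetical placeholders.
import boto3

ecs = boto3.client("ecs", region_name="eu-west-2")

ecs.create_service(
    cluster="tis-prod",                      # hypothetical cluster name
    serviceName="tis-app",
    taskDefinition="tis-app:1",              # hypothetical task definition
    desiredCount=2,                          # at least one task per AZ
    launchType="FARGATE",
    loadBalancers=[{
        "targetGroupArn": "arn:aws:elasticloadbalancing:...:targetgroup/tis-app/abc123",
        "containerName": "tis-app",
        "containerPort": 8080,
    }],
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-aaa", "subnet-bbb"],   # one subnet per AZ
            "securityGroups": ["sg-0123456789abcdef0"],
            "assignPublicIp": "DISABLED",
        }
    },
)
```

Because ECS replaces unhealthy tasks automatically, a failure like the Blue server issue would be recovered by the platform rather than by manual intervention on a single box.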
Lessons Learned (Good and Bad)
Good. The load-balanced infrastructure works and we were able to keep TIS mostly working while we performed the fixes.
Bad. More care needs to be taken with commands issued to the production servers.
Repeatable playbooks, applied to non-production servers first.
Bad. The incident highlighted where we could do with more redundancy in the architecture.
Load balancer health checks aren’t exhaustive… could they be extended to match Uptime Robot or similar monitoring?
more granular health checks are required (this will come with ECS); see the sketch at the end of this section
Good. This has been applied to the TIS “2.0” Infrastructure roadmap.
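As a rough illustration of the "more granular health checks" item above, the sketch below (boto3 again) points an ALB target group at an application-level /health endpoint and tightens the thresholds. The ARN, path and threshold values are assumptions for illustration only, not our current configuration.

```python
# Sketch: tighten an ALB target group health check so the load balancer
# probes an application-level /health endpoint rather than only checking
# that the port answers. The ARN, path and threshold values below are
# hypothetical placeholders, not the real TIS configuration.
import boto3

elbv2 = boto3.client("elbv2", region_name="eu-west-2")

elbv2.modify_target_group(
    TargetGroupArn="arn:aws:elasticloadbalancing:...:targetgroup/tis-app/abc123",
    HealthCheckProtocol="HTTP",
    HealthCheckPath="/health",            # application-level check, not just TCP
    HealthCheckIntervalSeconds=30,
    HealthCheckTimeoutSeconds=5,
    HealthyThresholdCount=2,
    UnhealthyThresholdCount=3,
    Matcher={"HttpCode": "200"},
)
```

The same /health endpoint could then be polled by Uptime Robot, so the load balancer and external monitoring agree on what "healthy" means.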