Date

Authors

Andy Nash, Joseph (Pepe) Kelly, John Simmons

Status

Live Defect: Complete.

Summary

On Friday morning we received a monitoring alert that the NDW ETL (Prod) had failed overnight. Following an initial RCA, we discovered issues with the Blue server (one half of the load-balanced application, to which roughly 50% of traffic is directed).

Impact

The NDW was not updated in Prod, roughly half of our users could not access TIS, bulk upload was not functioning, and (current) Reval was not functioning until the ETLs could be run.

...

  • NDW ETL (Prod) failure alert on Slack

  • Reval / GMC sync ETLs failure alert on Slack (see the alert sketch below)
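
These alerts arrive in Slack. As a rough illustration only (the webhook URL, job name, and message format below are placeholders, not the actual monitoring code), a minimal Python sketch of posting such an alert via a Slack incoming webhook:

    # Sketch: post an ETL failure alert to a Slack channel via an incoming webhook.
    # The webhook URL and job name are placeholders, not the real configuration.
    import requests

    SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

    def alert_etl_failure(job_name: str, detail: str) -> None:
        """Post a short failure message to the monitoring Slack channel."""
        payload = {"text": f":rotating_light: {job_name} failed: {detail}"}
        response = requests.post(SLACK_WEBHOOK_URL, json=payload, timeout=10)
        response.raise_for_status()

    if __name__ == "__main__":
        alert_etl_failure("NDW ETL (Prod)", "overnight run did not complete")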

Actions

  • [insert actions to take to mitigate this happening in future]

  • e.g.

  • keep software versions more up to date to reduce the impact of major upgrades in future

  • ensure one person is not a single point of failure: require code reviews for infrastructure changes

  • specific changes to the architecture (list them) to improve resilience:

    • Use of ‘serverless’ technology: ECS, RDS, DocumentDB

    • leverage the AWS fault-tolerant infrastructure

    • decommission the old Reval as soon as possible

  • check that STAGE matches the PROD upgrade (completed; a comparison sketch follows below)
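
As an illustration of the last action above (the hostnames and the /version endpoint are assumptions, not the real services), a small Python sketch that checks whether STAGE and PROD report the same application version:

    # Sketch: confirm STAGE and PROD report the same application version.
    # The hostnames and the /version endpoint are assumptions for illustration only.
    import sys
    import requests

    ENVIRONMENTS = {
        "STAGE": "https://stage.example.com/version",
        "PROD": "https://prod.example.com/version",
    }

    def fetch_version(url: str) -> str:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        return response.text.strip()

    def main() -> int:
        versions = {env: fetch_version(url) for env, url in ENVIRONMENTS.items()}
        print(versions)
        # Exit non-zero if the environments have drifted apart.
        return 0 if len(set(versions.values())) == 1 else 1

    if __name__ == "__main__":
        sys.exit(main())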

Lessons Learned (Good and Bad)

  • Good. The load-balanced infrastructure works and we were able to keep TIS mostly working while we performed the fixes.

  • Bad. More care needs to be taken with commands issued to the production server.

    • Repeatable playbooks, applied to non-production servers first.

  • Bad. This highlighted where we could do with more redundancy in the architecture.

    • Load balancer health checks aren’t exhaustive… could they be extended to match Uptime Robot or similar monitoring? (See the health-check sketch after this list.)

    • more granular health checks are required (this will come with ECS)

  • Good. This has been applied to the TIS “2.0” Infrastructure roadmap.
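
A minimal sketch of the kind of more granular health check discussed above, assuming a Flask service; the dependency names, hosts, and ports are illustrative placeholders rather than the actual TIS configuration. The load balancer would poll /health and take an instance out of service when a downstream dependency is unreachable:

    # Sketch: a /health endpoint that also verifies downstream dependencies.
    # Dependency hosts and ports are placeholders, not the real TIS configuration.
    import socket

    from flask import Flask, jsonify

    app = Flask(__name__)

    DEPENDENCIES = {
        "database": ("db.internal.example", 5432),
        "message-broker": ("mq.internal.example", 5672),
    }

    def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
        """Return True if a TCP connection to host:port succeeds within the timeout."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    @app.route("/health")
    def health():
        results = {name: tcp_reachable(host, port) for name, (host, port) in DEPENDENCIES.items()}
        healthy = all(results.values())
        return jsonify({"healthy": healthy, "dependencies": results}), (200 if healthy else 503)

    if __name__ == "__main__":
        app.run(port=8080)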