Date

Authors

Andy Dingley, Edward Barclay, John Simmons

Status

Documentation: Done

Summary

The TIS Self-Service (TSS) site was inaccessible for a total of around three hours, split over two periods

Impact

PGDiTs were unable to access/use the TSS site for a total of 3 hours split over two periods

...

During the reconfiguration, separate issues were experienced on both the test and live sites, which led to downtime on both

  • Test: ~1 hour

  • Live: 2.5 hours, followed by a second 30-minute downtime

...

Trigger

  • Manual changes made to S3, CloudFront, and Route53 configurations

...

Resolution

  • The S3 bucket and CloudFront configurations were updated to stable values, as sketched below
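
For illustration only, a minimal Terraform sketch of the kind of stable CloudFront-to-S3 wiring the fix restored. All names here (bucket, origin id, resource labels) are hypothetical, not the real TSS values.

    # Hypothetical names throughout; the real TSS resources differ.
    resource "aws_s3_bucket" "tss_prod" {
      bucket = "tss-prod-site" # assumed bucket name
    }

    resource "aws_cloudfront_distribution" "tss" {
      enabled             = true
      default_root_object = "index.html"

      origin {
        # The "stable value": the bucket that actually holds the prod build.
        domain_name = aws_s3_bucket.tss_prod.bucket_regional_domain_name
        origin_id   = "tss-prod-s3"
      }

      default_cache_behavior {
        allowed_methods        = ["GET", "HEAD"]
        cached_methods         = ["GET", "HEAD"]
        target_origin_id       = "tss-prod-s3"
        viewer_protocol_policy = "redirect-to-https"

        forwarded_values {
          query_string = false
          cookies {
            forward = "none"
          }
        }
      }

      restrictions {
        geo_restriction {
          restriction_type = "none"
        }
      }

      viewer_certificate {
        cloudfront_default_certificate = true
      }
    }

With this wiring held in Terraform, "stable values" are whatever was last applied, so console drift can be spotted and reverted with a plan/apply cycle instead of further manual edits.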

...

  • 11:32 - Work begins on URL redirection

  • 11:35 - Route53 DNS changes made

  • 11:40 - Multiple CloudFront changes made

  • 11:45 - First notification from Uptime Robot about production downtime (11-minute duration)

  • 12:19 - Another manual CloudFront deployment

  • 12:23 - Second notification from Uptime Robot about production downtime (2-minute duration)

  • 12:25 - Manual fix of CloudFront origin location and permissions

  • 12:29 - Last notification from Uptime Robot about production downtime

  • 13:00 - Uptime Robot silenced

  • 13:34 - Stage environment fixed by adding missing S3 bucket permissions and pointing CloudFront at the correct bucket

  • ??:?? - Production environment fixed by reverting the changes made in the AWS console

  • 15:30 - Production deployment of tis-trainee-ui takes the site down again due to an S3 bucket mismatch in the CloudFront config

  • 16:05 - Production brought back up by updating the S3 bucket used by CloudFront

  • 17:19 - Production changes re-applied successfully using Terraform

...

  • Why did the live site go down?

    • Changes were made to Route53 and CloudFront to point them at the new location, but the new locations were configured incorrectly

  • Why was the CloudFront configuration wrong?

    • The CloudFront origin was manually changed because it appeared to be pointing at the old bucket/file location

  • Why was the CloudFront origin changed?

    • It was assumed that the old S3 location was no longer supposed to be used, and that a new bucket had been created with the same prod files in it so that all of the new infrastructure was named consistently

  • Why did the new S3 location not work?

    • The site became active again, but we found that, due to a misconfiguration, the new bucket that CloudFront was pointing at held stage content

  • Why was the service manually amended so many times?

    • A cascade of manual changes was made, with no Terraform configuration to revert to, which had to be resolved before the site would work again (see the sketch below)
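
As a hedged sketch of why that matters, assuming a hypothetical hosted zone and domain: with the Route53 alias and the CloudFront distribution both under Terraform, a bad manual console change can be undone by re-applying the last known-good config.

    # Hypothetical zone/domain values, shown only to illustrate keeping the
    # Route53 -> CloudFront pointing under Terraform control.
    variable "zone_id" {
      description = "Hosted zone for the TSS domain (assumed)"
      type        = string
    }

    resource "aws_route53_record" "tss" {
      zone_id = var.zone_id
      name    = "trainee.example.com" # hypothetical domain
      type    = "A"

      alias {
        # Points the DNS name at the distribution from the earlier sketch.
        name                   = aws_cloudfront_distribution.tss.domain_name
        zone_id                = aws_cloudfront_distribution.tss.hosted_zone_id
        evaluate_target_health = false
      }
    }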

Live Site Downtime 2

  • Why did the live site go down again?

    • CloudFront was using the original prod bucket, while GitHub Actions (GHA) was deploying to the new bucket. When the deployment updated the origin path (app version) used by CloudFront, it set a version that had never been built in the old bucket (see the sketch after this list)

  • Why did the origin bucket not have the latest app version?

    • The UI deployment had been broken since August 4th, so the latest app version was not built until the workflow was fixed for the new bucket. The next successful build was a manually triggered build of main, so the fact that the app version had been incremented during the workflow failure was missed

  • Why was the deployment workflow broken?

    • GHA secrets had been set to incorrect values for the S3 buckets

  • Why was the deployment workflow issue not noticed and resolved?

    • ???
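
A minimal sketch of the coupling behind this second outage, again with hypothetical names: CloudFront fetches <bucket> + origin_path + <object>, so the deploy job and the distribution must agree on both the bucket and the version path.

    # Hypothetical illustration of the bucket/version coupling. If the deploy
    # job uploads the build to one bucket while the distribution reads from
    # another, origin_path can name an app version never built there.
    variable "app_version" {
      description = "Version folder the UI build is uploaded to, e.g. 1.2.3"
      type        = string
    }

    locals {
      # Must match the prefix the GHA deploy job actually uploaded to.
      origin_path = "/${var.app_version}"
    }

Driving both the upload prefix and the distribution's origin_path from this single value (and a single bucket reference) would make the mismatch seen here much harder to reintroduce.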

...

Action Items

Owner

...

Lessons Learned

  • Before making any changes, make sure that Terraform config exists and does what is expected (except for Route53 configs).

  • When things don't go as expected, revert the single change back to a working configuration and analyse what went wrong, rather than continuing to make more changes