Documentation

Date	08 Aug 2023
Authors	Andy Dingley Edward Barclay John Simmons (Deactivated)
Status	Done
Summary	The TIS Self-Service site was inaccessible for a period of time
Impact	PGDiTs were unable to access/use the TSS site for a total of 3 hours split over two periods

During the reconfiguration, separate issues were experienced on both the test and live sites, which led to downtime on both.

...

Trigger

...

Resolution

The S3 bucket and CloudFront configurations were updated to stable values

...

08 Aug 2023: 11:32 - Work begins on URL redirection
08 Aug 2023: 11:35 - Route53 DNS changes made
08 Aug 2023: 11:40 - Multiple CloudFront changes made
08 Aug 2023: 11:45 - First notification from Uptime Robot about production downtime (11 minutes duration)
08 Aug 2023: 12:19 - Another manual CloudFront deployment
08 Aug 2023: 12:23 - Second notification from Uptime Robot about production downtime (2 minutes duration)
08 Aug 2023: 12:25 - Manual fix of CloudFront origin location and permissions
08 Aug 2023: 12:29 - Last notification from Uptime Robot about production downtime
08 Aug 2023: 13:00 - Uptime Robot silenced
08 Aug 2023: 13:34 - Stage environment fixed by adding missing S3 bucket permissions and pointing CloudFront at the correct bucket
08 Aug 2023: ??:?? - Production environment fixed by reverting the changes made in the AWS console
08 Aug 2023: 15:30 - Production deployment of tis-trainee-ui takes the site down again due to S3 bucket mismatch in CloudFront config
08 Aug 2023: 16:05 - Production brought back up by updating the S3 bucket used by CloudFront
08 Aug 2023: 17:19 - Production changes re-applied successfully using Terraform

...

Why did the live site go down?
- Changes were made to Route53 and CloudFront to point them at the new location, but the new locations were configured incorrectly.
Why was the CloudFront configuration wrong?
- The CloudFront origin was changed manually because it appeared to be pointing at the old bucket/file location.
Why was the CloudFront origin changed?
- It was assumed that the old S3 location was no longer supposed to be used, and that a new bucket had been created with the same prod files in it so that all of the new infrastructure was named the same way.
Why did the new S3 location not work?
- The site became active again, but due to a misconfiguration the new bucket that CloudFront was pointing at held stage information.
Why was the service manually amended so many times?
- A cascade of manual changes was made with no Terraform configuration to revert to, and these changes had to be resolved before the site would work again.

Why did the live site go down again?
- CloudFront was using the original prod bucket, while GHA was deploying to the new bucket. When the deployment updated the origin path (app version) used by CloudFront, it set a version that had never been built in the old bucket.
Why did the origin bucket not have the latest app version?
- The UI deployment was broken on August 4th, so the latest app version was not built until the workflow was fixed for the new bucket. The next successful build was a manually triggered build of main, so the fact that the app version had been incremented during the workflow failures was missed.
Why was the deployment workflow broken?
- GHA secrets had been set to incorrect values for the S3 buckets.
Why was the deployment workflow issue not noticed and resolved?
- ???
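
The mismatch described above (CloudFront reading an origin path that was never built in the bucket it points at) could be caught by a pre-deployment check. The following is a minimal sketch, not the team's actual tooling. It assumes the distribution config has the shape returned by the CloudFront GetDistributionConfig API (Origins > Items, each with DomainName and OriginPath) and that the bucket key listings have already been fetched separately. The function name, bucket names and version numbers are hypothetical.

```python
def missing_origin_paths(distribution_config, keys_by_origin_domain):
    """Return (domain, origin_path) pairs whose origin path has no objects.

    distribution_config: dict shaped like a CloudFront DistributionConfig.
    keys_by_origin_domain: hypothetical mapping of origin domain name to the
        list of object keys found in that bucket.
    """
    missing = []
    for origin in distribution_config["Origins"]["Items"]:
        domain = origin["DomainName"]
        origin_path = origin.get("OriginPath", "")
        keys = keys_by_origin_domain.get(domain, [])
        # CloudFront origin paths start with "/", S3 object keys do not.
        prefix = origin_path.strip("/")
        if prefix:
            found = any(key.startswith(prefix + "/") for key in keys)
        else:
            # No origin path: the bucket only needs to contain something.
            found = bool(keys)
        if not found:
            missing.append((domain, origin_path))
    return missing


# Illustrative data only: the distribution still points at the old bucket,
# but the new app version was deployed to the new bucket.
example_config = {
    "Origins": {
        "Items": [
            {
                "Id": "ui",
                "DomainName": "old-prod-bucket.s3.amazonaws.com",
                "OriginPath": "/1.2.4",
            }
        ]
    }
}
example_keys = {
    "old-prod-bucket.s3.amazonaws.com": ["1.2.3/index.html"],
    "new-prod-bucket.s3.amazonaws.com": ["1.2.4/index.html"],
}

if __name__ == "__main__":
    for domain, path in missing_origin_paths(example_config, example_keys):
        print(f"Origin {domain} has no objects under {path!r}")
```

A deployment workflow could run a check like this before updating the origin path and fail the job if the returned list is not empty.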

...

...

Before making any changes, make sure that Terraform config exists and does what is expected (except for Route53 configs).
When things don't go as expected, revert a single change back to a working configuration and analyse what went wrong, rather than continuing to make changes.
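
One way to confirm that the Terraform config matches the live infrastructure before touching anything is to run a plan and inspect its exit code. The sketch below assumes the terraform CLI is installed and the working directory is already initialised. It relies on the -detailed-exitcode flag of terraform plan, where 0 means no changes, 2 means changes are pending, and 1 means the plan failed. The function names and the status labels are hypothetical.

```python
import subprocess


def interpret_plan_exit_code(code):
    """Map a `terraform plan -detailed-exitcode` exit code to a status label."""
    if code == 0:
        return "in-sync"  # config and live infrastructure agree
    if code == 2:
        return "drift"    # the plan would change something
    return "error"        # the plan itself failed


def check_drift(workdir):
    """Run a plan in workdir (assumed already initialised) and report the status."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return interpret_plan_exit_code(result.returncode)
```

If the status is "drift" before any work has started, manual console changes have probably been made, and reverting to the Terraform-managed state may not restore what is currently running.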