
Date

Authors

Andy Dingley, Edward Barclay, John Simmons (Deactivated)

Status

Documentation

Summary

The TIS Self-Service site was inaccessible over two periods of time

Impact

PGDiTs were unable to access or use the TSS site for a total of 3 hours, split over two periods

Non-technical Description

The TSS team were applying changes to the URLs used to access the TSS site. This change was needed due to ongoing confusion about which URL should be used for test and live scenarios. The changes were as follows:

  1. A new URL for the test site https://stage.trainee.tis.nhs.uk

  2. The existing https://trainee.tis.nhs.uk URL modified to direct to the live site as the new primary address

  3. The existing https://trainee.tis-selfservice.nhs.uk URL modified to redirect to https://trainee.tis.nhs.uk with a message to update bookmarks etc.
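The redirect in step 3 amounts to a permanent (HTTP 301) host rewrite from the legacy address to the new primary one. A minimal Python sketch of the intended behaviour — illustrative only, since in practice this lives in the Cloudfront/S3 configuration rather than application code, and the function name is hypothetical:

```python
# Sketch of the legacy-URL redirect (step 3 above). Hypothetical helper,
# not the actual Cloudfront configuration.

OLD_HOST = "trainee.tis-selfservice.nhs.uk"
NEW_HOST = "trainee.tis.nhs.uk"

def redirect_legacy_url(url: str) -> tuple[int, str]:
    """Return (status, location) for a request to the given URL."""
    if OLD_HOST in url:
        # 301 signals a permanent move, prompting users to update bookmarks.
        return 301, url.replace(OLD_HOST, NEW_HOST)
    return 200, url  # already on the primary address

print(redirect_legacy_url("https://trainee.tis-selfservice.nhs.uk/home"))
# → (301, 'https://trainee.tis.nhs.uk/home')
```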

During the reconfiguration, separate issues were experienced on both the test and live sites, which led to downtime on both:

  • Test: ??? hours

  • Live: 2.5 hours, followed by a second 30-minute downtime


Trigger

  • Changes made to S3, Cloudfront and Route53 configurations


Detection

  • Uptime robot alerts

  • TSS team testing during modifications

  • Reports from LOs and PGDiTs


Resolution

  • The S3 bucket and Cloudfront configurations were updated to stable values


Timeline

All times in BST unless indicated

  • 11:32 - Work begins on URL redirection

  • ??:?? - ??? manual changes?

  • 11:45 - First notification from Uptime Robot about production downtime (11 minutes duration)

  • ??:?? - ??? manual changes?

  • 12:23 - Second notification from Uptime Robot about production downtime (2 minutes duration)

  • ??:?? - ??? manual changes?

  • 12:29 - Last notification from Uptime Robot about production downtime

  • 13:00 - Uptime Robot silenced

  • 13:34 - Stage environment fixed by adding missing S3 bucket permissions and pointing Cloudfront at the correct bucket

  • ??:?? - Production environment fixed by reverting the changes made in the AWS console

  • 15:30 - Production deployment of tis-trainee-ui takes the site down again due to S3 bucket mismatch in Cloudfront config

  • 16:05 - Production brought back up by updating the S3 bucket used by Cloudfront

  • 17:19 - Production changes re-applied successfully using Terraform

Root Cause(s)

Test Site Downtime

  • Why did the test site go down?

    • S3 bucket permissions were misconfigured and did not allow Cloudfront to access the static website content.

  • Why were S3 bucket permissions misconfigured?

    • The bucket was created manually in the console, rather than using the existing Terraform config to make the changes (which would have included the required bucket config)

  • Why was Terraform not used?

    • ???
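For reference, the permission the manually created bucket was missing is the kind of statement a Terraform-managed bucket policy would normally carry: a grant letting the Cloudfront origin access identity read objects from the bucket. A minimal sketch of such a policy document in Python — all ARNs and names here are hypothetical placeholders, not the actual TSS values:

```python
import json

# Hypothetical ARNs — placeholders standing in for whatever the real
# Terraform config defines.
BUCKET_ARN = "arn:aws:s3:::example-trainee-ui-bucket"
OAI_ARN = "arn:aws:iam::cloudfront:user/CloudFront Origin Access Identity EXAMPLE"

# Without a statement like this, S3 returns 403s to Cloudfront and the
# static site appears down even though the content is in the bucket.
bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowCloudFrontRead",
            "Effect": "Allow",
            "Principal": {"AWS": OAI_ARN},
            "Action": "s3:GetObject",
            "Resource": f"{BUCKET_ARN}/*",
        }
    ],
}

print(json.dumps(bucket_policy, indent=2))
```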

Live Site Downtime 1

  • Why did the live site go down?

    • ???

Live Site Downtime 2

  • Why did the live site go down again?

    • Cloudfront was using the original prod bucket, while GHA was deploying to the new bucket. When deployment updated the origin path (app version) used by Cloudfront, it set a version that had never been built in the old bucket.

  • Why did the origin bucket not have the latest app version?

    • The UI deployment had been broken since August 4th, so the latest app version had not been built until the workflow was fixed for the new bucket. The next successful build was a manually triggered build of main, so the fact that the app version had been incremented during the workflow failure was missed.

  • Why was the deployment workflow broken?

    • GHA secrets had been set to incorrect values for the S3 buckets

  • Why was the deployment workflow issue not noticed and resolved?

    • ???
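The second live outage came down to a mismatch: Cloudfront was repointed to an origin path (app version) that had never been deployed to the bucket it was still reading from. A pre-flight check along these lines would have caught it — a sketch only, with a hypothetical helper name, not part of the actual deployment workflow:

```python
def origin_update_is_safe(deployed_versions: set[str], new_origin_path: str) -> bool:
    """Only repoint Cloudfront if the target version exists in the origin bucket."""
    version = new_origin_path.strip("/")
    return version in deployed_versions

# Versions actually built into the old (still-live) bucket:
old_bucket_versions = {"v1.18.0", "v1.19.0"}

# Deployment bumps the origin path to a version only the *new* bucket has:
print(origin_update_is_safe(old_bucket_versions, "/v1.20.0"))
# → False: applying this update would take the site down
```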


Action Items

Action Items

Owner


Lessons Learned
