Date

Authors

Joseph (Pepe) Kelly, Cai Willis

Status

Ongoing

Summary

Revalidation production login delay and logout redirecting to stage and localhost

Impact

An admin user was unable to submit a recommendation.

Non-technical Description

The login process for the revalidation application was intermittently failing as it was unable to retrieve environment information (Production/Staging/etc).

Our applications live in the cloud. This means that another company (in this case Amazon Web Services) manages the physical machines that run our applications and provides features that make them more robust and fault tolerant. The relevant feature here is “Availability Zones”: our application runs in multiple data centres simultaneously (and sometimes each Availability Zone is itself a group of data centres!). If one of these suffers even a catastrophic failure (e.g. fire, flooding or power loss), the instances of our application in the other zones keep running. AWS manages how traffic is split between the zones so that this duplication is invisible to the user.

Our applications are also split into “microservices”. Instead of a single program doing everything, we split functionality into lots of smaller applications that talk to each other to fulfil some wider purpose. These are easier to maintain and provide some fault tolerance (e.g. if one stops working, it doesn’t necessarily mean the whole of TIS stops working!).

The root cause of this issue was that a failure in a deployment left one of these Availability Zones without one of our applications, so when a user tried to log in and their request was routed to that Availability Zone, the process would fail.

Fixing this was as simple as redeploying the affected application (a zero downtime operation).


Trigger


Detection

  • A user reported issues with submitting a recommendation and that they didn’t seem able to log in to the revalidation application


Resolution

  • Provided a temporary workaround for the user (using an incognito window to avoid the cached request)

  • Forced redeployment of tasks in ECS for the tis-revalidation-core service
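
The forced redeployment above can also be scripted. The following is a minimal sketch, assuming the AWS SDK for Python (boto3) is used; `update_service` with `forceNewDeployment=True` is the ECS call that replaces running tasks without changing the task definition. The cluster name shown in the usage note is hypothetical, as the incident notes do not record it.

```python
def force_redeploy(ecs_client, cluster, service):
    """Ask ECS to start a fresh deployment of a service's tasks.

    ecs_client is expected to behave like boto3.client("ecs").
    The task definition is left unchanged; only the tasks are replaced.
    """
    response = ecs_client.update_service(
        cluster=cluster,
        service=service,
        forceNewDeployment=True,
    )
    # The response describes the service, including its active deployments.
    return response["service"]["deployments"]
```

With real credentials this might be called as `force_redeploy(boto3.client("ecs"), "example-cluster", "tis-revalidation-core")`, where `example-cluster` is a placeholder.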


Timeline

BST unless otherwise stated

  • 09:20 - Responding to a query about submitting recommendations, we noticed that the login process was very slow

  • 11:20 - User was able to complete intended task using workaround (incognito mode)

  • 14:28 - Ruled out an issue with the applications after inspection of logs - focussed on the request itself

  • 11:00 - Mobbing the problem led to the realisation that two tis-revalidation-core tasks were in the same AZ

  • 11:30 - Forced redeployment of tasks - issue resolved
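
The check that revealed two tis-revalidation-core tasks in the same AZ was done by hand. A sketch of how it could be automated follows, assuming boto3: `list_tasks` and `describe_tasks` are real ECS calls, and each described task carries an `availabilityZone` field. The cluster name and zone names used in any call are assumptions, and a service with more than 100 tasks would need pagination, which is omitted here.

```python
from collections import Counter


def tasks_per_zone(ecs_client, cluster, service):
    """Count the running tasks of a service in each Availability Zone.

    ecs_client is expected to behave like boto3.client("ecs").
    """
    task_arns = ecs_client.list_tasks(cluster=cluster, serviceName=service)["taskArns"]
    if not task_arns:
        # describe_tasks rejects an empty task list, so return early.
        return Counter()
    tasks = ecs_client.describe_tasks(cluster=cluster, tasks=task_arns)["tasks"]
    return Counter(task["availabilityZone"] for task in tasks)


def zones_without_tasks(counts, expected_zones):
    """Return the expected zones that have no task of the service."""
    return sorted(zone for zone in expected_zones if counts.get(zone, 0) == 0)
```

A non-empty result from `zones_without_tasks` would correspond to the situation found during this incident.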


Root Cause(s)

  • Requests for environment information to the tis-revalidation-core service were timing out

  • One of the AZs did not have a core service task running alongside an integration service task

  • When the application failed to retrieve the environment information, it defaulted to staging/localhost
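
The last root cause describes a silent fallback: a failed lookup produced staging/localhost values instead of an error. As an illustration only (this is not the application's actual code, and every name below is hypothetical), a lookup could retry a few times and then fail loudly rather than substitute a default:

```python
import time


class EnvironmentLookupError(RuntimeError):
    """Raised when the environment information cannot be retrieved."""


def resolve_environment(fetch_environment, attempts=3, delay_seconds=0.5, sleep=time.sleep):
    """Call fetch_environment, retrying on network errors.

    fetch_environment is a hypothetical callable that returns the environment
    name or raises an OSError subclass (e.g. TimeoutError, ConnectionError).
    No default value is substituted if every attempt fails.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch_environment()
        except OSError as error:
            last_error = error
            if attempt < attempts:
                sleep(delay_seconds)  # brief pause before the next attempt
    raise EnvironmentLookupError(
        f"environment lookup failed after {attempts} attempts"
    ) from last_error
```

Raising here surfaces the failure to the caller, which can then show an error instead of redirecting a production user to another environment.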


Action Items

Action Items

Owner

Set auto-scaling {with min/max the same value?}

Prod API Gateway needs to send metrics and logs on to CloudWatch

Monitor & Alert on something that indicates a problem, e.g. HTTP Errors/Latency in API Gateway?

TBD: Put the variable in a resource with wider availability (across availability zones)
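
For the monitoring and alerting item, one possible shape is a CloudWatch alarm on API Gateway latency. The sketch below assumes boto3 and a REST API, for which the `AWS/ApiGateway` namespace publishes a `Latency` metric with an `ApiName` dimension; the API name, threshold, and evaluation window are placeholders to be chosen by the team, and no notification target is configured.

```python
def latency_alarm_params(api_name, threshold_ms, alarm_name=None):
    """Build put_metric_alarm arguments for average API Gateway latency."""
    return {
        "AlarmName": alarm_name or f"{api_name}-high-latency",
        "Namespace": "AWS/ApiGateway",
        "MetricName": "Latency",
        "Dimensions": [{"Name": "ApiName", "Value": api_name}],
        "Statistic": "Average",
        "Period": 60,  # seconds per datapoint
        "EvaluationPeriods": 5,  # assumed window: five consecutive minutes
        "Threshold": float(threshold_ms),
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # no traffic is not treated as slow
    }


def create_latency_alarm(cloudwatch_client, api_name, threshold_ms):
    """Create or update the alarm; the client should behave like boto3.client("cloudwatch")."""
    cloudwatch_client.put_metric_alarm(**latency_alarm_params(api_name, threshold_ms))
```

A similar alarm on the `5XXError` metric would cover the HTTP errors mentioned in the same action item; an `AlarmActions` entry (for example an SNS topic) would be needed for anyone to be notified.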


Lessons Learned
