...

Non-technical Description

The login process for the revalidation application was intermittently failing as it was unable to retrieve environment information (Production/Staging/etc).

Our applications live in the cloud. This means that another company (in this case Amazon Web Services) manages the physical machines that run our applications and provides us with features that make our applications more robust and fault tolerant. The relevant feature here is “Availability Zones”: our application runs in multiple data centres simultaneously (sometimes each Availability Zone is itself a group of data centres!). If one of these suffers even a catastrophic failure (e.g. fire, flooding or power loss), the other instances of our application will still be running fine! AWS then manages how traffic is split between these zones so that this duplication is invisible to the user.

Our applications are also split into “microservices”. Instead of a single program doing everything, we split functionality into lots of smaller applications which talk to each other to fulfil some wider purpose. These are easier to maintain and provide some fault tolerance (e.g. if one stops working it doesn’t necessarily mean the whole of TIS stops working!).

The root cause of this issue is that a failure in a deployment left one of these Availability Zones without one of our applications, so when a user tried to log in and their request was routed to this Availability Zone, the process would fail!

Fixing this was as simple as redeploying the affected application (a zero downtime operation).

...

Trigger

...

Detection

  • A user reported issues with submitting a recommendation and that they didn’t seem able to log in to the revalidation application

...

Resolution

  • Provided a temporary workaround for the user (using incognito mode to prevent use of a cached request)

  • Forced redeployment of tasks in ECS for tis-revalidation-core service
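A forced redeployment like the one above can also be triggered from a script. The following is a minimal sketch, assuming a boto3 ECS client is passed in; the cluster name in the usage comment is hypothetical, and this is not necessarily how the redeployment was carried out during the incident.

```python
def force_redeploy(ecs_client, cluster: str, service: str) -> str:
    """Ask ECS to start a fresh deployment of an unchanged service.

    `ecs_client` is expected to be a boto3 ECS client, e.g. boto3.client("ecs").
    Returns the id of the new primary deployment.
    """
    response = ecs_client.update_service(
        cluster=cluster,
        service=service,
        # Replace the running tasks without changing the task definition.
        forceNewDeployment=True,
    )
    deployments = response["service"]["deployments"]
    # The deployment ECS is rolling out is marked PRIMARY.
    primary = next(d for d in deployments if d["status"] == "PRIMARY")
    return primary["id"]


# Usage (assumed cluster name, requires boto3 and AWS credentials):
#   import boto3
#   force_redeploy(boto3.client("ecs"), "example-cluster", "tis-revalidation-core")
```

Passing the client in as a parameter keeps the function easy to exercise with a stand-in object before pointing it at a real cluster.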

...

Timeline

BST unless otherwise stated

  • ?? 09:?? 20 - Responding to a query about submitting recommendations, we noticed that the login process was very slow

  • ??:?? - 11:20 - User was able to complete intended task using workaround (incognito mode)

  • 14:28 - Ruled out an issue with the applications after inspection of logs - focussed on the request itself

  • ??:?? - 11:00 - Mobbing the problem led to the realisation that two tis-revalidation-core tasks were in the same AZ

  • 11:30 - Forced redeployment of tasks - issue resolved
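The check that surfaced the problem during mobbing (comparing which AZs each service has tasks in) could be automated roughly as follows. This is a sketch only: it assumes task records shaped like the output of the ECS `describe_tasks` call (`group`, `availabilityZone`, `lastStatus`), and the zone names and the integration service name in the example data are hypothetical.

```python
from collections import defaultdict


def zones_missing_services(tasks, expected_services, zones):
    """Return {zone: sorted list of expected services with no running task there}."""
    running = defaultdict(set)
    for task in tasks:
        if task.get("lastStatus") != "RUNNING":
            continue
        # ECS sets group to "service:<service-name>" for tasks started by a service.
        group = task.get("group", "")
        if group.startswith("service:"):
            running[task["availabilityZone"]].add(group[len("service:"):])

    gaps = {}
    for zone in zones:
        missing = sorted(set(expected_services) - running[zone])
        if missing:
            gaps[zone] = missing
    return gaps


# Example data mirroring the incident: both core tasks landed in one AZ.
# The zone names and "tis-revalidation-integration" are assumed for illustration.
example_tasks = [
    {"group": "service:tis-revalidation-core", "availabilityZone": "eu-west-2a", "lastStatus": "RUNNING"},
    {"group": "service:tis-revalidation-core", "availabilityZone": "eu-west-2a", "lastStatus": "RUNNING"},
    {"group": "service:tis-revalidation-integration", "availabilityZone": "eu-west-2a", "lastStatus": "RUNNING"},
    {"group": "service:tis-revalidation-integration", "availabilityZone": "eu-west-2b", "lastStatus": "RUNNING"},
]
gaps = zones_missing_services(
    example_tasks,
    ["tis-revalidation-core", "tis-revalidation-integration"],
    ["eu-west-2a", "eu-west-2b"],
)
print(gaps)  # {'eu-west-2b': ['tis-revalidation-core']}
```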

...

Root Cause(s)

  • Requests were taking a long time, but not always

  • The request for a string value goes through a number of services

  • Requests for environment information to the tis-revalidation-core service were intermittently timing out, and users were sometimes redirected to stage

  • One of the AZs had an integration service task running but no core service task alongside it

  • When the application failed to retrieve the environment information, it defaulted to staging/localhost
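The silent fallback in the last point is what turned a timeout into a redirect to the wrong environment. One alternative, sketched below under assumptions, is to retry the lookup a few times and then fail loudly rather than default. The names `resolve_environment`, `EnvironmentLookupError` and the `fetch` callable are hypothetical and do not reflect the application's actual code.

```python
import time


class EnvironmentLookupError(RuntimeError):
    """Raised when the environment name cannot be determined."""


def resolve_environment(fetch, attempts=3, delay_seconds=0.5, sleep=time.sleep):
    """Call `fetch()` until it returns a non-empty environment name.

    `fetch` is any zero-argument callable that performs the lookup
    (hypothetical here). After `attempts` failures an error is raised
    instead of falling back to a default such as staging/localhost.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            value = fetch()
            if value:
                return value
            last_error = ValueError("empty environment value")
        except Exception as exc:  # e.g. a timeout from the HTTP client
            last_error = exc
        if attempt < attempts - 1:
            sleep(delay_seconds)
    raise EnvironmentLookupError(
        f"could not determine environment after {attempts} attempts"
    ) from last_error
```

Raising an explicit error makes the failure visible in logs and to the caller, where a default value would hide it.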

...

Action Items

Action Items

Owner

Set auto-scaling {with min/max the same value?}

Prod API Gateway needs to send metrics & logs on to CloudWatch

Monitor & Alert on something that indicates a problem, e.g. HTTP Errors/Latency in API Gateway?

TBD: Put the variable in a resource with wider availability (across availability zones)
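As a starting point for the monitoring item, the sketch below builds the parameters for a CloudWatch alarm on API Gateway latency. It assumes a REST API (which publishes the `Latency` metric in the `AWS/ApiGateway` namespace with `ApiName` and `Stage` dimensions); the threshold, evaluation window and alarm name are placeholder values, not agreed settings.

```python
def latency_alarm_params(api_name, stage, threshold_ms=5000.0):
    """Build keyword arguments for CloudWatch's put_metric_alarm.

    Alarms when average API Gateway latency stays above `threshold_ms`
    for five consecutive one-minute periods (placeholder values).
    """
    return {
        "AlarmName": f"{api_name}-{stage}-high-latency",
        "Namespace": "AWS/ApiGateway",
        "MetricName": "Latency",  # reported in milliseconds
        "Dimensions": [
            {"Name": "ApiName", "Value": api_name},
            {"Name": "Stage", "Value": stage},
        ],
        "Statistic": "Average",
        "Period": 60,
        "EvaluationPeriods": 5,
        "Threshold": threshold_ms,
        "ComparisonOperator": "GreaterThanThreshold",
        # Quiet periods with no traffic should not trigger the alarm.
        "TreatMissingData": "notBreaching",
    }


# Usage (assumed API and stage names, requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**latency_alarm_params("example-api", "prod"))
```

The same shape works for error rates by swapping `MetricName` to `5XXError` and using the `Sum` statistic with a suitable threshold.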

...