Date	18 Aug 2022
Authors	Joseph (Pepe) Kelly, Cai Willis
Status	Ongoing
Summary	Revalidation production login delay and logout redirecting to stage and localhost
Impact	An admin user had difficulty submitting a recommendation

Non-technical Description

The login process for the revalidation application was intermittently failing as it was unable to retrieve environment information (Production/Staging/etc).

Our applications live in the cloud. This means that another company (in this case Amazon Web Services) manages the physical machines that run our applications and provides us with features that make our applications more robust and fault tolerant. In this case the relevant feature is “Availability Zones”: our application runs in multiple data centres simultaneously (and sometimes each Availability Zone is itself a group of data centres!). If one of these zones suffers even a catastrophic failure (e.g. fire, flooding or power loss), the instances of our application in the other zones will still be running fine. AWS then manages how traffic is split between these zones so that this duplication is invisible to the user.

Our applications are also split into “MicroServices”. Instead of a single program doing everything, we split functionality into lots of smaller applications which talk to each other to fulfil some wider purpose. These are easier to maintain and provide some fault tolerance (e.g. if one stops working it doesn’t necessarily mean the whole of TIS stops working!).

The root cause of this issue was that a failure in a deployment left one of these Availability Zones without one of our applications, so when a user tried to log in and their request was routed to that Availability Zone, the process would fail.

Fixing this was as simple as redeploying the affected application (a zero downtime operation).

Trigger

Detection

A user reported issues with submitting a recommendation and that they didn’t seem able to log in to the revalidation application

Resolution

Provided a temporary workaround for the user (using incognito mode to prevent use of a cached request)
Forced redeployment of tasks in ECS for tis-revalidation-core service
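
The forced redeployment can be sketched with the boto3 ECS client. This is a minimal sketch rather than the exact command used during the incident: the cluster name `example-cluster` is hypothetical, and only the service name comes from this report. The client is passed in so the function can be exercised without AWS credentials.

```python
def force_redeploy(ecs_client, cluster, service):
    """Ask ECS to replace the running tasks of a service.

    forceNewDeployment=True starts a new deployment with the same task
    definition, so ECS launches fresh tasks and then stops the old ones.
    """
    response = ecs_client.update_service(
        cluster=cluster,
        service=service,
        forceNewDeployment=True,
    )
    # Each entry describes one deployment (the new one plus any being drained).
    return response["service"]["deployments"]


# Example usage (assumes boto3 is installed and credentials are configured;
# "example-cluster" is a placeholder, not the real cluster name):
#
#   import boto3
#   deployments = force_redeploy(
#       boto3.client("ecs"), "example-cluster", "tis-revalidation-core"
#   )
```

Because ECS starts the replacement tasks before stopping the old ones under the default rolling update settings, this is consistent with the zero downtime noted above.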

Timeline

BST unless otherwise stated

18 Aug 2022 09:20 - Responding to a query about submitting recommendations, we noticed that the login process was very slow
18 Aug 2022 11:20 - User was able to complete intended task using workaround (incognito mode)
18 Aug 2022 14:28 - Ruled out an issue with the applications after inspection of logs - focussed on the request itself
18 Aug 2022 11:00 - Mobbing the problem led to the realisation that two tis-revalidation-core tasks were in the same AZ
19 Aug 2022 11:30 - Forced redeployment of tasks - issue resolved

Root Cause(s)

Requests for environment information to the tis-revalidation-core service were timing out
One of the AZs did not have a core service task running alongside an integration service task
When the application failed to retrieve the environment information, it defaulted to staging/localhost
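
The uneven placement described above could be detected with a small check over ECS task descriptions. The sketch below assumes task dictionaries shaped like the entries returned by the ECS `describe_tasks` call, which include `availabilityZone` and a `group` of the form `service:<name>`. The zone names and the integration service name in the usage example are hypothetical.

```python
from collections import defaultdict


def services_by_zone(tasks):
    """Map each Availability Zone to the set of services with a task there."""
    zones = defaultdict(set)
    for task in tasks:
        group = task.get("group", "")
        # Tasks started by an ECS service have a group of "service:<name>".
        if group.startswith("service:"):
            zones[task["availabilityZone"]].add(group[len("service:"):])
    return dict(zones)


def zones_missing(tasks, required_service):
    """Return the zones that run some service task but not the required one."""
    coverage = services_by_zone(tasks)
    return sorted(
        zone for zone, services in coverage.items()
        if required_service not in services
    )


# Example usage with made-up data mirroring the incident: both core tasks
# sit in one zone, so the other zone has no core task.
example_tasks = [
    {"availabilityZone": "zone-a", "group": "service:tis-revalidation-core"},
    {"availabilityZone": "zone-a", "group": "service:tis-revalidation-core"},
    {"availabilityZone": "zone-a", "group": "service:example-integration"},
    {"availabilityZone": "zone-b", "group": "service:example-integration"},
]
print(zones_missing(example_tasks, "tis-revalidation-core"))
```

Note that this only reports zones in which at least one service task is running; a zone with no tasks at all would need to be checked against the list of zones the cluster is expected to use.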

Action Items

Action Items	Owner
Set auto-scaling {with min/max the same value?}
Prod API Gateway needs to send metrics and logs on to CloudWatch
Monitor & Alert on something that indicates a problem, e.g. HTTP Errors/Latency in API Gateway?
TBD: Put the variable in a resource with wider availability (across availability zones)
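
As one possible shape for the monitoring action item, the sketch below builds the parameters for a CloudWatch alarm on API Gateway server errors. It assumes a REST API, which publishes the `5XXError` metric in the `AWS/ApiGateway` namespace with an `ApiName` dimension. The API name, threshold and period are placeholders, not agreed values.

```python
def api_gateway_5xx_alarm(api_name, threshold=1, period_seconds=300):
    """Build keyword arguments for CloudWatch put_metric_alarm.

    The alarm fires when the number of 5XX responses in one period
    reaches the threshold. Missing data is treated as healthy so a
    quiet API does not raise the alarm.
    """
    return {
        "AlarmName": f"{api_name}-5xx-errors",
        "Namespace": "AWS/ApiGateway",
        "MetricName": "5XXError",
        "Dimensions": [{"Name": "ApiName", "Value": api_name}],
        "Statistic": "Sum",
        "Period": period_seconds,
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",
    }


# Example usage (assumes boto3 and credentials; "example-prod-api" is a
# placeholder for the real API name):
#
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(
#       **api_gateway_5xx_alarm("example-prod-api")
#   )
```

A similar alarm on the `Latency` metric, using an average or percentile statistic, would target the slow logins seen in this incident rather than outright errors.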


2022-08-18 Reval login intermittently failing
