2022-08-18 Reval login intermittently failing
Date | Aug 18, 2022 |
Authors | @Joseph (Pepe) Kelly @Cai Willis |
Status | Resolved |
Summary | Revalidation production login delay and logout redirecting to stage and localhost |
Impact | An admin user had problems submitting a recommendation |
Non-technical Description
The login process for the revalidation application was intermittently failing as it was unable to retrieve environment information (Production/Staging/etc).
Our applications live in the cloud. This means that another company (in this case Amazon Web Services) manages the physical machines that run our applications and provides features that make them more robust and fault tolerant. The relevant feature here is “Availability Zones”: our application runs in multiple data centres simultaneously (an Availability Zone can itself be a group of data centres). If one of them suffers even a catastrophic failure (e.g. fire, flooding or power loss), the instances of our application in the other zones keep running. AWS manages how traffic is split between these zones, so this duplication is invisible to the user.
Our applications are also split into “microservices”: instead of a single program doing everything, functionality is split across lots of smaller applications that talk to each other to fulfil a wider purpose. These are easier to maintain and provide some fault tolerance (e.g. if one stops working, it doesn’t necessarily mean the whole of TIS stops working).
The root cause of this issue was that a failure during a deployment left one of these Availability Zones without one of our microservices, so when a user tried to log in and their request was routed to that Availability Zone, the login would fail.
Fixing this was as simple as redeploying the affected application (a zero downtime operation).
Trigger
Detection
A user reported issues with submitting a recommendation and that they didn’t seem able to log in to the revalidation application
Resolution
Provided a temporary workaround for the user (using incognito mode to prevent use of a cached request)
Forced redeployment of tasks in ECS for tis-revalidation-core service
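The forced redeployment can also be triggered from a script rather than the console. The following is a minimal sketch, assuming a boto3-style ECS client is passed in; `force_redeploy` is a hypothetical helper name, and the cluster name is not recorded in this report, so it is left as a parameter.

```python
def force_redeploy(ecs_client, cluster, service):
    """Ask ECS to replace a service's running tasks, keeping the same task definition.

    `ecs_client` is expected to behave like boto3.client("ecs").
    Returns the list of deployments ECS reports for the service.
    """
    response = ecs_client.update_service(
        cluster=cluster,              # assumption: supplied by the caller
        service=service,              # e.g. "tis-revalidation-core"
        forceNewDeployment=True,      # start a new deployment with no other changes
    )
    return response["service"]["deployments"]
```

With boto3 installed and credentials configured, this would be called as `force_redeploy(boto3.client("ecs"), cluster_name, "tis-revalidation-core")`. Taking the client as an argument keeps the helper easy to exercise with a stub.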
Timeline
BST unless otherwise stated
Aug 18, 2022 09:20 - Responding to a query about submitting recommendations, we noticed that the login process was very slow
Aug 18, 2022 11:20 - User was able to complete intended task using workaround (incognito mode)
Aug 18, 2022 14:28 - Ruled out an issue with the applications after inspection of logs - focussed on the request itself
Aug 18, 2022 11:00 - Mobbing the problem led to the realisation that two tis-revalidation-core tasks were in the same AZ
Aug 19, 2022 11:30 - Forced redeployment of tasks - issue resolved
Root Cause(s)
Requests for environment information to the tis-revalidation-core service were timing out
One of the AZs did not have a core service task running alongside an integration service task
When the application failed to retrieve the environment information, it defaulted to staging/localhost
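The placement problem described above (an AZ with an integration task but no core task) can be detected mechanically. The following is a sketch only: it assumes task records shaped like entries in the ECS DescribeTasks response (`availabilityZone` and a `group` of the form `service:<name>`). The AZ names and the integration service name in the example are hypothetical.

```python
from collections import defaultdict


def missing_services_by_az(tasks, required_services):
    """Return {az: sorted list of required services with no task in that AZ}.

    Only AZs that appear in `tasks` are checked; tasks not started by a
    service (no "service:" group prefix) are ignored.
    """
    services_in_az = defaultdict(set)
    for task in tasks:
        group = task.get("group", "")
        if group.startswith("service:"):
            services_in_az[task["availabilityZone"]].add(group[len("service:"):])

    gaps = {}
    for az, present in services_in_az.items():
        missing = sorted(set(required_services) - present)
        if missing:
            gaps[az] = missing
    return gaps


# Hypothetical snapshot resembling this incident: both core tasks in one AZ.
example_tasks = [
    {"availabilityZone": "eu-west-2a", "group": "service:tis-revalidation-core"},
    {"availabilityZone": "eu-west-2a", "group": "service:tis-revalidation-core"},
    {"availabilityZone": "eu-west-2a", "group": "service:tis-revalidation-integration"},
    {"availabilityZone": "eu-west-2b", "group": "service:tis-revalidation-integration"},
]
example_required = ["tis-revalidation-core", "tis-revalidation-integration"]
print(missing_services_by_az(example_tasks, example_required))
# -> {'eu-west-2b': ['tis-revalidation-core']}
```

A non-empty result is the condition that would have flagged this incident before a user noticed it.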
Action Items
Action Items | Owner |
---|---|
Set auto-scaling {with min/max the same value?} | |
Prod API Gateway needs to send metrics & logs on to CloudWatch | |
Monitor & alert on something that indicates a problem, e.g. HTTP errors/latency in API Gateway? | |
TBD: Put the variable in a resource with wider availability (across availability zones) | |
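As a starting point for the monitoring action item, the sketch below builds the keyword arguments for a CloudWatch `put_metric_alarm` call on an API Gateway metric. It assumes a REST API, where metrics such as `5XXError` and `Latency` are published in the `AWS/ApiGateway` namespace with `ApiName` and `Stage` dimensions. The helper name, alarm naming scheme, period and thresholds are illustrative assumptions, not agreed values.

```python
def api_gateway_alarm_params(api_name, stage, metric_name, threshold, statistic="Sum"):
    """Build put_metric_alarm keyword arguments for one API Gateway metric.

    Assumes a REST API; HTTP APIs publish differently named metrics and dimensions.
    """
    return {
        "AlarmName": f"{api_name}-{stage}-{metric_name}",  # hypothetical naming scheme
        "Namespace": "AWS/ApiGateway",
        "MetricName": metric_name,          # e.g. "5XXError" or "Latency"
        "Dimensions": [
            {"Name": "ApiName", "Value": api_name},
            {"Name": "Stage", "Value": stage},
        ],
        "Statistic": statistic,             # "Sum" for error counts, "Average" for latency
        "Period": 300,                      # seconds; illustrative
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # no traffic is not treated as a fault
    }
```

The result would be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`. An alarm on `Latency` with `statistic="Average"` and a threshold in milliseconds would target the slow login seen in this incident.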
Lessons Learned
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213