2022-06-01 Several TIS services not able to use TIS authorisation
Date | Jun 1, 2022 |
Authors | @Joseph (Pepe) Kelly |
Status | Done |
Summary | There were “stale” references to the previous profile service in the web application for HEE and in services for NIMDTA (TCS, Assessments, User Management, Reference) |
Impact | Admins could not see the records they should have had access to |
Non-technical Description
TIS is made up of multiple “microservices”, small components with individual responsibilities, which work together to provide the full TIS application. One such example is the “assessments” microservice which provides all of TIS’s assessment functionality.
Many of these microservices connect to the “profile” microservice to check what each particular user is allowed to see and do on TIS.
The profile microservice was moved during the week, and the previous version was switched off on Wednesday 1st June. A configuration issue meant that some microservices were still attempting to connect to the previous version.
This caused some users to see a constantly refreshing page, while for others, TIS loaded without their permission to see any records.
This was resolved by restarting the failing microservices so that they picked up the new location.
Trigger
Previous version of profile microservice switched off.
Detection
Identified by the TIS team and reported by a user on Teams
Resolution
Re-ran the playbook for the api-gateway (HEE)
Restarted a number of microservices (NIMDTA)
Timeline
BST unless otherwise stated
May 30, 2022 to May 31, 2022 - New version released and verified to be running correctly
May 31, 2022 16:35 - Logging threshold dropped to monitor for connections to previous version
Jun 1, 2022 10:30 to 11:20 - Manual verification of the application in HEE production and of log data across all environments (the application check is assumed to have passed only because an existing session was used)
Jun 1, 2022 11:22 - Legacy version stopped following verification
Jun 1, 2022 11:31 - User reports of problems (HEE)
Jun 1, 2022 12:02 - Temporarily re-enabled the legacy profile microservice (fixed for HEE)
Jun 1, 2022 12:02 to 12:18 - Reapplied web configuration to use the new service (all environments)
Jun 1, 2022 13:47 - User reports of problems (NIMDTA)
Jun 1, 2022 13:56 - Temporarily re-enabled the legacy profile microservice (fixed for NIMDTA)
Jun 1, 2022 14:25 to 14:39 - Restarted NIMDTA microservices to pick up the location of the new profile microservice
Root Cause(s)
Outdated references to the previous version of Profile were still in use.
Missing steps:
The steps could have been marked off against a comprehensive list
Distractions: other meetings/calls/PRs
We didn’t do the same level of verification across all “production” instances.
Verification used an existing session, and monitoring reported healthy responses.
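One way to work towards the "comprehensive list" mentioned above is to check every service's configuration for references to the old host before switching it off. The following is a minimal sketch only: the host names, the `PROFILE_URL` setting and the `find_stale_references` helper are hypothetical and are not taken from the TIS codebase. It inspects configuration values only, so a running process that has not been restarted could still hold an old value even when this check passes.

```python
from typing import Dict, List, Tuple


def find_stale_references(
    service_configs: Dict[str, Dict[str, str]],
    legacy_host: str,
) -> List[Tuple[str, str]]:
    """Return (service, setting) pairs whose value still mentions legacy_host."""
    stale = []
    for service, settings in sorted(service_configs.items()):
        for key, value in sorted(settings.items()):
            if legacy_host in value:
                stale.append((service, key))
    return stale


# Hypothetical configuration snapshot, for illustration only.
configs = {
    "tcs": {"PROFILE_URL": "http://profile-legacy.example.internal:8082"},
    "assessments": {"PROFILE_URL": "http://profile.example.internal:8082"},
    "reference": {"PROFILE_URL": "http://profile-legacy.example.internal:8082"},
}

stale = find_stale_references(configs, "profile-legacy.example.internal")
for service, key in stale:
    print(f"{service}: {key} still points at the legacy profile service")
```

An empty result would be one item to tick off per environment before the legacy service is stopped.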
Action Items
Action Items | Owner |
---|---|
Automated tests: | @Reuben Roberts / @Joseph (Pepe) Kelly |
Set up smoke tests, possibly triggered “on-demand” API tests (e.g. Postman) | |
How would we use VPC Flow Logs and Reachability Analyzer to test for / spot issues following configuration updates? | |
Are there metrics to show what impact this (or any) kind of outage has caused? | |
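The smoke tests proposed above could take roughly the following shape. This is a sketch under assumptions, not an agreed design: `run_smoke_tests` and the `probe` callable are hypothetical names. The important property is that `probe` should log in with a brand-new session and count the records that user can see, because the manual check in this incident appears to have passed only through an existing session. A stand-in probe is used here so that the example runs without network access.

```python
from typing import Callable, Dict, List


def run_smoke_tests(services: List[str], probe: Callable[[str], int]) -> Dict[str, str]:
    """Run probe against each service and summarise the outcome.

    probe(service) is expected to authenticate with a fresh session and
    return how many records are visible; it may raise on connection errors.
    """
    results = {}
    for service in services:
        try:
            visible = probe(service)
        except Exception as exc:
            results[service] = f"FAIL: {exc}"
            continue
        results[service] = "OK" if visible > 0 else "FAIL: no records visible"
    return results


def fake_probe(service: str) -> int:
    """Stand-in for a real fresh-session API call (hypothetical data)."""
    if service == "assessments":
        raise ConnectionError("profile service unreachable")
    return {"tcs": 12, "reference": 0}[service]


results = run_smoke_tests(["tcs", "assessments", "reference"], fake_probe)
for service, outcome in results.items():
    print(f"{service}: {outcome}")
```

Treating "zero records visible" as a failure is deliberate: in this incident some users could load TIS but had no permission to see any records, which a plain health check would report as healthy.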
Lessons Learned
There is no perfect monitoring.
I expect that using “Flow Logs” or other network analysis would have picked this up.
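As an illustration of the network analysis suggested above, the sketch below scans VPC Flow Log records in the default (version 2) space-separated format and counts which source addresses are still sending traffic to the legacy service. The IP addresses, account ID and interface ID are made up, and `callers_of` is a hypothetical helper; it also assumes the legacy service has a stable private IP, which may not hold in every environment.

```python
from collections import Counter
from typing import Iterable

# Field order of the default (version 2) VPC Flow Log record format.
FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]


def callers_of(records: Iterable[str], legacy_ip: str) -> Counter:
    """Count flow records per source address whose destination is legacy_ip."""
    callers = Counter()
    for line in records:
        parts = line.split()
        if len(parts) != len(FIELDS):
            continue  # skip headers or malformed lines
        record = dict(zip(FIELDS, parts))
        if record["dstaddr"] == legacy_ip:
            callers[record["srcaddr"]] += 1
    return callers


# Made-up sample records; 10.0.9.9 stands in for the legacy profile service.
sample = [
    "2 123456789012 eni-0a1b2c3d 10.0.1.5 10.0.9.9 49152 8082 6 10 840 1654079400 1654079460 ACCEPT OK",
    "2 123456789012 eni-0a1b2c3d 10.0.1.5 10.0.9.9 49153 8082 6 8 672 1654079460 1654079520 ACCEPT OK",
    "2 123456789012 eni-0a1b2c3d 10.0.2.7 10.0.9.9 50000 8082 6 4 336 1654079460 1654079520 ACCEPT OK",
    "2 123456789012 eni-0a1b2c3d 10.0.3.3 10.0.8.8 50001 8082 6 4 336 1654079460 1654079520 ACCEPT OK",
    "2 123456789012 eni-0a1b2c3d - - - - - - - 1654079460 1654079520 - NODATA",
]

callers = callers_of(sample, "10.0.9.9")
for source, count in callers.most_common():
    print(f"{source} still connects to the legacy address ({count} flow records)")
```

Any non-empty result in the window before switch-off would point at a service that still needs its configuration reapplied or a restart.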
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213