Date	May 6, 2022
Authors	@Joseph (Pepe) Kelly
Status	Done
Summary	There were “stale” references in the web application for HEE and services for NIMDTA (TCS, Assessments, User Management, Reference)
Impact	Admins could not see the records they should have had access to

Non-technical Description

TIS is made up of multiple “microservices”, small components with individual responsibilities, which work together to provide the full TIS application. One such example is the “assessments” microservice which provides all of TIS’s assessment functionality.

Many of these microservices connect to the “profile” microservice to check what each particular user is allowed to see and do on TIS.

The profile microservice was moved during the week, with the previous version being switched off on Wednesday 1st June. A configuration issue meant some microservices were still attempting to connect to the previous version.

This caused some users to see a constantly refreshing page, while for others TIS loaded without the permissions they needed to see any records.

This was resolved by restarting the failing microservices so they were aware of the new location.

Trigger

Previous version of profile microservice switched off.

Detection

Identified by TIS team & reported by a user on Teams

Resolution

Re-ran the playbook for the api-gateway (HEE)
Restarted a number of microservices (NIMDTA)

Timeline

BST unless otherwise stated

May 30, 2022 to May 31, 2022 - New version released and verified to be running correctly
May 31, 2022 16:35 - Logging threshold dropped to monitor for connections to previous version
Jun 1, 2022 10:30 to 11:20 - Manual verification of application in HEE production and log data across all environments (assumed to have passed verification due to existing session)
Jun 1, 2022 11:22 - Legacy version stopped following verification
Jun 1, 2022 11:31 - User reports of problems (HEE)
Jun 1, 2022 12:02 - Temporarily re-enabled the legacy profile microservice (fixed for HEE)
Jun 1, 2022 12:02 to 12:18 - Reapplied web configuration to use the new service (all environments)
Jun 1, 2022 13:47 - User reports of problems (NIMDTA)
Jun 1, 2022 13:56 - Temporarily re-enabled the legacy profile microservice (fixed for NIMDTA)
Jun 1, 2022 14:25 to 14:39 - Restarted NIMDTA microservices to pick up the location of the new profile microservice

Root Cause(s)

Outdated references to the previous version of Profile were still in use.
Missing steps:
- Steps could have been marked off against a comprehensive list
- Distractions: other meetings/calls/PRs
Didn’t do the same level of verification across all “production” instances.
Verified using existing session & monitoring reported healthy responses.
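
The “comprehensive list” idea above could be partly automated. The following is a minimal sketch, assuming each service's configuration can be exported as key/value pairs. The service names, keys and host names are hypothetical and are not taken from the TIS configuration.

```python
from typing import Dict, List, Tuple


def find_stale_references(
    configs: Dict[str, Dict[str, str]], legacy_host: str
) -> List[Tuple[str, str]]:
    """Return (service, key) pairs whose value still mentions legacy_host."""
    stale = []
    for service, settings in sorted(configs.items()):
        for key, value in sorted(settings.items()):
            if legacy_host in value:
                stale.append((service, key))
    return stale


# Hypothetical example data: names and URLs are illustrative only.
EXAMPLE_CONFIGS = {
    "assessments": {"PROFILE_URL": "http://profile-old.internal:8082/profile"},
    "tcs": {"PROFILE_URL": "http://profile-new.internal:8082/profile"},
}

if __name__ == "__main__":
    for service, key in find_stale_references(EXAMPLE_CONFIGS, "profile-old.internal"):
        print(f"{service}: {key} still points at the legacy profile service")
```

Running a check like this for every environment before switching off the legacy service would give a list to mark off, rather than relying on memory.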

Action Items

Action Items	Owner

Automated tests: check that the UptimeRobot health check fails with bad credentials; can we check for specific text?	@Reuben Roberts / @Joseph (Pepe) Kelly
Set up smoke tests, possibly as API tests triggered “on demand” (e.g. Postman)	https://hee-tis.atlassian.net/browse/TIS21-3083
How would we use VPC Flow Logs and Reachability Analyzer to test for or spot issues following configuration updates?	https://hee-tis.atlassian.net/browse/TIS21-3084
Are there metrics to show what impact this (or any) kind of outage has caused?	https://hee-tis.atlassian.net/browse/TIS21-3085
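
The first action item (checking for specific text rather than only a healthy status) could look roughly like the sketch below. It assumes a hypothetical endpoint that returns some marker text only when authorisation has succeeded; the function names, URL and marker are illustrative and not part of TIS. Each call is a new request with no stored cookies, so an existing session cannot mask a failure.

```python
import urllib.error
import urllib.request
from typing import Callable, Tuple

# A fetcher takes a URL and returns (HTTP status, response body).
Fetcher = Callable[[str], Tuple[int, str]]


def fetch_without_session(url: str) -> Tuple[int, str]:
    """Fetch url with a fresh request (no cookies) and return (status, body)."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status, response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as error:
        # 4xx/5xx responses still carry a status code and a body.
        return error.code, error.read().decode("utf-8", errors="replace")
    except urllib.error.URLError as error:
        # Connection-level failure: report status 0 and the reason.
        return 0, str(error.reason)


def smoke_check(
    url: str, expected_text: str, fetch: Fetcher = fetch_without_session
) -> bool:
    """Pass only if the status is 200 AND the body contains expected_text."""
    status, body = fetch(url)
    return status == 200 and expected_text in body
```

The `fetch` parameter is there so the check can be exercised with a stub, for example `smoke_check("http://example.invalid", "records", lambda url: (200, "no permission"))`, which fails even though the status is 200.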

Lessons Learned

No monitoring is perfect.
I expect that using “Flow Logs” or other network analysis would have picked this up.
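
As a rough illustration of that expectation, the sketch below scans VPC Flow Log records in the default (version 2) space-separated format and counts the sources that still tried to reach a legacy address. The IP addresses, account ID and sample records are made up for illustration; a real check would read records from wherever the flow logs are delivered.

```python
from collections import Counter
from typing import Iterable

# Field order of the default VPC Flow Logs record format (version 2).
FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]


def connections_to(records: Iterable[str], legacy_ip: str) -> Counter:
    """Count (source address, action) pairs for traffic sent to legacy_ip."""
    hits = Counter()
    for line in records:
        parts = line.split()
        if len(parts) != len(FIELDS):
            continue  # skip blank lines or records in a custom format
        record = dict(zip(FIELDS, parts))
        if record["dstaddr"] == legacy_ip:
            hits[(record["srcaddr"], record["action"])] += 1
    return hits


# Made-up sample records; 10.0.2.20 stands in for the legacy profile service.
SAMPLE = [
    "2 123456789012 eni-0a1b2c3d 10.0.1.15 10.0.2.20 49152 8082 6 5 300 1654079000 1654079060 REJECT OK",
    "2 123456789012 eni-0a1b2c3d 10.0.1.16 10.0.2.30 49153 8082 6 9 540 1654079000 1654079060 ACCEPT OK",
    "2 123456789012 eni-0a1b2c3d 10.0.1.15 10.0.2.20 49154 8082 6 4 240 1654079060 1654079120 REJECT OK",
]

if __name__ == "__main__":
    for (source, action), count in connections_to(SAMPLE, "10.0.2.20").items():
        print(f"{source} -> legacy profile: {count} flow(s), {action}")
```

Any non-empty result after a cutover would point to a service that still holds a stale reference, regardless of what an existing browser session shows.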

2022-06-01 Several TIS services not able to use TIS authorisation
