2022-06-01 Several TIS services not able to use TIS authorisation

Date

Jun 1, 2022

Authors

@Joseph (Pepe) Kelly

Status

Done

Summary

There were “stale” references to the previous profile microservice in the web application for HEE and in services for NIMDTA (TCS, Assessments, User Management, Reference).

Impact

Admins could not see the records they should have had access to.

Non-technical Description

TIS is made up of multiple “microservices”, small components with individual responsibilities, which work together to provide the full TIS application. One such example is the “assessments” microservice which provides all of TIS’s assessment functionality.

Many of these microservices connect to the “profile” microservice to check what each particular user is allowed to see and do on TIS.

The profile microservice has been moved during the week, with the previous version being switched off on Wednesday 1st June. We experienced a configuration issue which meant some microservices were still attempting to connect to the previous version.

This caused some users to see a constantly refreshing page, while for others, TIS loaded without their permission to see any records.

This was resolved by restarting the failing microservices so they were aware of the new location.


Trigger

  • Previous version of profile microservice switched off.

Detection

  • Identified by TIS team & reported by a user on Teams


Resolution

  • Re-ran the playbook for the api-gateway (HEE)

  • Restarted a number of microservices (NIMDTA)


Timeline

BST unless otherwise stated

  • May 30, 2022 to May 31, 2022 - New version released and verified to be running correctly

  • May 31, 2022 16:35 - Logging threshold dropped to monitor for connections to previous version

  • Jun 1, 2022 10:30 to 11:20 - Manual verification of application in HEE production and log data across all environments (assumed to have passed verification due to existing session)

  • Jun 1, 2022 11:22 - Legacy version stopped following verification

  • Jun 1, 2022 11:31 - User reports of problems (HEE)

  • Jun 1, 2022 12:02 - Temporarily re-enabled the legacy profile microservice (fixed for HEE)

  • Jun 1, 2022 12:02 to 12:18 - Reapplied web configuration to use the new service (all environments)

  • Jun 1, 2022 13:47 - User reports of problems (NIMDTA)

  • Jun 1, 2022 13:56 - Temporarily re-enabled the legacy profile microservice (fixed for NIMDTA)

  • Jun 1, 2022 14:25 to 14:39 - Restarted NIMDTA microservices to pick up the location of the new profile microservice


Root Cause(s)

  • Outdated references to the previous version of the profile microservice were still in use.

  • Missing steps:

    • Could have marked off against a comprehensive list

    • Distractions: other meetings/calls/PRs

  • Didn’t do the same level of verification across all “production” instances.

  • Verification used an existing session, and monitoring reported healthy responses, so the failure was not visible before the legacy version was stopped.
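One way to mark off against a comprehensive list is to scan each service's configured profile URL for the legacy host before switching it off. The following is a minimal sketch only: the hostnames, port, and the `find_stale_references` helper are hypothetical and not taken from the TIS configuration. It would also only catch references held in configuration, not addresses cached by an already-running process, which is what the NIMDTA restarts addressed.

```python
from urllib.parse import urlparse

# Hypothetical hostname, for illustration only (not a real TIS address).
LEGACY_PROFILE_HOST = "profile-legacy.example.internal"


def find_stale_references(configured_urls, legacy_host=LEGACY_PROFILE_HOST):
    """Return the sorted names of services whose profile URL still uses the legacy host."""
    stale = []
    for service, url in configured_urls.items():
        # Compare on hostname only, so port and path differences are ignored.
        if urlparse(url).hostname == legacy_host:
            stale.append(service)
    return sorted(stale)


# Hypothetical configuration snapshot: service name -> configured profile URL.
configured = {
    "tcs": "http://profile-legacy.example.internal:8082/profile",
    "assessments": "http://profile.example.internal:8082/profile",
    "reference": "http://profile-legacy.example.internal:8082/profile",
}

stale_services = find_stale_references(configured)
print(stale_services)
```

Running a check like this against every “production” instance, rather than one, would give the same level of verification across environments.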

 


Action Items

Action Items

Owner

Automated tests:

  • Check uptime robot healthcheck fails with bad credentials; can we do a check for specific text?

@Reuben Roberts / @Joseph (Pepe) Kelly

Set up smoke tests, possibly triggered “on-demand”

API tests (e.g. Postman)

https://hee-tis.atlassian.net/browse/TIS21-3083

How would we use VPC Flow Logs and Reachability Analyzer to test for and spot issues following configuration updates?

https://hee-tis.atlassian.net/browse/TIS21-3084

Are there metrics to show what impact this (or any) kind of outage has caused?

https://hee-tis.atlassian.net/browse/TIS21-3085
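The health check item above asks whether a check for specific text is possible. As a rough sketch of the idea, a check can treat a response as healthy only when the status is successful and the body contains text that appears solely after authorisation has worked. The `is_healthy` function, the marker text, and the sample responses are all assumptions for illustration; this does not describe how the monitoring tool is actually configured.

```python
def is_healthy(status_code, body, expected_text):
    """Healthy only if the status is 2xx AND the body contains the expected text."""
    return 200 <= status_code < 300 and expected_text in body


# Hypothetical marker that would only appear once permissions have loaded.
EXPECTED = "Trainee records"

# A 200 response can still be a failure from the user's point of view.
ok_page = is_healthy(200, "<h1>Trainee records</h1>", EXPECTED)
empty_page = is_healthy(200, "<h1>No records available</h1>", EXPECTED)
error_page = is_healthy(503, "<h1>Trainee records</h1>", EXPECTED)
print(ok_page, empty_page, error_page)
```

A status-only check would have passed the second case, which resembles what users saw when TIS loaded without their permissions.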

 

 


Lessons Learned

  • No perfect monitoring.

  • I expect that using “Flow Logs” or other network analysis would have picked this up