Date

Authors

Andy Dingley

Status

Implementing

Summary

Several TIS services, namely Generic Upload and the ESR services, could not authenticate with Keycloak, resulting in outages in those functions.

Impact

Admins were not able to upload any bulk import/update spreadsheets. TODO: confirm ESR impact. Was ESR Bi-directional unable to import/export changes?

Non-technical Description

TIS is made up of multiple “microservices”, small components with individual responsibilities, which work together to provide the full TIS application. One such example is the “bulk upload” microservice which provides all of TIS’s bulk upload/update/create functionality.

These microservices connect to each other in order to perform their tasks. For example, the bulk upload microservice extracts data from the spreadsheet and sends it to another microservice which is capable of handling the person/placement/assessment/etc. data.

Before a microservice can connect to another microservice, it must authenticate (log in) to gain access, in a similar way to how users log in to Admins UI.

We experienced a configuration issue which stopped those authentication requests from being sent. As a result, the microservice could not “log in” and any subsequent connection to other microservices would have been denied.

In the case of bulk upload this meant that we were unable to process the uploaded spreadsheets as the extracted data could not be sent to a microservice capable of handling it.


Trigger

  • Apache/Docker updated on the Stage, Prod and management VMs.

Detection

  • Generic Upload: reported by user on Teams

  • ESR: detected by Sentry?


Resolution

  • Fix the hosts files across each service/environment
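As an illustration of the kind of check that could accompany this fix, the sketch below scans hosts-file text for hostnames pinned to a loopback address. It is a minimal sketch under assumptions: the function name `loopback_host_entries` and the sample hostnames are hypothetical, and the actual hosts-file contents involved in this incident are not recorded here.

```python
import ipaddress


def loopback_host_entries(hosts_text):
    """Return {hostname: address} for every hosts-file entry that maps to loopback."""
    entries = {}
    for line in hosts_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        address, names = fields[0], fields[1:]
        try:
            ip = ipaddress.ip_address(address)
        except ValueError:
            # Not a valid IP address in the first column; skip the line.
            continue
        if ip.is_loopback:
            for name in names:
                entries[name] = address
    return entries


# Hypothetical sample content, not taken from the affected servers.
SAMPLE_HOSTS = """\
127.0.0.1   localhost
127.0.0.1   auth.example.invalid   # hypothetical public hostname
10.0.0.5    db.example.invalid
::1         ip6-localhost
"""
```

Running `loopback_host_entries(SAMPLE_HOSTS)` lists `localhost`, `auth.example.invalid` and `ip6-localhost`; any entry other than the expected localhost names would be a candidate for review.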


Timeline

BST unless otherwise stated

  • - Blue stage server upgraded. TIS continued to function normally after this upgrade, which was used to develop the upgrade sequence for the other EC2 instances.

  • 10:30 to 15:13 - Green stage and monitoring server were upgraded. The build server was partially upgraded.

  • 14:xx - Message to James Harris asking whether it was possible to do bulk uploads, with no indication that they were encountering an issue.

  • 17:15 - Paused production monitoring and began applying upgrades to blue production.

  • 17:20 - Hit similar issues on the production servers with packages left over from images migrated to AWS; these were resolved.

  • 18:20 - Prod inaccessible while being upgraded.

  • 19:04 - Validated that the upgrade was working on the blue server; upgrade of the green server began.

  • 19:42 - Prod appeared fully accessible with upgraded components, monitoring re-enabled.

  • 09:44 - TIS Admin reported, via TIS Support Channel, that they were getting an “Unknown Server Error” when performing bulk uploads.

  • 10:47 - Users informed on TIS Support Channel that we were aware of a bulk upload issue affecting all users and were investigating.

  • 12:16 - Users informed on TIS Support Channel that we had deployed a fix for bulk upload.

  • 15:06 - Networking change applied to production and workaround hotfix removed (after validating in stage environment).

  • 17:06 to 18:40 - ESR integration services re-enabled and monitored for processing. One container definition required modified networking.


Root Cause(s)

  • The services were no longer able to access Keycloak via the public URL, which resolved to a loopback address

  • The hosts file was not configured correctly

  • The Apache/Docker upgrade caused/required some unexpected configuration changes

  • ???
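The first root cause above (a public hostname resolving to a loopback address) can be checked from inside a container or VM with a few lines of Python. This is a minimal sketch using only the standard library; the function names are hypothetical and the hostname in the usage note is a placeholder, not the real Keycloak URL.

```python
import ipaddress
import socket


def resolved_addresses(hostname):
    """Return the set of IP addresses the local resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, None)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def loopback_addresses(addresses):
    """Return the subset of addresses that are loopback (127.0.0.0/8 or ::1)."""
    found = set()
    for addr in addresses:
        # Strip any IPv6 zone ID (e.g. "%eth0") before parsing.
        ip = ipaddress.ip_address(addr.split("%")[0])
        if ip.is_loopback:
            found.add(addr)
    return found
```

For example, `loopback_addresses(resolved_addresses("keycloak.example.invalid"))` (placeholder hostname) returning a non-empty set from inside a service container would indicate that requests to that hostname never leave the container.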


Action Items

Action Items

Owner


Lessons Learned

  • Service Dependencies
