2022-05-06 Several TIS services not able to authenticate internally
Date | May 6, 2022 |
Authors | @Andy Dingley @Joseph (Pepe) Kelly @John Simmons (Deactivated) |
Status | Done |
Summary | Several TIS services, namely the Generic Upload and ESR services, could not authenticate with Keycloak, resulting in outages of those functions. |
Impact | Admins were not able to upload any bulk import/update spreadsheets; the ESR bi-directional integration was not able to import/export changes until its services were restored. |
Non-technical Description
TIS is made up of multiple “microservices”, small components with individual responsibilities, which work together to provide the full TIS application. One such example is the “bulk upload” microservice which provides all of TIS’s bulk upload/update/create functionality.
These microservices connect to each other to perform their tasks. For example, the bulk upload microservice extracts data from the spreadsheet and sends it to another microservice which is capable of handling the person/placement/assessment/etc. data.
Before a microservice can connect to another microservice it must authenticate (log in) to gain access, in a similar way to how users log in to Admins UI.
We experienced a configuration issue which stopped those authentication requests from being sent. As a result, the microservice could not “log in”, and any subsequent connections to other microservices were denied.
In the case of bulk upload, this meant that we were unable to process the uploaded spreadsheets as the extracted data could not be sent to a microservice capable of handling it.
This was resolved by giving the servers a consistent way of finding the authentication service (and any other service), whether the request stays within a single server or crosses to another, so requests should always be able to find the correct route.
Trigger
Apache/Docker updated on the Stage, Prod, Nimdta and management VMs.
Detection
Generic Upload: reported by a user on Teams
ESR: detected by Alertmanager, but only noticed by @Joseph (Pepe) Kelly when pausing alerts for work that was already underway
Resolution
Fixed the hosts files across each service/environment
Updated the DNS nameservers to point externally, to the network's resolvers (see the validation sketch below)
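To validate this kind of fix, a check along the following lines can confirm that service hostnames no longer resolve to a loopback address from inside the servers/containers. This is a minimal sketch only; the hostnames are placeholders, not the real TIS/Keycloak URLs.

```python
# Minimal sketch: verify that hostnames used for service-to-service calls resolve
# to routable addresses (not loopback) after the hosts-file / nameserver fix.
# The hostnames below are placeholders, not the real TIS/Keycloak URLs.
import ipaddress
import socket

HOSTNAMES = [
    "keycloak.example.nhs.uk",  # placeholder for the Keycloak public hostname
    "apps.example.nhs.uk",      # placeholder for other service hostnames
]

def resolves_cleanly(hostname: str) -> bool:
    """Return True if the hostname resolves and none of its addresses are loopback."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror as exc:
        print(f"{hostname}: does not resolve ({exc})")
        return False
    addresses = {info[4][0] for info in infos}
    loopbacks = {a for a in addresses if ipaddress.ip_address(a).is_loopback}
    if loopbacks:
        print(f"{hostname}: resolves to loopback {sorted(loopbacks)} - hosts file/DNS still wrong")
        return False
    print(f"{hostname}: OK, resolves to {sorted(addresses)}")
    return True

if __name__ == "__main__":
    results = [resolves_cleanly(h) for h in HOSTNAMES]
    raise SystemExit(0 if all(results) else 1)
```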
Timeline
BST unless otherwise stated
Apr 29, 2022 - Blue stage server upgraded. TIS continued to function normally after this upgrade, which was used to develop a sequence for upgrading the other EC2 instances.
May 5, 2022 10:30 to 15:13 - Green stage and monitoring server were upgraded. The build server was partially upgraded.
May 5, 2022 14:xx - Message to @James Harris querying whether it was possible to do bulk uploads; no indication that they were encountering an issue.
May 5, 2022 17:15 - Paused production monitoring and began applying upgrades to blue production.
May 5, 2022 17:20 - Hit similar issues on the production servers with packages remaining from the images migrated to AWS, but these were sorted out.
May 5, 2022 18:20 - Prod inaccessible while being upgraded.
May 5, 2022 19:04 - Validated that upgrade working on blue server and upgrade of green server began.
May 5, 2022 19:42 - Prod appeared fully accessible with upgraded components; monitoring re-enabled.
May 6, 2022 09:44 - TIS Admin reported, via TIS Support Channel, that they were getting an “Unknown Server Error” when performing bulk uploads.
May 6, 2022 10:47 - Users were informed on TIS Support Channel that we were aware of a bulk upload issue affecting all users and were investigating.
May 6, 2022 12:16 - Users were informed on TIS Support Channel that we had deployed a fix for bulk upload.
May 6, 2022 15:06 - Networking change applied to production and workaround hotfix removed (after validating in stage environment).
May 6, 2022 17:06 to 18:40 - ESR integration services re-enabled and monitored for processing. One container definition required modified networking.
May 9, 2022 11:55 - Verified that there were no files pending export and that files had been produced.
Root Cause(s)
The services were no longer able to access Keycloak via its public URL, which resolved to a loopback address (see the sketch after this list)
The hosts file was not configured correctly (not so much a root cause, but definitely something that needed to be corrected once found)
The Apache/Docker upgrade caused/required some unexpected configuration changes
The change in major OS version (Ubuntu 16.04 to Ubuntu 18.04) appears to have reset the custom DNS settings we were originally using; re-applying these made the apps run as expected.
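To illustrate the first root cause: each service obtains a token from Keycloak before calling other services. Below is a minimal sketch of that token request, assuming an OAuth2 client-credentials grant and using placeholder realm/client/URL values (not the real TIS configuration; older Keycloak versions expose the token endpoint under an /auth prefix). When the public hostname resolved to a loopback address, this request never reached Keycloak, so no token was issued and downstream calls were rejected.

```python
# Minimal sketch of the service-to-service authentication step that failed.
# The base URL, realm and client details are placeholders, not the real TIS values.
import requests

KEYCLOAK_BASE = "https://keycloak.example.nhs.uk"  # public URL the services were configured with
TOKEN_URL = f"{KEYCLOAK_BASE}/auth/realms/example-realm/protocol/openid-connect/token"

def get_service_token(client_id: str, client_secret: str) -> str:
    """Obtain an access token via the OAuth2 client-credentials grant."""
    response = requests.post(
        TOKEN_URL,
        data={
            "grant_type": "client_credentials",
            "client_id": client_id,
            "client_secret": client_secret,
        },
        timeout=10,
    )
    response.raise_for_status()
    return response.json()["access_token"]

# When the hosts file mapped the public hostname to 127.0.0.1, this POST was sent
# to the local container instead of Keycloak and failed, so no token was issued
# and any subsequent calls to other microservices were denied.
```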
Action Items
Action Items | Owner |
---|---|
Add playbook/terraform config to fix DNS resolver | @John Simmons (Deactivated) |
Move ESR applications off to ECS ASAP | @John Simmons (Deactivated) @Joseph (Pepe) Kelly |
Check app records for ESR 7700 (not recorded as exported in TIS) → 3408 (which have been reconciled at some point) → 1,304 (Approved, deduplicated) → → 11 (generated with a “no-change” update) | @Joseph (Pepe) Kelly |
Check ESR and TIS applications are using addresses that can be resolved no matter what platform they are using | @John Simmons (Deactivated) |
Lessons Learned
Service Dependencies:
On major OS upgrades, build a new server from scratch and test it all in place, then replace the old live instances with the new ones
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213