2017-09-28 Keycloak database backup failed
Date | 2017-09-28 |
Authors | Graham O'Regan (Unlicensed) |
Status | Complete |
Summary | A fault in the database server forced us to restore the data from backups. The application back ups worked as expected but their was a problem with the Keycloak database which prevented users logging in. |
Impact | Users could not login to the applications. |
Root Cause
The mysql.event table was corrupt which prevented updates to the databases from working correctly. Attempts to fix the table failed so the database server was rebuilt and the application backups were restored. There was a problem with the Keycloak database backup which appeared to restore without raising any errors but the application couldn't match client credentials which prevented users from logging in.
Trigger
We wanted to run the Consolidated ETL against TCS in production against the new National Dataware House SQL Server. We agreed to clear the existing data from the TCS database to ensure that the matched the new mirror exactly so the application was stopped, the database was dropped and recreated and the application was restarted. As the application booted up, it attempted to run the Flyway migrations to recreate the tables but the MySQL server through an error because of the corrupt event table.
Resolution
First of all, the database server was rebuilt and the data restored from backups. Once that was complete, the Keycloak database from the old platform was used to restore the new platform and the application worked as expected.
Detection
Trying to restart the TCS application after clearing its database threw the exception raised by the database server and the application failed to start.
Action Items
Action Item | Type | Owner | Issue |
---|---|---|---|
Database backups should be changed to backup each database individually | mitigate | Fayaz Abdul (Unlicensed) | |
Backups should be tested nightly | mitigate | Fayaz Abdul (Unlicensed) Chris Mills (Unlicensed) | |
Application healthchecks should execute requests against all dependent services | mitigate | Alex Dobre (Unlicensed) |
Timeline
Sept 27th
15:30 Graham O'Regan (Unlicensed) Attempted to clear the TCS database and restart the application but it failed to start due to the currupt event table.
15:54 Graham O'Regan (Unlicensed) Let #dev know that he planned to restart the server out of hours
20:00 Graham O'Regan (Unlicensed) ran consistency checks and repairs against the database and restarted the process. The docker daemon on the application server was also restarted to refresh the application database connection pools.
22:00 After various attempts to drop individual databases, the database server was rebuilt and the data restored from backups.
23:00 Once the database was back up, the Revalidation and Concerns apps were tested to confirm that they were working.
23:05 The consolidation ETL was started, it failed quickly with authentication exceptions so an attempt was made to resolve the issue
23:45 The applications were left running but without the consolidated data for TCS only.
Sept 28th
5:30 The Revalidation ETL apptempted to run but failed due to authentication issues.
9:26 Alex Dobre (Unlicensed) confirmed that the requests were failing on the keycloak calls so that database.
10:02 Fayaz Abdul (Unlicensed) started a restore of the keycloak database from the old database server that had been decommisioned at the end of the previous week.
10:09 Fayaz Abdul (Unlicensed) confirmed that users could login
10:17 Graham O'Regan (Unlicensed) confirmed that the keycloak admin account was wroking correctly.
Supporting Information
graham [2:07 PM] k, i did that, just checking it now @fayaz don’t worry, i’m working on it fayaz [2:07 PM] @graham - are you talking about tcs graham [2:08 PM] yup emanuele [2:32 PM] @reubs re-deployed on dev, login screens now refresh every 270 seconds reubs [2:34 PM] brill thanks, lets see if that works around the login action timeout :slightly_smiling_face: graham [2:50 PM] @alex.dobre @apringle @reubs i need to restart the database process to complete the tcs work. i don’t want to do that during office hours so i’ll run the update tonight and kick off the etl once it has cleaned up. apringle [2:50 PM] ok reubs [2:59 PM] cool [2:59] Ben has just confirmed Hicom are renabling Intrepid access across local teams btw srochani [3:06 PM] @emanuele ProgrammeRotationMembership is same table like ProgrammeMembership except additional field rotation, we have already rotation field in ProgrammeMembership which can be null, I think no need to implement the ProgrammeRotationMembership on server side. @ify-oo, @russ, @apringle, please correct me if I’m wrong emanuele [3:07 PM] @srochani mmm not sure. What I basically need is a list of options to fill the rotation dropdown for ProgrammeMembership apringle [3:09 PM] do we know if this data is in DR2? russ [3:12 PM] uploaded this image: [vwProgrammeRotationMembership] DR v1 Add Comment russ [3:14 PM] uploaded and commented on this image: [vwProgrammeRotationMembership] DR v2 1 Comment @apringle data exists in DR v2 also @emanuele you can get these values using this select; select rotation, count(*) From [HEELIVE].[vwProgrammeRotationMembership] as prm group by rotation apringle [3:15 PM] these rotations are local and will only need to be visible to users from the relevant local office. emanuele [3:20 PM] @russ can’t do that, I need an endpoint :wink: fayaz [3:24 PM] @here: will be rolling out all the backport changes to jenkins tonight, jenkins might be unavailable for few mins during that time. Also will be testing the creation of disk snapshots. graham [3:54 PM] @here i’ll be restarting the live DB servers tonight, I may have to restart the apps to pick up the changes too. Once that is done I’ll rerun the consolidation ETL against the new DB server fayaz [3:56 PM] @here: https://hee-tis.atlassian.net/wiki/spaces/TISDEV/pages/105873485/Terraform fayaz [4:17 PM] @chrism: https://github.com/Health-Education-England/TIS-DEVOPS/pull/36 jenkins APP [4:19 PM] ---------------- devops - #1162 Failure after 1.4 sec (Open) chrism [6:06 PM] I fixed the above for you....by ignoring that test. fayaz [6:14 PM] ha ha let me take a look chrism [6:47 PM] Its just looking for dev2 then does != against the vars from dev. [6:47] So it would error even if we pointed it at old-dev2 fayaz [6:57 PM] its fine [6:57] @graham - I ran the jenkins playbook, you can carry on with your tests ----- Today September 28th, 2017 ----- jenkins APP [5:30 AM] ---------------- intrepid-reval-etl-all-prod - #345 Failure after 2.6 sec (Open) graham [7:27 AM] @channel all calls on prod are failing because of a 401 calling profile service, anyone know what has started causing that? reubs [8:22 AM] Did you deploy KC this morning? graham [8:28 AM] no, too many things not working :disappointed: [8:28] the apps are up but the ETLs can’t run chrism [8:31 AM] The same service I was trying to call on dev? Sunil said something about the token expiring soon possibly so could check that (edited) graham [8:33 AM] no, there is a hack in place where the http clients populate the headers that openid connect would add. some of the clients don’t have it so they are failing :confused: [8:33] the token isn’t validated properly when the hack is used to the presense of a valid token is all that is required chrism [8:34 AM] Oidc_access_token :face_with_rolling_eyes: [8:34] Im in soon so anything i can do to help let me know graham [8:36 AM] trying to work out what the best thing is to do, not sure if some of the services should be rolled back because this has only just started happening or whether a quick proxy between services to add the headers would be the quickest hack to get it back up and running [8:37] we could copy the developer’s nginx settings and run sidecar container [8:37] would mean changing the docker-compose files to make the new container look like the resource server [8:37] brb :coffee: needed chrism [8:41 AM] Just add the headers to the gateway temp till we find the problem [8:42] That way they'll run. Only issue would be sec if someone went and looked at those headers [8:44] On the proxy reverse add header add/requestheader set? apringle [8:54 AM] @graham @alex.dobre have you seen Andys message in reval? reubs [9:00 AM] Logins failing basically, will be the same issue I assume graham [9:04 AM] not related, my login is working ok [9:06] @channel can everyone else login ok? reubs [9:07 AM] no my prod account says the same thing [9:07] invalid username or password graham [9:08 AM] k, i’m loking into it reubs [9:08 AM] thanks [9:08] must be related surely? graham [9:08 AM] looks like this is an issue with the database failure yesterday [9:08] nope, not the same reubs [9:08 AM] ok [9:09] my account on prod is 'reubs' graham [9:10 AM] we’re going to ahve to tell the admins not to use the app until we’re finished [9:10] can you message on #general? [9:10] @reubs ^ reubs [9:11 AM] I think #revalidation more likely to reach them, and Alistair with an email [9:11] I'll coord [9:11] @apringle - quick call? apringle [9:11 AM] ok alex.dobre [9:20 AM] yes I see all logins are failing [9:20] looking into it graham [9:20 AM] @chrism the only prob with that is that the apps use docker networks to connect to the process directly, they aren’t going back thru the api-gateway currently [9:22] @alex.dobre i’m looking at the keycloak issue now, it isn’t letting all users in but i’ve been able to login ok [9:22] i’ll let you know when i get this fixed up alex.dobre [9:23 AM] ok [9:23] profile service seems to be up but it has not been hit in the past 15 mins reubs [9:28 AM] well if no one can login .... alex.dobre [9:29 AM] this means that requests stop at Keycloak chrism [9:29 AM] They stop at the gateway [9:29] `HTTP: attempt to connect to 127.0.0.1:8087 (localhost) failed` [9:30] `HTTP: failed to make connection to backend: localhost, referer: https://apps.tis.nhs.uk/auth/realms/master/protocol/openid-connect/auth?client_id=security-admin-console&redirect_uri=https%3A%2F%2Fapps.tis.nhs.uk%2Fauth%2Fadmin%2Fmaster%2Fconsole%2F&state=1be8f10b-7845-4e83-8d9c-304263c4680a&nonce=0c5653e0-38e4-4a05-bf6e-7a2dd0220672&response_mode=fragment&response_type=code&scope=openid` alex.dobre [9:31 AM] @graham Now I see the logs of gmc sync - it could not connect to KeyCloak to get it's token so it failed [9:31] HTTP 401 Unauthorized -> org.keycloak.admin.client.token.TokenManager.grantToken(TokenManager.java:89) [9:32] last time I ran it yesterday evening and was ok graham [9:33 AM] k, hoepfully this is the root of the issue
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213