2021-05-10 Intermittent 'Service unavailable' messages

Date

10 May 2021

Authors

@Reuben Roberts @John Simmons (Deactivated)

Status

Done

Summary

https://hee-tis.atlassian.net/browse/TIS21-1555

Impact

Some users received a ‘Service unavailable’ message on TIS, preventing them from accessing any of the site functionality. Clearing cookies or the browser cache, or simply waiting a few minutes, resolved the issue.

Non-technical Description

Access to TIS failed for some users, some of the time, with the browser message ‘Service unavailable’.

TIS recovered without intervention a few minutes later, and clearing the browser cache also restored access. The problem was inadvertently caused by infrastructure changes that amended the configuration, making it possible for a user to get into a state that blocked logging in.

Rather than making an immediate change to restore access for the affected users, we made a “No Downtime” change, taking particular care not to disrupt access for the majority of users who were using TIS at the time. We are taking steps to prevent future infrastructure amendments from triggering this issue.


Trigger

  • The api-gateway playbook was run, which removed the manually amended OIDCStateMaxNumberOfCookies setting.


Detection

  • Teams notifications from users from 7 May 2021 16:17 onwards.

  • In-team Slack messages from 8 May 2021 01:20 and later that morning.


Resolution

  • Reset of OIDC (OpenID Connect) on both production servers.


Timeline

  • 7 May 2021 16:17 User reports on Teams that TIS is giving a ‘Service unavailable’ error

  • 8 May 2021 01:20 Marcello reports noticing the issue while checking that the nightly sync job has completed successfully

  • 10 May 2021 08:48 Various user reports of the same issue on Teams

  • 10 May 2021 10:19 HEE-TIS-VM-PROD-APPS-GREEN removed from EC2 load balancing cluster after team noticed that it was not logging correctly

  • 10 May 2021 10:24 HEE-TIS-VM-PROD-APPS-GREEN rebooted

  • 10 May 2021 10:33 HEE-TIS-VM-PROD-APPS-GREEN added back to EC2 load balancing cluster

  • 10 May 2021 10:46 HEE-TIS-VM-PROD-APPS-GREEN docker logging observed (this was a tangential issue, as it turned out)

  • 10 May 2021 10:50 HEE-TIS-VM-PROD-APPS-GREEN cookie policy updated to apply a missing change that was suspected to be the cause of the ‘Service unavailable’ error

  • 10 May 2021 10:51 HEE-TIS-VM-PROD-APPS-BLUE removed from EC2 load balancing cluster to force all traffic to the updated server so we could check the fix had worked

  • 10 May 2021 12:00 Verified that clearing cookies worked to resolve the issue for a user

  • 10 May 2021 12:50 OIDC configuration reset to match that of staging environment

  • 10 May 2021 12:53 HEE-TIS-VM-PROD-APPS-BLUE added back to EC2 load balancing cluster and started serving requests again

Root Cause(s)

  • Users were seeing a ‘Service unavailable’ error from the Apache web server.

  • Logs showed that Apache was rejecting user requests because the user had too many session authentication tokens (state cookies), for example: [Fri May 07 14:32:15.221392 2021] [auth_openidc:warn] [pid 5526:tid 139955393259264] [client 208.127.198.60:10332] oidc_authorization_request_set_cookie: the number of existing, valid state cookies (1) has exceeded the limit (1), no additional authorization request + state cookie can be generated, aborting the request

  • Apache was configured to allow one token, but inspection of the user's machine showed that they had three tokens.

  • The number of tokens arose from multiple simultaneous authentication attempts.

  • A configuration change was rolled out just prior to the issue being observed.

  • The limit on tokens is set in the API Gateway configuration with OIDCStateMaxNumberOfCookies 1 true (‘true’ flushes out any excess tokens). This setting had to be added manually because the infrastructure configuration tool (Ansible) couldn’t cope with it. After the setting was lost, there were at least two ways users could end up with multiple tokens (cookies):

    • Multiple logins across different browser sessions (within the same browser) would create multiple cookies.

    • If the user’s session expired due to inactivity, and the user then logged in again, this new log-in would also create a duplicate cookie.
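The effect of the lost setting can be illustrated with a small model. This is only a sketch of the behaviour described above, not the module's actual implementation: the class and method names (StateCookieJar, start_login) are hypothetical, and it assumes the limit stayed at 1 while the ‘true’ flush flag was lost, which is consistent with the logged limit (1) but is not confirmed in this report.

```python
from dataclasses import dataclass, field


@dataclass
class StateCookieJar:
    """Simplified, hypothetical model of one browser's OIDC state cookies."""
    max_cookies: int      # first argument of OIDCStateMaxNumberOfCookies
    delete_oldest: bool   # optional second argument ('true' in the intended setting)
    cookies: list = field(default_factory=list)

    def start_login(self, state_id: str) -> bool:
        """Try to start a new authorization request; return False if it is refused."""
        if len(self.cookies) >= self.max_cookies:
            if not self.delete_oldest:
                # Corresponds to the logged warning: limit reached, request aborted.
                return False
            # Flush the oldest cookies so the new one fits within the limit.
            excess = len(self.cookies) - self.max_cookies + 1
            del self.cookies[:excess]
        self.cookies.append(state_id)
        return True


# Assumed state after the playbook ran: limit of 1, no flushing.
without_flush = StateCookieJar(max_cookies=1, delete_oldest=False)
print(without_flush.start_login("first-login"))   # accepted
print(without_flush.start_login("second-login"))  # refused, user sees the error

# Intended setting: limit of 1, oldest cookie flushed on a new login.
with_flush = StateCookieJar(max_cookies=1, delete_oldest=True)
print(with_flush.start_login("first-login"))
print(with_flush.start_login("second-login"))
```

In this model a second login attempt is refused once the limit is reached unless the flush flag is set, which matches both routes to duplicate cookies listed above.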


Action Items

  • Investigate Ansible upgrade / recheck current version to permit full OIDCStateMaxNumberOfCookies configuration without manual changes required

    • Owner: @John Simmons (Deactivated)

    • Ticket ref: https://hee-tis.atlassian.net/browse/TIS21-1559

  • Add comment to Ansible script to highlight any required manual amendments

    • Owner: @John Simmons (Deactivated)

    • Ticket ref: https://hee-tis.atlassian.net/browse/TIS21-1560

  • Check NI Apache configuration template for consistency (OAuth2.conf.j2)

    • Owner: @John Simmons (Deactivated)

    • Ticket ref: https://hee-tis.atlassian.net/browse/TIS21-1561
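
Until the setting can be managed fully by Ansible, one possible safeguard is an automated check that the deployed Apache configuration still contains the manually added directive. The sketch below is an assumption about how such a check might look, not an existing TIS script: the function name check_state_cookie_setting is hypothetical, and the expected arguments (1 true) are taken from the root cause description in this report.

```python
DIRECTIVE = "OIDCStateMaxNumberOfCookies"
EXPECTED_ARGS = ("1", "true")  # value taken from this report


def check_state_cookie_setting(config_text: str) -> list:
    """Return a list of problems; an empty list means the setting looks as intended."""
    problems = []
    found = False
    for line_no, line in enumerate(config_text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comment lines
        tokens = stripped.split()
        if tokens[0].lower() != DIRECTIVE.lower():
            continue
        found = True
        args = tuple(token.lower() for token in tokens[1:])
        if args != EXPECTED_ARGS:
            problems.append(
                f"line {line_no}: expected '{' '.join(EXPECTED_ARGS)}', "
                f"found '{' '.join(args)}'"
            )
    if not found:
        problems.append(f"{DIRECTIVE} is not set")
    return problems


# Example with a hypothetical configuration fragment that has lost the 'true' flag.
sample = "# OIDC settings\nOIDCStateMaxNumberOfCookies 1\n"
for problem in check_state_cookie_setting(sample):
    print(problem)
```

A check like this could be run against the rendered configuration after each playbook run, so that a lost manual amendment is noticed before users report it.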


Lessons Learned

  • Not all infrastructure-as-code is coded

  • It is not always possible to be certain that a problem will not arise again; the effort of preventing a recurrence needs to be weighed against the risk of not doing so

  • TIS Team to check Teams for evidence of problems more frequently (especially around busy periods of the year)

  • Follow-up actions do need to be carried out, or problems will recur and solutions will sometimes need to be rediscovered