2022-11-28 TSS unable to send SMS messages

Date

Nov 28, 2022

Authors

@Reuben Roberts

Status

Documenting

Summary

The AWS monthly SMS spend limit was exceeded, so no codes were sent for SMS MFA or phone number verification.

Impact

TSS users were unable to sign up/in for approximately 24 hours (226 users affected, with 866 attempts total)

Non-technical Description

SMS messages are sent for two actions:

  • Verifying a phone number during SMS MFA setup

  • Signing in with SMS MFA

A monthly spend cap must be set on our account; this was previously raised to $300 per month. However, we recently changed our SMS configuration to send from eu-west-2 (London) instead of eu-west-1. The London region was misconfigured with a $100 limit. We exceeded that limit, and SMS messages could no longer be sent.

As a result, the two actions noted above were not possible until the limit was raised. Approximate impact:

  • 24 hours of SMS downtime

  • 226 users (based on phone number)

  • 866 total failed SMS messages


Trigger

  • SMS limit exceeded (due to the change of SMS region, where a low default limit was in place)


Detection

  • Identified when a TIS team member was unable to sign in to TSS


Resolution

  • Increase monthly SMS limit from $100 to $300 for eu-west-2


Timeline

GMT unless otherwise stated

  • Nov 28, 2022 12:31 - SMS $100 limit exceeded, messages no longer being sent. No alarm was raised because the alarm was configured to only trigger when monthly spend reached 90% of $300, i.e. $270.

  • Nov 29, 2022 11:59 - Issue noticed by TSS team

  • Nov 29, 2022 12:08 - Manual switch to use eu-west-1 region for SMS while request raised with AWS Support to increase the limit on eu-west-2

  • Nov 29, 2022 12:20 - Limit increased to $300 on eu-west-2

  • Nov 29, 2022 12:21 - Manual revert to once again use eu-west-2 region for SMS
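The "approximately 24 hours" figure can be checked against the timeline with a little arithmetic. This is only a sketch: it assumes SMS delivery resumed at the 12:08 switch to eu-west-1, which the timeline implies but does not state outright, and the variable names are illustrative.

```python
from datetime import datetime

# Timestamps taken from the timeline above (GMT).
limit_exceeded = datetime(2022, 11, 28, 12, 31)
switched_to_eu_west_1 = datetime(2022, 11, 29, 12, 8)   # assumed point of recovery
limit_raised = datetime(2022, 11, 29, 12, 20)

# Outage length if the region switch restored service.
outage_until_switch = switched_to_eu_west_1 - limit_exceeded

# Upper bound: time until the eu-west-2 limit itself was raised.
outage_until_limit_raised = limit_raised - limit_exceeded

print(outage_until_switch)        # 23:37:00
print(outage_until_limit_raised)  # 23:49:00
```

Either way the outage was a little under 24 hours, which is consistent with the impact statement.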


Root Cause(s)

  • The SMS costs exceeded the configured limits

    • The limit was not set appropriately

      • The eu-west-2 region had not had the same SMS spend limit applied as in eu-west-1

    • The alert for reaching 90% of the limit did not trigger

      • This would only trigger when we reached $270 spend, since it was based on an assumed limit of $300, not the $100 that actually applied

    • The switch from eu-west-1 to eu-west-2 for SMS was not thoroughly checked

      • SMS limit and alarm configuration were not included in the package of work
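The alarm gap described above comes down to a threshold derived from the wrong limit. The sketch below illustrates the arithmetic only; the function names are hypothetical and this is not the team's actual alarm configuration. It shows that an alarm at 90% of an assumed $300 limit can never fire before a $100 cap is hit, while one derived from the real limit would.

```python
from decimal import Decimal

# 90% alarm fraction, as described in the timeline.
ALARM_FRACTION = Decimal("0.9")


def alarm_threshold(monthly_limit_usd, fraction=ALARM_FRACTION):
    """Return the spend (USD) at which the alarm should trigger."""
    return Decimal(str(monthly_limit_usd)) * fraction


def alarm_fires_before_cap(threshold_usd, actual_limit_usd):
    """True only if the alarm threshold sits below the limit actually enforced."""
    return Decimal(str(threshold_usd)) < Decimal(str(actual_limit_usd))


# Threshold based on the assumed $300 limit (what was configured).
assumed_threshold = alarm_threshold(300)   # 270
# Threshold based on the $100 limit that actually applied in eu-west-2.
correct_threshold = alarm_threshold(100)   # 90

# With the real cap at $100, the $270 alarm is unreachable.
configured_alarm_useful = alarm_fires_before_cap(assumed_threshold, 100)  # False
derived_alarm_useful = alarm_fires_before_cap(correct_threshold, 100)     # True
```

Deriving the threshold from the same value that sets the limit, rather than hard-coding both, would keep the two from drifting apart when a region changes.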


Action Items

Action Items

Owner

Terraform eu-west-2 SMS config and SMS limit alarm

https://github.com/Health-Education-England/TIS-OPS/pull/526

Populate this Epic placeholder with steps and stories (maybe do a FeatureMap exercise?)

https://hee-tis.atlassian.net/browse/TIS21-3764


Lessons Learned

  • Terraform first

  • Test every change, even if you think it's identical to the previous one
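One way to act on these lessons is a pre-switch check that compares the SMS spend limit across regions. The following is a minimal sketch, assuming SMS is sent through Amazon SNS and that the limit is exposed as the `MonthlySpendLimit` SMS attribute; `fetch_sns_spend_limit` and `find_low_limit_regions` are hypothetical helper names, not part of any existing TIS tooling.

```python
from decimal import Decimal


def fetch_sns_spend_limit(region):
    """Read the SNS monthly SMS spend limit (USD) for one region.

    Assumption: the attribute is present in the response; a missing key
    raises KeyError so an unset limit cannot pass silently.
    """
    import boto3  # third-party; imported lazily so the comparison logic runs without it

    client = boto3.client("sns", region_name=region)
    response = client.get_sms_attributes(attributes=["MonthlySpendLimit"])
    return Decimal(response["attributes"]["MonthlySpendLimit"])


def find_low_limit_regions(regions, expected_limit_usd, fetch_limit=fetch_sns_spend_limit):
    """Return {region: limit} for every region whose limit is below the expected value."""
    limits = {region: fetch_limit(region) for region in regions}
    return {region: limit for region, limit in limits.items() if limit < expected_limit_usd}


# Example with stubbed values matching this incident (no AWS call is made).
incident_limits = {"eu-west-1": 300, "eu-west-2": 100}
low = find_low_limit_regions(
    ["eu-west-1", "eu-west-2"],
    expected_limit_usd=300,
    fetch_limit=lambda region: incident_limits[region],
)
# low == {"eu-west-2": 100}
```

Passing the fetcher as a parameter keeps the comparison testable without AWS credentials; in real use the default fetcher would query each region directly.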