2022-07-10 TSS unable to send SMS messages

Date

Jun 28, 2022

Authors

@Andy Dingley

Status

Done

Summary

The AWS monthly SMS spend limit was exceeded, resulting in no codes being sent for SMS MFA or phone number verification.

Impact

TSS users were unable to sign up or sign in for approximately 10 hours (96 users affected, with 307 failed SMS messages in total).

Non-technical Description

SMS messages are sent for two actions:

  • Verifying a phone number during SMS MFA setup

  • Signing in with SMS MFA

A monthly spend cap must be set on our account. It had previously been raised to $200 per month, but on this occasion we exceeded that limit and SMS messages could no longer be sent.

As a result, the two actions noted above were not possible until the limit was raised. Approximate impact:

  • 10 hours of SMS downtime

  • 96 users (based on phone number)

  • 307 total failed SMS messages


Trigger

  • SMS limit exceeded

Detection

  • Identified when a TIS team member was unable to sign in to TSS


Resolution

  • Increase monthly SMS limit from $200 to $300

  • Modify the monitoring to alert at $270
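
The new alert threshold is 90% of the raised cap (0.9 × $300 = $270). A minimal sketch of how that threshold and a matching alarm definition might be expressed, assuming SMS is sent through Amazon SNS and spend is tracked by the `SMSMonthToDateSpentUSD` CloudWatch metric. The alarm name, period and helper functions are hypothetical, not the team's actual configuration:

```python
from decimal import Decimal


def alert_threshold(monthly_limit_usd: Decimal,
                    fraction: Decimal = Decimal("0.9")) -> Decimal:
    """Return the spend (USD) at which the alarm should fire."""
    return monthly_limit_usd * fraction


def sms_spend_alarm_params(monthly_limit_usd: Decimal) -> dict:
    """Build keyword arguments for CloudWatch put_metric_alarm.

    Assumption: SMS is sent via Amazon SNS, which publishes the
    SMSMonthToDateSpentUSD metric in the AWS/SNS namespace.
    """
    return {
        "AlarmName": "sms-monthly-spend-90-percent",  # hypothetical name
        "Namespace": "AWS/SNS",
        "MetricName": "SMSMonthToDateSpentUSD",
        "Statistic": "Maximum",
        "Period": 3600,  # hypothetical: evaluate hourly
        "EvaluationPeriods": 1,
        "Threshold": float(alert_threshold(monthly_limit_usd)),
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    }


params = sms_spend_alarm_params(Decimal("300"))
print(params["Threshold"])  # 270.0
```

With boto3, these parameters could be applied via `cloudwatch.put_metric_alarm(**params)`, and the cap itself via `sns.set_sms_attributes(attributes={"MonthlySpendLimit": "300"})`. Deriving the threshold from the limit means the two cannot drift apart when the cap is next changed.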


Timeline

BST unless otherwise stated

  • Jun 25, 2022 01:10 - Alert sent to #monitoring channel on Slack that we had reached 90% of our SMS limit

  • Jun 27, 2022 10:00 - Daily alarm reminder sent to #monitoring channel

  • Jun 28, 2022 10:00 - Daily alarm reminder sent to #monitoring channel

  • Jun 28, 2022 12:59 - SMS limit exceeded, messages no longer being sent

  • Jun 28, 2022 20:17 - TIS dev team member unable to sign in using SMS MFA - no message received, thought to be a phone verification issue

  • Jun 28, 2022 22:22 - TIS dev team member unable to verify phone number to set up SMS MFA - problem identified as SMS limit

  • Jun 28, 2022 22:37 - SMS limit increased to $300 - verified able to receive messages again

  • Jun 28, 2022 22:41 - All clear notification on #monitoring Slack channel
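
The intervals implied by the timeline can be checked with a short calculation. This is only a sketch using the timestamps listed above; the variable names are illustrative:

```python
from datetime import datetime

FMT = "%b %d, %Y %H:%M"

# Timestamps copied from the timeline above (all BST).
first_alert = datetime.strptime("Jun 25, 2022 01:10", FMT)
limit_exceeded = datetime.strptime("Jun 28, 2022 12:59", FMT)
cause_identified = datetime.strptime("Jun 28, 2022 22:22", FMT)
limit_raised = datetime.strptime("Jun 28, 2022 22:37", FMT)

warning_window = limit_exceeded - first_alert      # time between 90% alert and breach
time_to_identify = cause_identified - limit_exceeded
outage = limit_raised - limit_exceeded

print(warning_window)    # 3 days, 11:49:00
print(time_to_identify)  # 9:23:00
print(outage)            # 9:38:00
```

The outage of 9 hours 38 minutes matches the "approximately 10 hours" quoted in the impact, and the 90% alert gave roughly three and a half days of warning before the limit was reached.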


Root Cause(s)

  • The SMS costs exceeded the configured limits

    • The limit was not set appropriately

      • Data-based limits not yet set; still using guesswork

    • The alert for reaching 90% of the limit was not seen

      • Noisy #monitoring channel due to tis-log-size alerts

        • Placement sync log spam

        • Prod → Stage DB sync log spam

        • Alarm not configured appropriately - e.g. using more datapoints would help avoid the yo-yoing of the alert status, which causes Slack spam

    • The daily CloudWatch alarm reminders were not checked/actioned

      • Only 3 alarms are visible in the Slack notice (followed by an ‘and X more…’ message), and these tend to be the usual tis-trainee-sync DLQ errors, which are often ignored

      • The SMS alarm would only be seen by clicking ‘and X more…’ to visit the CloudWatch page directly.
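
The point about using more datapoints can be illustrated with a small simulation. This is a simplified model of CloudWatch's "M out of N" datapoints-to-alarm evaluation, not the actual tis-log-size alarm; the metric values and threshold are made up for illustration:

```python
def alarm_states(values, threshold, datapoints_to_alarm, evaluation_periods):
    """Return ALARM/OK per datapoint using a simplified M-out-of-N rule.

    The alarm is in ALARM when at least `datapoints_to_alarm` of the most
    recent `evaluation_periods` values breach the threshold. Early windows
    with fewer values are evaluated as-is (a simplification).
    """
    states = []
    for i in range(len(values)):
        window = values[max(0, i - evaluation_periods + 1): i + 1]
        breaching = sum(1 for v in window if v > threshold)
        states.append("ALARM" if breaching >= datapoints_to_alarm else "OK")
    return states


def transitions(states):
    """Count state changes; each one would produce a Slack notification."""
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)


# Made-up metric: hovers around a threshold of 100, then breaches for good.
metric = [90, 110, 95, 105, 98, 120, 92, 130, 140, 150]

noisy = alarm_states(metric, 100, datapoints_to_alarm=1, evaluation_periods=1)
calmer = alarm_states(metric, 100, datapoints_to_alarm=3, evaluation_periods=3)

print(transitions(noisy))   # 7
print(transitions(calmer))  # 1
```

In this sketch the single-datapoint alarm changes state seven times, while the 3-out-of-3 alarm fires once, when the breach is sustained. The trade-off is that the alarm reacts later, so the window size would need to suit each metric.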

 


Action Items

  • Set up more appropriate SMS limits (https://hee-tis.atlassian.net/browse/TIS21-3108)

  • Trial incorporating a review of daily alarm reminders in standup

 


Lessons Learned

  • Monitoring alerts are useless unless they are noticed/actioned