Date

Authors

Andy Dingley

Status

Done

Summary

The AWS monthly SMS spend limit was exceeded, resulting in no codes being sent for SMS MFA or phone number verification.

Impact

TSS users were unable to sign up or sign in for approximately 10 hours (96 users affected, with 307 failed SMS messages in total).

Non-technical Description

SMS messages are sent for two actions:

  • Verifying a phone number during SMS MFA setup

  • Signing in with SMS MFA

A monthly spend cap must be set on our account. It had previously been raised to $200 per month, but on this occasion we exceeded that limit and SMS messages could no longer be sent.

As a result, the two actions noted above were not possible until the limit was raised. Approximate impact:

  • 10 hours of SMS downtime

  • 96 users (based on phone number)

  • 307 total failed SMS messages


Trigger

  • SMS limit exceeded

Detection

  • Identified when a TIS team member was unable to sign in to TSS


Resolution

  • Increase monthly SMS limit from $200 to $300

  • Modify the monitoring to alert at $270
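
The new alert level keeps the existing 90% convention: 90% of $300 is $270. As a minimal sketch, assuming the SMS messages are sent through Amazon SNS and that the alarm watches the `SMSMonthToDateSpentUSD` metric in the `AWS/SNS` namespace (neither is stated in this report), the alarm threshold could be derived from the limit so the two do not drift apart. The helper name, alarm name, and period are hypothetical. The dictionary is shaped like the keyword arguments of boto3's CloudWatch `put_metric_alarm`, but nothing here calls AWS.

```python
from decimal import Decimal


def sms_spend_alarm_params(monthly_limit_usd, alert_fraction=Decimal("0.9")):
    """Build alarm settings from the monthly SMS spend limit (hypothetical helper)."""
    # Decimal avoids floating-point surprises when multiplying currency values.
    threshold = Decimal(monthly_limit_usd) * alert_fraction
    return {
        "AlarmName": "sms-monthly-spend-90-percent",  # hypothetical name
        "Namespace": "AWS/SNS",  # assumption: SMS is sent via SNS
        "MetricName": "SMSMonthToDateSpentUSD",  # assumption, see lead-in
        "Statistic": "Maximum",
        "Period": 3600,  # seconds; illustrative value only
        "EvaluationPeriods": 1,
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    }


params = sms_spend_alarm_params(300)
print(params["Threshold"])
```

Deriving the threshold this way means a future limit change only needs one number to be updated.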


Timeline

BST unless otherwise stated

  • 01:10 - Alert sent to #monitoring channel on Slack that we had reached 90% of our SMS limit

  • 10:00 - Daily alarm reminder sent to #monitoring channel

  • 10:00 - Daily alarm reminder sent to #monitoring channel

  • 12:59 - SMS limit exceeded, messages no longer being sent

  • 20:17 - TIS dev team member unable to sign in using SMS MFA - no message received, thought to be a phone verification issue

  • 22:22 - TIS dev team member unable to verify phone number to set up SMS MFA - problem identified as SMS limit

  • 22:37 - SMS limit increased to $300 - verified able to receive messages again

  • 22:41 - All clear notification on #monitoring Slack channel
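
The intervals implied by the timeline can be worked out as below. This is a small sketch that assumes the 12:59, 20:17, 22:22 and 22:37 entries all fall on the same day, which is consistent with the "approximately 10 hours" figure in the Impact section. The function and variable names are illustrative.

```python
from datetime import datetime

FMT = "%H:%M"


def minutes_between(start, end):
    """Minutes from start to end, both given as 'HH:MM' on the same day."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


outage = minutes_between("12:59", "22:37")      # limit exceeded -> limit raised
undetected = minutes_between("12:59", "20:17")  # limit exceeded -> first failed sign-in noticed
diagnosis = minutes_between("20:17", "22:22")   # first symptom -> cause identified

for label, mins in [("outage", outage), ("undetected", undetected), ("diagnosis", diagnosis)]:
    print(f"{label}: {mins // 60}h {mins % 60}m")
```

Under that assumption the outage lasted 9h 38m, of which 7h 18m passed before anyone noticed a failed sign-in.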


Root Cause(s)

  • The SMS costs exceeded the configured limits

    • The limit was not set appropriately

      • Data-based limits not yet set; still using guesswork

    • The alert for reaching 90% of the limit was not seen

      • Noisy #monitoring channel due to tis-log-size alerts

        • Placement sync log spam

        • Prod → Stage DB sync log spam

        • Alarm not configured appropriately - e.g. using more datapoints would help avoid the alert status flipping back and forth, which causes Slack spam

    • The daily CloudWatch alarm reminders were not checked/actioned

      • Only 3 alarms are visible in the Slack notice (plus an ‘and X more…’ message), and these tend to be the usual tis-trainee-sync DLQ errors, which are often ignored

      • The SMS alarm would only be seen by clicking ‘and X more…’ to visit the CloudWatch page directly.
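
The point above about using more datapoints can be illustrated with a small simulation. This is a sketch only: the sample values, threshold, and function names are made up, and it mimics the general "M out of N datapoints" idea rather than CloudWatch's exact evaluation logic. It counts how often an alarm changes state (each change would be a Slack message) for a metric that hovers around its threshold before breaching it for good.

```python
def alarm_states(values, threshold, datapoints_to_alarm, evaluation_periods):
    """Return the alarm state (True = in alarm) after each datapoint.

    The alarm is on when at least `datapoints_to_alarm` of the most recent
    `evaluation_periods` datapoints are above the threshold.
    """
    states = []
    for i in range(len(values)):
        window = values[max(0, i - evaluation_periods + 1): i + 1]
        breaches = sum(1 for v in window if v > threshold)
        states.append(breaches >= datapoints_to_alarm)
    return states


def count_transitions(states):
    """Count state changes; each one would trigger a notification."""
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)


# Made-up samples: noisy around a threshold of 100, then a sustained breach.
samples = [90, 110, 95, 120, 98, 105, 92, 110, 115, 120, 125, 130]

single = alarm_states(samples, 100, datapoints_to_alarm=1, evaluation_periods=1)
three_of_three = alarm_states(samples, 100, datapoints_to_alarm=3, evaluation_periods=3)

print(count_transitions(single), count_transitions(three_of_three))
```

With these samples the single-datapoint alarm changes state 7 times, while the 3-of-3 alarm changes state once, when the breach is sustained. The trade-off is that the stricter alarm fires later.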


Action Items

  • Set up more appropriate SMS limits (TIS21-3108)

  • Trial incorporating a review of daily alarm reminders in standup


Lessons Learned

  • Monitoring alerts are useless unless they are noticed/actioned
