Date

Authors

Andy Dingley

Status

Done

Summary

The AWS monthly SMS limit was exceeded, resulting in no codes being sent for SMS MFA or phone number verification.

Impact

TSS users were unable to sign up or sign in for approximately 10 hours (96 users affected, 307 failed SMS messages in total).

Non-technical Description

SMS messages are sent for two actions

...

  • 10 hours of SMS downtime

  • 96 users (based on phone number)

  • 307 total failed SMS messages

...

Trigger

  • SMS limit exceeded

Detection

  • Identified when TIS team member was unable to sign in to TSS

...

Resolution

  • Increase monthly SMS limit from $200 to $300

  • Modify the monitoring to alert at $270
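The $270 alert level is 90% of the new $300 limit, matching the 90% alert described in the timeline. A minimal sketch of that arithmetic follows, together with the parameters such an alarm might take. It assumes the SMS messages are sent through Amazon SNS and that the alarm watches the SNS `SMSMonthToDateSpentUSD` metric; the alarm name, period, and helper function names are hypothetical and not taken from the actual TIS configuration.

```python
def alarm_threshold(limit_usd: float, fraction: float = 0.9) -> float:
    """Return the spend (in USD) at which the alert should fire."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return round(limit_usd * fraction, 2)


def build_alarm_params(limit_usd: float, fraction: float = 0.9) -> dict:
    """Build keyword arguments for a CloudWatch PutMetricAlarm request.

    Assumption: SMS spend is reported by Amazon SNS in the AWS/SNS
    namespace. The alarm name and period below are hypothetical.
    """
    return {
        "AlarmName": "sms-monthly-spend",  # hypothetical name
        "Namespace": "AWS/SNS",
        "MetricName": "SMSMonthToDateSpentUSD",
        "Statistic": "Maximum",
        "Period": 3600,  # hypothetical: evaluate hourly
        "EvaluationPeriods": 1,
        "Threshold": alarm_threshold(limit_usd, fraction),
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    }


if __name__ == "__main__":
    print(alarm_threshold(200))  # old limit: 180.0
    print(alarm_threshold(300))  # new limit: 270.0
```

Deriving the threshold from the limit, rather than hard-coding both, means the alert level follows automatically the next time the limit is changed.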

...

Timeline

BST unless otherwise stated

  • 01:10 - Alert sent to #monitoring channel on Slack that we had reached 90% of our SMS limit

  • 10:00 - Daily alarm reminder sent to #monitoring channel

  • 10:00 - Daily alarm reminder sent to #monitoring channel

  • 12:59 - SMS limit exceeded, messages no longer being sent

  • 20:17 - TIS dev team member unable to sign in using SMS MFA - no message received, thought to be a phone verification issue

  • 22:22 - TIS dev team member unable to verify phone number to set up SMS MFA - problem identified as SMS limit

  • 22:37 - SMS limit increased to $300 - verified able to receive messages again

  • 22:41 - All clear notification on #monitoring Slack channel
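The "approximately 10 hours" of downtime can be checked against the timeline. The sketch below assumes the 12:59, 20:17 and 22:37 entries all fall on the same day, which the timeline implies but does not state; the function name is illustrative.

```python
from datetime import datetime
from typing import Tuple


def elapsed(start: str, end: str) -> Tuple[int, int]:
    """Return (hours, minutes) between two same-day HH:MM times."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    total_minutes = int(delta.total_seconds()) // 60
    return divmod(total_minutes, 60)


if __name__ == "__main__":
    # Limit exceeded -> limit increased (total SMS downtime)
    print("downtime: %dh %dm" % elapsed("12:59", "22:37"))  # 9h 38m
    # Limit exceeded -> first failed sign-in noticed
    print("time to detect: %dh %dm" % elapsed("12:59", "20:17"))  # 7h 18m
```

On these assumptions, most of the outage (over 7 hours) passed before anyone noticed a failed sign-in; the fix itself took 15 minutes once the cause was identified at 22:22.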

...

Root Cause(s)

  • The SMS costs exceeded the configured limits

    • The limit was not set appropriately

      • Data-based limits not yet set; still relying on guesswork

    • The alert for reaching 90% of the limit was not seen

      • Noisy #monitoring channel due to tis-log-size alerts

        • Placement sync log spam

        • Prod → Stage DB sync log spam

        • Alarm not configured appropriately, e.g. using more datapoints would help avoid the yo-yoing of the alert status, which causes Slack spam

    • The daily CloudWatch alarm reminders were not checked/actioned

      • Only 3 alarms are visible in the Slack notice (followed by an ‘and X more…’ message), and these tend to be the usual tis-trainee-sync DLQ errors, which are often ignored

      • The SMS alarm would only be seen by clicking ‘and X more…’ to visit the CloudWatch page directly.
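The point about using more datapoints can be illustrated with a simplified simulation of an "M out of N" alarm. This is only a sketch of the idea, not CloudWatch's actual evaluation logic (it ignores missing-data handling, for example), and the metric values and threshold are made up for illustration.

```python
from typing import List


def alarm_states(values: List[float], threshold: float,
                 datapoints_to_alarm: int,
                 evaluation_periods: int) -> List[str]:
    """Return the alarm state after each datapoint.

    The alarm is in ALARM when at least `datapoints_to_alarm` of the
    last `evaluation_periods` datapoints exceed the threshold.
    """
    states = []
    for i in range(len(values)):
        window = values[max(0, i - evaluation_periods + 1): i + 1]
        breaching = sum(1 for v in window if v > threshold)
        states.append("ALARM" if breaching >= datapoints_to_alarm else "OK")
    return states


def count_transitions(states: List[str]) -> int:
    """Count state changes; each one would produce a Slack message."""
    return sum(1 for a, b in zip(states, states[1:]) if a != b)


if __name__ == "__main__":
    # Hypothetical metric hovering around a threshold of 10
    noisy = [5, 12, 6, 13, 7, 14, 8, 6, 5]
    print(count_transitions(alarm_states(noisy, 10, 1, 1)))  # 6
    print(count_transitions(alarm_states(noisy, 10, 3, 3)))  # 0
```

With a single datapoint, every spike above the threshold flips the alarm and every dip flips it back. Requiring three consecutive breaching datapoints suppresses those brief spikes while still alarming on a sustained breach, at the cost of a slower first notification.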

...

Action Items

Action Items

Owner

Set up more appropriate SMS limits

Jira: TIS21-3108

Trial incorporating a review of daily alarm reminders in standup

...

Lessons Learned

  • Monitoring alerts are useless unless they are noticed/actioned