Date | |
Authors | |
Status | Done |
Summary | The AWS monthly SMS spend limit was exceeded, so no codes were sent for SMS MFA or phone number verification |
Impact | TSS users were unable to sign up/in for approximately 10 hours (96 users affected, with 307 failed SMS messages in total) |
Non-technical Description
SMS messages are sent for two actions
Verifying a phone number during SMS MFA setup
Signing in with SMS MFA
A monthly spend cap must be set on our account. It had previously been raised to $200 per month, but on this occasion we exceeded that limit and SMS messages could no longer be sent.
As a result, the two actions noted above were not possible until the limit was raised. Approximate impact:
10 hours of SMS downtime
96 users (based on phone number)
307 total failed SMS messages
Trigger
SMS limit exceeded
Detection
Identified when a TIS team member was unable to sign in to TSS
Resolution
Increase monthly SMS limit from $200 to $300
Modify the monitoring to alert at $270
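As an illustration of how the alert threshold relates to the limit, here is a minimal sketch that derives the alarm value from the spend limit and builds the parameters for a CloudWatch alarm. It assumes the messages are sent through Amazon SNS, which publishes the `SMSMonthToDateSpentUSD` metric in the `AWS/SNS` namespace. The function name, alarm name, topic ARN and period are placeholders, not our actual configuration.

```python
def sms_spend_alarm_params(monthly_limit_usd: float,
                           alert_fraction: float = 0.9,
                           topic_arn: str = "arn:aws:sns:REGION:ACCOUNT_ID:monitoring") -> dict:
    """Build keyword arguments for a CloudWatch alarm on month-to-date SMS spend.

    All names and the ARN are hypothetical placeholders.
    """
    if not 0 < alert_fraction < 1:
        raise ValueError("alert_fraction must be between 0 and 1")
    return {
        "AlarmName": "sms-monthly-spend-warning",  # hypothetical alarm name
        "Namespace": "AWS/SNS",
        "MetricName": "SMSMonthToDateSpentUSD",
        "Statistic": "Maximum",
        "Period": 3600,  # assumed evaluation period, in seconds
        "EvaluationPeriods": 1,
        # Alert once spend reaches the chosen fraction of the limit.
        "Threshold": round(monthly_limit_usd * alert_fraction, 2),
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [topic_arn],
    }


params = sms_spend_alarm_params(300)
print(params["Threshold"])  # 270.0, i.e. 90% of the new $300 limit
```

The resulting dict could be passed to boto3's `put_metric_alarm(**params)` on a CloudWatch client. Deriving the threshold from the limit keeps the two values in step if the limit is changed again.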
Timeline
BST unless otherwise stated
01:10 - Alert sent to #monitoring channel on Slack that we had reached 90% of our SMS limit
10:00 - Daily alarm reminder sent to #monitoring channel
10:00 - Daily alarm reminder sent to #monitoring channel
12:59 - SMS limit exceeded, messages no longer being sent
20:17 - TIS dev team member unable to sign in using SMS MFA - no message received, thought to be a phone verification issue
22:22 - TIS dev team member unable to verify phone number to set up SMS MFA - problem identified as SMS limit
22:37 - SMS limit increased to $300 - verified able to receive messages again
22:41 - All clear notification on #monitoring Slack channel
Root Cause(s)
The SMS costs exceeded the configured limits
The limit was not set appropriately
Data-based limits not yet set; still using guesswork
The alert for reaching 90% of the limit was not seen
Noisy #monitoring channel due to
tis-log-size alerts
Placement sync log spam
Prod → Stage DB sync log spam
Alarms not configured appropriately, e.g. requiring more datapoints before changing state would help avoid the yo-yoing of the alert status, which causes Slack spam
The daily CloudWatch alarm reminders were not checked/actioned
Only 3 alarms are visible in the Slack notice (plus an ‘and X more…’ message), and these tend to be the usual tis-trainee-sync DLQ errors, which are often ignored
The SMS alarm would only be seen by visiting the CloudWatch page directly
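To illustrate why requiring more datapoints reduces Slack noise, the sketch below compares a "1 out of 1" rule with a "3 out of 3" rule on a metric that hovers around its threshold. This is a simplified model of an M-out-of-N alarm, not CloudWatch's exact evaluation logic, and the metric values are made up for the example.

```python
def alarm_states(values, threshold, evaluation_periods, datapoints_to_alarm):
    """Return the alarm state after each datapoint, using an M-out-of-N rule."""
    states = []
    for i in range(len(values)):
        # Look at the most recent `evaluation_periods` datapoints.
        window = values[max(0, i - evaluation_periods + 1): i + 1]
        breaching = sum(v >= threshold for v in window)
        states.append("ALARM" if breaching >= datapoints_to_alarm else "OK")
    return states


def count_transitions(states):
    """Count state changes; each one would produce a notification."""
    return sum(1 for a, b in zip(states, states[1:]) if a != b)


# Made-up metric that flaps around a threshold of 10 before settling above it.
values = [9, 11, 9, 11, 9, 11, 12, 13]

print(count_transitions(alarm_states(values, 10, 1, 1)))  # 5
print(count_transitions(alarm_states(values, 10, 3, 3)))  # 1
```

With the stricter rule the alarm changes state once, when the metric has stayed above the threshold for three consecutive datapoints, at the cost of a slightly later alert.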
Action Items
Action Items | Owner |
---|---|
Set up more appropriate SMS limits | |
Trial incorporating a review of daily alarm reminders in standup | |
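One possible way to replace guesswork with a data-based limit is to take the highest recent monthly spend, add headroom, and round up to a convenient step. The sketch below is only an illustration: the function name, the 50% headroom, the $50 step and the spend figures are all placeholder assumptions, not measured values.

```python
import math


def suggest_monthly_limit(monthly_spend_usd, headroom=0.5, step=50):
    """Suggest a spend limit: peak historical spend plus headroom, rounded up to `step`."""
    if not monthly_spend_usd:
        raise ValueError("at least one month of spend data is required")
    target = max(monthly_spend_usd) * (1 + headroom)
    return math.ceil(target / step) * step


# Placeholder figures for illustration only, not real billing data.
print(suggest_monthly_limit([120, 150, 205]))  # 350
```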
Lessons Learned
Monitoring alerts are useless unless they are noticed/actioned