2025-02-05 - TSS Email notifications not sent
Date | 5 Feb 2025 (and other dates) |
Authors | @Reuben Roberts |
Status | Done |
Summary | Some email notifications were not sent during periods of peak load |
Impact | A small proportion of trainees did not receive their day-one programme or 12-week placement email notification |
Non-technical Description
Email notifications are sent to trainee doctors at various points to remind them to conduct various onboarding tasks for programmes or placements. When these are sent, we use AWS to retrieve the correct email address from their TSS login account, since it is possible this may differ from the email address held by TIS. When very large numbers of emails need to be sent at the same time, AWS rejects attempts to retrieve the email address, since it limits these requests to at most 30 per second. In these cases, the email notification is not sent.
Trigger
Notifications were scheduled at a fixed time (midnight UTC) for a large number of programmes / placements starting on the same day, creating a spike in requests to AWS Cognito.
Detection
LO reported missing notifications.
5 Whys (or other analysis of Root Cause)
Some notification emails were not sent when large numbers of scheduled notifications were being processed
When attempting to send the email, an AWS Cognito TooManyRequests exception was thrown
Quartz did not handle the exception and retry the scheduled event later
Sentry did not alert us to the (reoccurrence of the) issue
Example exception:
2025-02-05T00:00:49.199Z ERROR 1 --- [eduler_Worker-6] org.quartz.core.JobRunShell : Job DEFAULT.PROGRAMME_DAY_ONE-581f69ee-724a-45c3-b63d-290f4665439c threw an unhandled Exception:
software.amazon.awssdk.services.cognitoidentityprovider.model.TooManyRequestsException: Too many requests (Service: CognitoIdentityProvider, Status Code: 400, Request ID: 3bb088cc-8a7d-4aab-a5ce-75471c8d2cbc)
Resolution
Added randomness to the scheduler to avoid spikes in notifications at midnight.
Missed notification emails were not resent. Given the long gap between the event and its detection, those notifications would have been of little value and could have caused confusion.
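The randomisation described above can be illustrated with a minimal sketch. This is not the code that was deployed: the class `JitteredSchedule`, its method names, and the 30-minute window are hypothetical, and it assumes one Cognito lookup per notification. It shows a nominal midnight fire time being shifted by a random offset, plus the arithmetic for the smallest window that keeps the average request rate under the 30 per second limit for the 1303 programme memberships in this incident.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Random;

/** Hypothetical helper: spreads fixed-time jobs across a window after the nominal time. */
final class JitteredSchedule {

    private JitteredSchedule() {}

    /** Returns the nominal fire time plus a random delay in [0, window). */
    static Instant withJitter(Instant nominal, Duration window, Random random) {
        if (window.isZero() || window.isNegative()) {
            return nominal; // no jitter requested
        }
        long offsetMillis = (long) (random.nextDouble() * window.toMillis());
        return nominal.plusMillis(offsetMillis);
    }

    /** Smallest window (whole seconds) that keeps the AVERAGE rate at or below the limit. */
    static Duration minimumWindow(int jobCount, int maxPerSecond) {
        if (maxPerSecond <= 0) {
            throw new IllegalArgumentException("maxPerSecond must be positive");
        }
        long seconds = (jobCount + maxPerSecond - 1L) / maxPerSecond; // ceiling division
        return Duration.ofSeconds(seconds);
    }

    public static void main(String[] args) {
        Instant midnight = Instant.parse("2025-02-05T00:00:00Z");
        // Figures from this incident: 1303 starts, 30 requests per second.
        System.out.println("Minimum window: " + minimumWindow(1303, 30));
        // Assumed window of 30 minutes, far wider than the minimum, to absorb random bursts.
        Random random = new Random();
        System.out.println("Jittered time: " + withJitter(midnight, Duration.ofMinutes(30), random));
    }
}
```

Random offsets only bound the average rate, so short bursts above the limit remain possible. A window much wider than the computed minimum reduces that risk but does not remove it.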
Timeline
All times UTC unless otherwise indicated.
Feb 5, 2025 1303 PMs start, including 1213 in the rollout (i.e. excluding North East). 64 day-one notifications fail to be sent out.
Mar 5, 2025 LO queries apparently missed notifications to 3 trainees in Thames Valley
Mar 12, 2025 Randomisation added to scheduling to avoid this issue
Mar 12, 2025 Investigation into lack of Sentry alert begins
Action Items
Action Items | Owner |
---|---|
 | @Reuben Roberts / @Ogbeide Godstime Osemenkhian |
Add queueing to scheduled notifications to better manage throttling and retrying of failed notifications (to be tasked in due course). | |
Lessons Learned
Quartz retries did not work out of the box for this exception, and Sentry did not alert us.
We may need to consider stress-testing our components if they are not protected from overload by a queueing system.
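As an illustration of handling throttling explicitly rather than relying on scheduler defaults, the following is a minimal sketch of retrying a call with exponential backoff. It is a generic pattern, not the team's implementation: `ThrottleRetry`, `callWithBackoff`, and the parameter values are hypothetical, and the predicate stands in for a check such as "is this a `TooManyRequestsException`".

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;

/** Hypothetical helper: retries a call when the failure is classified as throttling. */
final class ThrottleRetry {

    private ThrottleRetry() {}

    static <T> T callWithBackoff(Callable<T> call,
                                 Predicate<Exception> isThrottled,
                                 int maxAttempts,
                                 Duration initialDelay) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Duration delay = initialDelay;
        for (int attempt = 1; ; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                // Give up on non-throttling errors or when attempts are exhausted.
                if (!isThrottled.test(e) || attempt >= maxAttempts) {
                    throw e;
                }
                Thread.sleep(delay.toMillis());
                delay = delay.multipliedBy(2); // double the wait before the next attempt
            }
        }
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger calls = new AtomicInteger();
        // Simulated lookup that is "throttled" on its first two attempts.
        String email = callWithBackoff(
                () -> {
                    if (calls.incrementAndGet() < 3) {
                        throw new IllegalStateException("simulated throttling");
                    }
                    return "trainee@example.com";
                },
                e -> e instanceof IllegalStateException,
                5,
                Duration.ofMillis(100));
        System.out.println(email + " after " + calls.get() + " attempts");
    }
}
```

Backoff alone does not cap the overall request rate when many jobs retry at once, which is one reason a queue in front of the lookup may still be worth considering.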
Related content
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213