Date	5 Feb 2025 (and other dates)
Authors	@Reuben Roberts
Status	Done
Summary	Some email notifications were not sent at periods of peak load
Impact	A small proportion of trainees would not have received their day-one programme or 12-week placement emailed notification

Non-technical Description

Email notifications are sent to trainee doctors at various points to remind them to conduct various onboarding tasks for programmes or placements. When these are sent, we use AWS to retrieve the correct email address from their TSS login account, since it is possible this may differ from the email address held by TIS. When very large numbers of emails need to be sent at the same time, AWS rejects attempts to retrieve the email address, since it limits these requests to at most 30 per second. In these cases, the email notification is not sent.

Trigger

Notifications were scheduled at a fixed time (midnight UTC) for a large number of programmes / placements starting on the same day, creating a spike in requests to AWS Cognito.

Detection

LO reported missing notifications.

5 Whys (or other analysis of Root Cause)

Some notification emails were not sent when large numbers of scheduled notifications were being processed
When attempting to send the email, an AWS Cognito TooManyRequests exception was thrown
Quartz did not handle the exception and retry the scheduled event later
Sentry did not alert us to the (reoccurrence of the) issue

Programmes starting 5 Feb 2025 (snapshot as of 10 Mar 2025, so spikes of programmes later in the year may still develop)

Example exception:

2025-02-05T00:00:49.199Z ERROR 1 --- [eduler_Worker-6] org.quartz.core.JobRunShell : Job DEFAULT.PROGRAMME_DAY_ONE-581f69ee-724a-45c3-b63d-290f4665439c threw an unhandled Exception:
software.amazon.awssdk.services.cognitoidentityprovider.model.TooManyRequestsException: Too many requests (Service: CognitoIdentityProvider, Status Code: 400, Request ID: 3bb088cc-8a7d-4aab-a5ce-75471c8d2cbc)
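
One way to stop a throttling exception like the one above from escaping the job is a bounded retry with exponential backoff around the email-address lookup. The following is a minimal sketch only, not the fix that was applied. `ThrottledException` is a hypothetical stand-in for the SDK's `TooManyRequestsException` (so the sketch needs no AWS dependency), and `ThrottleRetry`, `callWithBackoff`, the attempt count and the delays are all illustrative assumptions.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for the SDK's TooManyRequestsException.
class ThrottledException extends RuntimeException {
  ThrottledException(String message) {
    super(message);
  }
}

// Hypothetical helper, not part of the notification service.
final class ThrottleRetry {
  private ThrottleRetry() {}

  // Calls the lookup, retrying on ThrottledException and doubling the delay each time.
  // Rethrows the last ThrottledException once maxAttempts calls have been made.
  static <T> T callWithBackoff(Supplier<T> lookup, int maxAttempts, long initialDelayMillis) {
    long delay = initialDelayMillis;
    for (int attempt = 1; ; attempt++) {
      try {
        return lookup.get();
      } catch (ThrottledException e) {
        if (attempt >= maxAttempts) {
          throw e;
        }
        sleep(delay);
        delay *= 2;
      }
    }
  }

  private static void sleep(long millis) {
    try {
      Thread.sleep(millis);
    } catch (InterruptedException e) {
      // Restore the interrupt flag and give up rather than keep retrying.
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while backing off", e);
    }
  }
}
```

Retrying inside the job only helps with short bursts. If many jobs back off and retry at the same moments, they can hit the limit again, so this would normally be combined with spreading the load.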

Resolution

Added randomness to the scheduler to avoid spikes in notifications at midnight.
Missed notification emails were not resent. Because of the long interval between the event and its detection, those notifications would have been less valuable and might have caused confusion.
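
The idea behind the randomisation can be sketched as follows. This is an illustration under stated assumptions, not the code that was deployed: `JitteredSchedule`, `jitteredFireTime` and the one-hour window used in the usage note are hypothetical, and the actual spread applied in the scheduler is not recorded here.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper: spreads notifications that are due at the same instant over a window.
final class JitteredSchedule {
  private JitteredSchedule() {}

  // Returns a fire time in [scheduled, scheduled + window).
  // A zero or negative window leaves the scheduled time unchanged.
  static Instant jitteredFireTime(Instant scheduled, Duration window) {
    if (window.isZero() || window.isNegative()) {
      return scheduled;
    }
    long offsetMillis = ThreadLocalRandom.current().nextLong(window.toMillis());
    return scheduled.plusMillis(offsetMillis);
  }
}
```

For example, `JitteredSchedule.jitteredFireTime(Instant.parse("2025-02-05T00:00:00Z"), Duration.ofHours(1))` would return some instant between 00:00 and 01:00 UTC, so jobs that previously all fired at midnight are spread across the hour.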

Timeline

All times UTC unless otherwise indicated.

Feb 5, 2025	1303 PMs start, including 1213 in the rollout (i.e. excluding North East). 64 day-one notifications fail to be sent out.
Mar 5, 2025	LO queries apparently missed notifications to 3 trainees in Thames Valley
Mar 12, 2025	Randomisation added to scheduling to avoid this issue
Mar 12, 2025	Investigation into lack of Sentry alert begins

Action Items

Action Items	Owner

https://hee-tis.atlassian.net/browse/TIS21-7092	@Reuben Roberts / @Ogbeide Godstime Osemenkhian
Add queueing to scheduled notifications to better manage the throttling and retrying of failed notifications. To be tasked in due course.
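
As a rough illustration of the queueing idea, the sketch below drains pending notifications in fixed-size batches, one batch per tick, and puts throttled items back on the queue for a later tick. It is an assumption-laden sketch rather than a proposed design: `NotificationQueue` and its methods are hypothetical, the queue is in-memory only, and a per-tick limit such as 25 is simply an example value below the 30 requests per second limit described above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical in-memory queue; a production version would more likely use a durable queue.
final class NotificationQueue {
  private final Deque<String> pending = new ArrayDeque<>();
  private final int maxPerTick;

  NotificationQueue(int maxPerTick) {
    if (maxPerTick <= 0) {
      throw new IllegalArgumentException("maxPerTick must be positive");
    }
    this.maxPerTick = maxPerTick;
  }

  void enqueue(String notificationId) {
    pending.addLast(notificationId);
  }

  // Intended to be called once per tick (for example once a second):
  // removes and returns at most maxPerTick items, oldest first.
  List<String> nextBatch() {
    List<String> batch = new ArrayList<>();
    while (batch.size() < maxPerTick && !pending.isEmpty()) {
      batch.add(pending.pollFirst());
    }
    return batch;
  }

  // A notification whose lookup was throttled goes to the back of the queue.
  void requeue(String notificationId) {
    pending.addLast(notificationId);
  }

  int size() {
    return pending.size();
  }
}
```

With one batch per second, the number of lookups started in any second is capped by `maxPerTick`, and a throttled notification is delayed rather than lost.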

Lessons Learned

Quartz retries did not work out-of-the-box for this exception, and Sentry did not alert us.
We may need to consider stress-testing our components if they are not protected from overload by a queueing system.

2025-02-05 - TSS Email notifications not sent