2025-02-05 - TSS Email notifications not sent


Date

5 Feb 2025 (and other dates)

Authors

@Reuben Roberts

Status

Done

Summary

Some email notifications were not sent during periods of peak load.

Impact

A small proportion of trainees will not have received their day-one programme or 12-week placement email notification.

Non-technical Description

Email notifications are sent to trainee doctors at various points to remind them to complete onboarding tasks for programmes or placements. When these are sent, we use AWS to retrieve the correct email address from the trainee's TSS login account, since this may differ from the email address held by TIS. When very large numbers of emails need to be sent at the same time, AWS rejects some attempts to retrieve the email address, because it limits these requests to at most 30 per second. In these cases, the email notification is not sent.


Trigger

Notifications were scheduled at a fixed time (midnight UTC), so a large number of programmes / placements starting on the same day created a spike in requests to AWS Cognito.


Detection

LO reported missing notifications.


5 Whys (or other analysis of Root Cause)

  • Some notification emails were not sent when large numbers of scheduled notifications were being processed

  • When attempting to send the email, an AWS Cognito TooManyRequests exception was thrown

  • Quartz did not handle the exception, so the scheduled event was not retried later

  • Sentry did not alert us to the (reoccurrence of the) issue

Programmes starting 5 Feb 2025 (snapshot as of 10 Mar 2025, so spikes of programmes later in the year may still develop)

Example exception:

2025-02-05T00:00:49.199Z ERROR 1 --- [eduler_Worker-6] org.quartz.core.JobRunShell : Job DEFAULT.PROGRAMME_DAY_ONE-581f69ee-724a-45c3-b63d-290f4665439c threw an unhandled Exception:
software.amazon.awssdk.services.cognitoidentityprovider.model.TooManyRequestsException: Too many requests (Service: CognitoIdentityProvider, Status Code: 400, Request ID: 3bb088cc-8a7d-4aab-a5ce-75471c8d2cbc)
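
For illustration, the retry that did not happen could take roughly the following shape. This is a minimal sketch under stated assumptions, not the service's actual code: `ThrottledException` is a hypothetical stand-in for the SDK's `TooManyRequestsException`, and `callWithBackoff`, the attempt limit and the initial delay are assumed names and values.

```java
import java.util.function.Supplier;

/** Illustrative only: retries a throttled call with exponential backoff. */
final class ThrottleRetry {

  private ThrottleRetry() {}

  /** Hypothetical stand-in for the SDK's TooManyRequestsException. */
  static final class ThrottledException extends RuntimeException {
    ThrottledException(String message) {
      super(message);
    }
  }

  /**
   * Runs the call, retrying only when it is throttled. The delay doubles
   * after each failed attempt; the last failure is rethrown.
   */
  static <T> T callWithBackoff(Supplier<T> call, int maxAttempts, long initialDelayMillis) {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be at least 1");
    }
    long delay = initialDelayMillis;
    for (int attempt = 1; ; attempt++) {
      try {
        return call.get();
      } catch (ThrottledException e) {
        if (attempt >= maxAttempts) {
          throw e; // give up: the caller decides what to do next
        }
        sleep(delay);
        delay *= 2;
      }
    }
  }

  private static void sleep(long millis) {
    try {
      Thread.sleep(millis);
    } catch (InterruptedException ie) {
      Thread.currentThread().interrupt(); // preserve the interrupt flag
      throw new IllegalStateException("Interrupted while backing off", ie);
    }
  }

  public static void main(String[] args) {
    int[] calls = {0};
    // Simulated lookup that is throttled twice before succeeding.
    String email = callWithBackoff(() -> {
      if (++calls[0] < 3) {
        throw new ThrottledException("Too many requests");
      }
      return "trainee@example.com";
    }, 5, 100L);
    System.out.println(email + " after " + calls[0] + " attempts");
  }
}
```

In a scheduled job, the call that looks up the email address would be the one wrapped in `callWithBackoff`. Whether that suits the existing job structure has not been checked here.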


 

Resolution

  • Added randomness to the scheduler to avoid spikes in notifications at midnight.

  • Missed notification emails were not resent. Because of the long delay between the event and its detection, they would have been of little value and might have caused confusion.
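
The randomisation described above can be sketched as follows. The helper name `withJitter` and the one-hour window are assumptions for illustration only; this report does not state the window that was actually chosen.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.ThreadLocalRandom;

/** Illustrative only: spreads a fixed trigger time over a window. */
final class JitteredSchedule {

  private JitteredSchedule() {}

  /**
   * Returns a time chosen uniformly at random from [base, base + window),
   * at millisecond resolution. A zero or negative window returns base.
   */
  static Instant withJitter(Instant base, Duration window) {
    long windowMillis = window.toMillis();
    if (windowMillis <= 0) {
      return base; // nothing to spread over
    }
    long offsetMillis = ThreadLocalRandom.current().nextLong(windowMillis);
    return base.plusMillis(offsetMillis);
  }

  public static void main(String[] args) {
    Instant midnight = Instant.parse("2025-02-05T00:00:00Z");
    Duration window = Duration.ofHours(1); // assumed window length
    // Each notification gets its own fire time instead of all firing at midnight.
    for (int i = 0; i < 3; i++) {
      System.out.println(withJitter(midnight, window));
    }
  }
}
```

With a one-hour window, the 1303 notifications from 5 Feb would average well under one per second, although a uniform spread alone does not guarantee that the limit is never exceeded.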


Timeline

All times UTC unless otherwise indicated.

  • Feb 5, 2025: 1303 PMs start, including 1213 in the rollout (i.e. excluding North East). 64 day-one notifications fail to be sent out.

  • Mar 5, 2025: LO queries apparently missed notifications to 3 trainees in Thames Valley.

  • Mar 12, 2025: Randomisation added to scheduling to avoid this issue.

  • Mar 12, 2025: Investigation into lack of Sentry alert begins.


Action Items

Action Items

Owner

 

https://hee-tis.atlassian.net/browse/TIS21-7092

@Reuben Roberts / @Ogbeide Godstime Osemenkhian

 

Add queueing to scheduled notifications to better manage throttling and the retrying of failed notifications (to be tasked in due course).


Lessons Learned

  • Quartz retries did not work out of the box for this exception, and Sentry did not alert us.

  • We may need to consider stress-testing our components if they are not protected from overload by a queueing system.
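
As a rough illustration of what pacing behind a queue would achieve, the sketch below spaces queued notifications evenly so that no more than the permitted number of address lookups falls within any one second. The class and method names are hypothetical, and it assumes one lookup per notification; a real queueing system would also need to handle retries.

```java
/** Illustrative only: evenly paces queued items under a per-second limit. */
final class PacedDispatch {

  private PacedDispatch() {}

  /**
   * Delay in milliseconds before the item at the given queue position
   * (0-based) may be sent, so that at most maxPerSecond items are sent
   * in any one-second window.
   */
  static long delayMillis(int index, int maxPerSecond) {
    if (index < 0 || maxPerSecond < 1) {
      throw new IllegalArgumentException("index must be >= 0 and maxPerSecond >= 1");
    }
    return (long) index * 1000L / maxPerSecond;
  }

  public static void main(String[] args) {
    int queued = 1303; // PMs starting on 5 Feb 2025, per the timeline
    int limit = 30;    // request limit quoted in this report
    // The last queued item goes out 43400 ms after the first.
    System.out.println(delayMillis(queued - 1, limit));
  }
}
```

On these figures, the whole batch from 5 Feb could not have been sent in much under 44 seconds without exceeding the limit.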
