2024-12-12 Email Not Sent
Date | Dec 12, 2024 |
Authors | @Doris.Wong |
Status | Done |
Summary | Investigate SCHEDULED history records in the DB while the |
Impact | Inaccurate notification history records shown in TIS admins-UI; hundreds of notifications that hit a Cognito exception might not have been sent out |
Non-technical Description
This is an investigation into email notifications that have a past sent date but still have a SCHEDULED status.
We found two main reasons for this.
First, the schedule for some emails was cancelled because they were not in the pilot, but the notification history data was not updated. This did not affect the emails that were actually sent to trainees, but it creates a discrepancy between the notification history shown in TIS admins-UI and the emails that were actually sent.
We have created a ticket to fix the orphaned SCHEDULED notification records.
Second, requests to Cognito exceeded its limit while emails were being sent. About 500 emails hit this exception, most of them the Placement notice sent 12 weeks before the start date. We can confirm that the system re-attempts to send a failed notification to the trainee, so some of them may have been delivered successfully, but it is hard to know the figure on our side.
A ticket has been created, and we will work to fix these issues in the coming iterations.
Trigger
SCHEDULED Orphan
When shouldActuallySendEmail returns false for a job (for example, because it is not in the pilot), the notification is not actually sent to the trainee. (code) The trigger is consumed in Quartz, but the SCHEDULED notification is left orphaned in the DB. The SCHEDULED notification record is not deleted even when the placement/programme is updated later, because its sentAt date is already in the past when scheduled notifications are deleted from the DB.
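The orphan can be illustrated with a minimal in-memory sketch. Only shouldActuallySendEmail comes from the code base; the class, the store, and executeJob are hypothetical names used for illustration, and the pilot check is reduced to a boolean. The sketch shows one possible way to resolve the history record at execution time, since the trigger is consumed either way.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; every name except shouldActuallySendEmail is illustrative. */
final class OrphanCleanupSketch {

    enum Status { SCHEDULED, SENT }

    /** In-memory stand-in for the notification history collection. */
    static final class HistoryStore {
        private final Map<String, Status> records = new HashMap<>();

        void save(String id, Status status) { records.put(id, status); }
        void delete(String id) { records.remove(id); }
        Status find(String id) { return records.get(id); }
    }

    /** Stand-in for the real check, reduced here to the pilot flag (an assumption). */
    static boolean shouldActuallySendEmail(boolean inPilot) {
        return inPilot;
    }

    /**
     * Runs when the scheduler fires the trigger. The trigger is consumed either way,
     * so the history record is resolved here: marked SENT or removed.
     */
    static void executeJob(HistoryStore history, String notificationId, boolean inPilot) {
        if (shouldActuallySendEmail(inPilot)) {
            // The real job would send the email at this point.
            history.save(notificationId, Status.SENT);
        } else {
            // Without this delete the record would stay SCHEDULED (the orphan).
            history.delete(notificationId);
        }
    }

    public static void main(String[] args) {
        HistoryStore history = new HistoryStore();
        history.save("n1", Status.SCHEDULED);
        history.save("n2", Status.SCHEDULED);
        executeJob(history, "n1", true);   // in the pilot: sent
        executeJob(history, "n2", false);  // not in the pilot: record removed
        System.out.println("n1=" + history.find("n1") + ", n2=" + history.find("n2"));
    }
}
```

Whether the record should be deleted or given a separate status is a design choice for the follow-up ticket; the sketch uses deletion because that is what the Resolution section proposes.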
Cognito Exception
An unhandled exception is thrown when trying to execute the job. This exception is thrown when the user has made too many requests for a given operation to Cognito. (code)
Around 500 “TooManyRequestsException” occurrences were found in the log insights for scheduled job executions, but it is hard to know whether the notifications were actually sent unless we capture the delivered emails.
Detection
Email notifications with a past sent date were found still having a SCHEDULED status.
5 Whys (or other analysis of Root Cause)
SCHEDULED Orphan
The schedule for some emails was cancelled because they were not in the pilot, but the notification history data was not updated.
When the placement/programme is updated later, the SCHEDULED notification record cannot be deleted, because its sentAt date is already in the past when scheduled notifications are deleted from the DB.
Cognito Exception
A large number of scheduled jobs are triggered at once at midnight.
This produces too many requests to Cognito to get the trainee account details.
getCognitoAccountDetails() is called in both executeNow() and enrichJobDetails(), which would have doubled the requests to Cognito.
The exception is not caught, so Quartz keeps retrying, which increases the number of requests.
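The two request-multiplying factors above (the doubled lookup and the uncaught exception) can be sketched as follows. This is a minimal illustration, not the service's code: TooManyRequestsException here is a local stand-in rather than the AWS SDK class, and AccountDetails, fetchWithBackoff, runJob, enrich and send are hypothetical names. The sketch fetches the account details once per job and retries a throttled lookup a bounded number of times with a doubling delay.

```java
import java.util.function.Supplier;

/** Hypothetical sketch; none of these names are taken from the real service. */
final class CognitoThrottleSketch {

    /** Local stand-in for the Cognito throttling error; not the AWS SDK class. */
    static final class TooManyRequestsException extends RuntimeException {
        TooManyRequestsException(String message) { super(message); }
    }

    /** Illustrative holder for the details fetched from Cognito. */
    static final class AccountDetails {
        final String email;
        final String familyName;

        AccountDetails(String email, String familyName) {
            this.email = email;
            this.familyName = familyName;
        }
    }

    /**
     * Calls the lookup at most maxAttempts times, doubling the wait after each
     * throttled attempt, instead of letting the first exception escape.
     */
    static AccountDetails fetchWithBackoff(Supplier<AccountDetails> lookup,
                                           int maxAttempts, long initialDelayMs) {
        long delay = initialDelayMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return lookup.get();
            } catch (TooManyRequestsException e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up; the caller can record the failure
                }
                try {
                    Thread.sleep(delay);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while backing off", ie);
                }
                delay *= 2;
            }
        }
    }

    /** Fetches the details once and passes the same object to both steps. */
    static String runJob(Supplier<AccountDetails> lookup) {
        AccountDetails details = fetchWithBackoff(lookup, 3, 10);
        String body = enrich(details);
        return send(details, body);
    }

    static String enrich(AccountDetails d) { return "Dear " + d.familyName; }

    static String send(AccountDetails d, String body) { return d.email + ": " + body; }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated lookup that is throttled on the first call only.
        Supplier<AccountDetails> flaky = () -> {
            if (++calls[0] == 1) {
                throw new TooManyRequestsException("throttled");
            }
            return new AccountDetails("trainee@example.com", "Smith");
        };
        System.out.println(runJob(flaky) + " (lookups: " + calls[0] + ")");
    }
}
```

Spreading the midnight batch over time would address the first cause directly; the sketch covers only the per-job behaviour.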
Resolution
Make sure notifications in the History DB are removed when a job is executed but determined not to be sent
Remove historical SCHEDULED orphans
Improve log messages to be more descriptive, and avoid using “sent” when the job is only “executed” but not really sent
Fix findAllScheduledForTrainee() to filter by SCHEDULED status for email notifications (so past SCHEDULED notifications can be deleted)
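The last item can be illustrated by comparing two selection rules over in-memory records. This is a sketch under assumptions: the History type, its fields, and both method names are hypothetical, and the date-based rule is only a guess at how the current query behaves, inferred from the root cause described above.

```java
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical sketch; the record shape and method names are illustrative. */
final class ScheduledFilterSketch {

    enum Status { SCHEDULED, SENT }

    static final class History {
        final String id;
        final String traineeId;
        final Status status;
        final Instant sentAt;

        History(String id, String traineeId, Status status, Instant sentAt) {
            this.id = id;
            this.traineeId = traineeId;
            this.status = status;
            this.sentAt = sentAt;
        }
    }

    /** Assumed current rule: a future sentAt. Misses past-dated SCHEDULED records. */
    static List<History> scheduledByDate(List<History> all, String traineeId, Instant now) {
        return all.stream()
                .filter(h -> h.traineeId.equals(traineeId))
                .filter(h -> h.sentAt.isAfter(now))
                .collect(Collectors.toList());
    }

    /** Proposed rule: SCHEDULED status, regardless of sentAt. */
    static List<History> scheduledByStatus(List<History> all, String traineeId) {
        return all.stream()
                .filter(h -> h.traineeId.equals(traineeId))
                .filter(h -> h.status == Status.SCHEDULED)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2024-12-12T00:00:00Z");
        List<History> all = List.of(
                new History("orphan", "t1", Status.SCHEDULED, now.minusSeconds(86_400)),
                new History("future", "t1", Status.SCHEDULED, now.plusSeconds(86_400)),
                new History("done", "t1", Status.SENT, now.minusSeconds(86_400)));

        System.out.println("by date:   " + scheduledByDate(all, "t1", now).size());
        System.out.println("by status: " + scheduledByStatus(all, "t1").size());
    }
}
```

With the status-based rule the past-dated orphan is selected for deletion, while records that were actually sent are left alone.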
Timeline
All times GMT unless otherwise indicated.
Oct 28, 2024 The team found a number of notifications in Mongo with a past sentAt date and a status of SCHEDULED
In Dec, the ticket was picked up and investigated
Dec 16, 2024 3-amigos session within the team to discuss the findings
Dec 20, 2024 Team discussion of the solutions; the follow-up ticket was refined
Action Items
Action Items | Owner |
---|---|
Fix the code and remove SCHEDULED orphan from DB (with separate ticket) | @Doris.Wong |
See also:
Lessons Learned
Keep a close eye on Sentry alerts to identify potential issues at an early stage
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213