2024-12-12 Email Not Sent

Date

Dec 12, 2024

Authors

@Doris.Wong

Status

Done

Summary

Investigation into notification history records that remained SCHEDULED in the DB after their sentAt date had passed. Some were caused by unhandled exceptions; others were left orphaned after their Quartz triggers had been processed.

Impact

Inaccurate notification history records were shown in TIS admins-UI. Hundreds of notifications that hit a Cognito exception might not have been sent.

Non-technical Description

This is an investigation into email notifications that still had a SCHEDULED status after their sent date had passed.

We found two main reasons for this.

 

The first is that the schedule for some emails was cancelled because they were not in the pilot, but the notification history data was not updated. This did not affect the emails actually sent to trainees, but it created a discrepancy between the notification history shown in TIS admins-UI and the emails that were actually sent.

We have created a ticket to fix the orphaned SCHEDULED notification records.

 

The second issue is that requests to Cognito exceeded its limit while emails were being sent. About 500 emails hit this exception, most of them the Placement notice sent 12 weeks before the start date. We can confirm that the system re-attempts to send a failed notification to the trainee, so some of them may have been delivered successfully, but it is hard to know the figure on our side.
A ticket has been created, and we will work on fixing this issue in the coming iterations.


Trigger

SCHEDULED Orphan

When shouldActuallySendEmail evaluates to false for a job (for example, because it is not in the pilot), the notification is not sent to the trainee. (code) The trigger is consumed in Quartz, but the SCHEDULED notification is left orphaned in the DB. The record is not deleted even when the placement/programme is updated later, because its sentAt date is already in the past when scheduled notifications are deleted from the DB.
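The sequence above can be modelled in a few lines. This is only an illustrative sketch, not the service code: HistoryStore, executeJob, the cleanUpWhenSkipped flag and the "SENT" status value are hypothetical names introduced here, and only shouldActuallySendEmail and the SCHEDULED status come from the report.

```java
import java.util.HashMap;
import java.util.Map;

public class OrphanSketch {

  /** In-memory stand-in (hypothetical) for the notification history collection. */
  public static final class HistoryStore {
    private final Map<String, String> statusById = new HashMap<>();

    public void save(String id, String status) {
      statusById.put(id, status);
    }

    public void delete(String id) {
      statusById.remove(id);
    }

    public String statusOf(String id) {
      return statusById.get(id);
    }
  }

  /**
   * Models one job execution. The trigger is consumed either way; what happens
   * to the history record depends on whether the email is really sent.
   */
  public static void executeJob(HistoryStore store, String id,
      boolean shouldActuallySendEmail, boolean cleanUpWhenSkipped) {
    if (shouldActuallySendEmail) {
      store.save(id, "SENT"); // assumed status name for a sent notification
    } else if (cleanUpWhenSkipped) {
      store.delete(id); // possible fix: no orphan is left behind
    }
    // Otherwise nothing is updated and the record stays SCHEDULED (the orphan).
  }

  public static void main(String[] args) {
    HistoryStore store = new HistoryStore();
    store.save("n1", "SCHEDULED");
    store.save("n2", "SCHEDULED");

    executeJob(store, "n1", false, false); // behaviour described in this report
    executeJob(store, "n2", false, true);  // behaviour with clean-up enabled

    System.out.println("n1: " + store.statusOf("n1")); // n1: SCHEDULED
    System.out.println("n2: " + store.statusOf("n2")); // n2: null
  }
}
```

In the model, n1 keeps its SCHEDULED status after the job has run, which matches the orphaned records found in the DB, while n2 is removed when the job decides not to send.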

 

Cognito Exception

An unhandled exception is thrown when trying to execute the job. It is thrown when too many requests for a given operation have been made to Cognito. (code)

Around 500 “TooManyRequestsException” occurrences were found in log insights for scheduled job executions, but it is hard to know whether the notifications were actually sent unless email delivery is captured.

image-20241212-144228.png

Detection

Email notifications with a past sent date were found still having a SCHEDULED status.

 


5 Whys (or other analysis of Root Cause)

SCHEDULED Orphan

  • The schedule for some emails was cancelled because they were not in the pilot, but the notification history data was not updated.

  • When the placement/programme was updated later, the SCHEDULED notification record could not be deleted, because its sentAt date was already in the past when scheduled notifications were deleted from the DB

Cognito Exception

  • A large number of scheduled jobs were triggered at the same time at midnight

  • This produced too many requests to Cognito for trainee account details

  • getCognitoAccountDetails() is called in both executeNow() and enrichJobDetails(), which would have doubled the requests to Cognito

  • The exception was not caught, so Quartz kept retrying and increased the number of requests
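One way to keep a throttled lookup from turning into unbounded retries is to catch the throttling error and retry a limited number of times with an increasing delay. The sketch below is a minimal illustration under stated assumptions, not the service's implementation: ThrottledException is a local stand-in for Cognito's TooManyRequestsException, and lookupWithBackoff, maxAttempts and initialDelayMs are hypothetical names. The looked-up value could also be passed on to later steps rather than fetched a second time.

```java
import java.util.function.Function;

public class AccountLookupSketch {

  /** Local stand-in for the throttling error raised by the account lookup. */
  public static class ThrottledException extends RuntimeException {
    public ThrottledException(String message) {
      super(message);
    }
  }

  /**
   * Calls the lookup and retries with a doubling delay when it is throttled.
   * The last throttling error is rethrown once maxAttempts is used up, so the
   * caller can record the failure instead of retrying without limit.
   */
  public static <T> T lookupWithBackoff(Function<String, T> lookup, String traineeId,
      int maxAttempts, long initialDelayMs) {
    long delay = initialDelayMs;
    for (int attempt = 1; ; attempt++) {
      try {
        return lookup.apply(traineeId);
      } catch (ThrottledException e) {
        if (attempt >= maxAttempts) {
          throw e;
        }
        try {
          Thread.sleep(delay);
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt(); // keep the interrupt flag set
          throw new IllegalStateException("Interrupted while backing off", ie);
        }
        delay *= 2;
      }
    }
  }

  public static void main(String[] args) {
    int[] calls = {0};
    // Fake lookup that is throttled twice and then succeeds.
    Function<String, String> flaky = id -> {
      if (++calls[0] < 3) {
        throw new ThrottledException("Rate exceeded");
      }
      return id + "@example.com";
    };
    String email = lookupWithBackoff(flaky, "trainee-1", 5, 10);
    System.out.println(email + " after " + calls[0] + " calls");
    // prints "trainee-1@example.com after 3 calls"
  }
}
```

Capping the attempts means a burst of jobs at midnight adds a bounded number of extra requests, and the failure becomes visible to the caller, which can then mark the notification accordingly.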

 


Resolution

  • Make sure the notification in the History DB is removed when a job is executed but determined not to be sent

  • Remove historical SCHEDULED orphans

  • Improve log messages to be more descriptive, and avoid using “sent” when the job is only “executed” but not actually sent

  • Fix findAllScheduledForTrainee() to filter by SCHEDULED status for email notifications (so that past SCHEDULED notifications can be deleted)
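The difference between selecting by date and selecting by status can be illustrated with a small in-memory example. It is a sketch under assumptions: the History class, both method names and the "SENT" status value are hypothetical, and the date-based variant reflects what this report implies about the existing lookup rather than its actual code.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

public class ScheduledFilterSketch {

  /** Hypothetical, simplified history record; the real one has more fields. */
  public static final class History {
    public final String traineeId;
    public final String status;
    public final Instant sentAt;

    public History(String traineeId, String status, Instant sentAt) {
      this.traineeId = traineeId;
      this.status = status;
      this.sentAt = sentAt;
    }
  }

  /** Date-based selection: skips SCHEDULED records whose sentAt has passed. */
  public static List<History> findByFutureSentAt(List<History> all, String traineeId,
      Instant now) {
    List<History> result = new ArrayList<>();
    for (History h : all) {
      if (h.traineeId.equals(traineeId) && h.sentAt.isAfter(now)) {
        result.add(h);
      }
    }
    return result;
  }

  /** Status-based selection: returns every SCHEDULED record, past or future. */
  public static List<History> findByScheduledStatus(List<History> all, String traineeId) {
    List<History> result = new ArrayList<>();
    for (History h : all) {
      if (h.traineeId.equals(traineeId) && "SCHEDULED".equals(h.status)) {
        result.add(h);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Instant now = Instant.parse("2024-12-12T00:00:00Z");
    List<History> all = List.of(
        new History("t1", "SCHEDULED", Instant.parse("2024-10-01T00:00:00Z")), // orphan
        new History("t1", "SCHEDULED", Instant.parse("2025-01-15T00:00:00Z")),
        new History("t1", "SENT", Instant.parse("2024-09-01T00:00:00Z")));

    System.out.println(findByFutureSentAt(all, "t1", now).size());  // 1
    System.out.println(findByScheduledStatus(all, "t1").size());    // 2
  }
}
```

With the status-based selection the orphaned record from October is included, so it can be deleted when the placement/programme is updated.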


Timeline

All times GMT unless otherwise indicated.

  • Oct 28, 2024 The team found a number of notifications in Mongo with a past sentAt date and a status of SCHEDULED

  • In Dec, the ticket was picked up and investigated

  • Dec 16, 2024 3-amigos session within the team to discuss the findings

  • Dec 20, 2024 Team discussion of the solutions; the follow-up ticket was refined


Action Items

Action Items

Owner

 

Fix the code and remove the orphaned SCHEDULED records from the DB (tracked in a separate ticket)

@Doris.Wong

https://hee-tis.atlassian.net/browse/TIS21-6800

See also:


Lessons Learned

  • We should keep a close eye on Sentry alerts so that potential issues are identified at an early stage