2024-11-07 Failed email notifications

Date

Nov 7, 2024

Authors

@Doris.Wong

Status

Done

Summary

Investigate emails that failed for different reasons. Resolve the “no email address available” issue so that trainee email addresses are synced from TIS and emails can be sent. Identify the cause of the large number of bounced emails in the report and raise a follow-up to fix it.

Impact

Emails were not sent (recorded as ‘FAILED’) because TSS had no record of some new trainees' email addresses. In addition, the report did not reflect the actual status of bounced emails that were later resent.

Non-technical Description

We investigated the list of failed emails from London & KSS.
Most of the emails failed with a “permanent” or “transient” bounce, while some failed with the reason “no email address available”.

 

The “no email address available” issue is covered and fixed under a separate incident: 2024-11-07 New trainee profiles not triggering a full TSS data sync

(Copied from the related incident) When a trainee profile is created for the first time, TSS sends a “DataRequest” message to TIS to trigger a full data sync for the matching Person record. However, the TSS sync service has been failing to process those messages since August 2024. As a consequence, these new profiles may be missing key information that is needed to populate and send onboarding notifications, e.g. trainee email address, since TSS would otherwise only get this information if it happened to be updated in TIS after the trainee profile was created. A large number of recent notifications are recorded as ‘failed’ due to a missing email address.

 

Separately, an email receives a transient bounce when a temporary issue occurs, such as the recipient's mailbox being full. By default, Amazon SES retries delivery of transient-bounced emails for a period of time.

We stored a failed record in the DB whenever an email failed to deliver, but had no process to update the status when the email was later resent successfully. As a result, the notifications report may not show the real delivery status. To fix this, we have created a follow-up ticket to update the resend delivery status in the notification history, so that the DB records and reports are accurate in future.

 


Trigger

No email address available:
(Copied from the related incident) A version update of a library component used to send and receive messages (spring-cloud-aws upgraded from 2.4.x to 3.1.x, which we implemented in Dec 2023) introduced a message attribute used for deserialization. The attribute named a class from the tis-trainee-details service (JavaType=uk.nhs.hee.trainee.details.event.ProfileCreateEvent). The SQS listener in tis-trainee-sync then tried, and failed, to deserialize the message to that class, since it does not have access to it.
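
The failure mode described above can be illustrated with a small, language-neutral simulation. This is only a sketch in Python, not the spring-cloud-aws implementation: the `DataRequest` class, its `tis_id` field, the `KNOWN_TYPES` registry and the `deserialize` function are all hypothetical names chosen for illustration. It shows a receiver that trusts a type-hint attribute failing when the named class is unknown to it, and succeeding once the attribute is omitted.

```python
import json


class DataRequest:
    """Hypothetical payload type known to the receiving service."""

    def __init__(self, tis_id):
        self.tis_id = tis_id


# Hypothetical registry standing in for the classes the receiver can load.
KNOWN_TYPES = {"DataRequest": DataRequest}


def deserialize(body, attributes, default_type="DataRequest"):
    """Use the JavaType attribute as the target type if present,
    otherwise fall back to the type the listener itself expects."""
    type_name = attributes.get("JavaType", default_type)
    if type_name not in KNOWN_TYPES:
        # Mirrors the incident: the sender's class is not available here.
        raise LookupError(f"cannot deserialize: unknown type {type_name}")
    return KNOWN_TYPES[type_name](**json.loads(body))


body = '{"tis_id": "123"}'

# With the sender's class name attached, the receiver cannot resolve the type.
try:
    deserialize(body, {"JavaType": "uk.nhs.hee.trainee.details.event.ProfileCreateEvent"})
except LookupError as error:
    print(error)

# With the attribute omitted (the fix applied on the sending side),
# the receiver falls back to its own type and succeeds.
print(deserialize(body, {}).tis_id)
```

Omitting the attribute at the sender keeps the receiver independent of the sender's class names, which is the approach taken in the resolution for this incident.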

Permanent bounced email:
Permanent bounces occurred in our email notifications with the reasons General or OnAccountSuppressionList.

Transient bounced email:
Transient bounces occurred in our email notifications with the reasons General, MailboxFull or ContentRejected.

When a soft bounce (a bounce related to a temporary issue, such as the recipient's inbox being full) occurs, Amazon SES attempts to redeliver the email for a certain period of time. At the end of that period of time, if Amazon SES still can't deliver the email, it stops trying.
Currently, when notification emails fail with a bounce or complaint, the failure events are published to the SNS topic (arn:aws:sns:eu-west-2:430723991443:tis-trainee-ses-bounce-complaint-event). The tis-trainee-notification service listens to these events and stores the failure status in the notification history collection. However, we do not currently capture the resend result or update the DB, so successfully resent emails may still be counted as failed in the report.
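
A minimal sketch of the missing status update is shown below. It is written in Python for brevity and is not the tis-trainee-notification code. It assumes that delivery events are also published alongside bounce and complaint events and carry the same `mail.messageId`. The `SENT` status name, the `apply_ses_event` function and the in-memory `history` dictionary (standing in for the notification history collection) are hypothetical.

```python
FAILED = "FAILED"  # status named in this report
SENT = "SENT"      # assumed name for a successfully delivered status


def apply_ses_event(history, event):
    """Update an in-memory notification history from one SES event.

    `history` maps an SES message ID to a record with `status` and `detail`.
    """
    # SES feedback notifications carry "notificationType";
    # SES event publishing carries "eventType" instead.
    kind = event.get("notificationType") or event.get("eventType")
    message_id = event["mail"]["messageId"]
    record = history.setdefault(message_id, {})

    if kind == "Bounce":
        bounce = event["bounce"]
        record["status"] = FAILED
        record["detail"] = f'{bounce["bounceType"]}: {bounce["bounceSubType"]}'
    elif kind == "Complaint":
        record["status"] = FAILED
        record["detail"] = "Complaint"
    elif kind == "Delivery":
        # A later successful delivery replaces the earlier failure,
        # so that reports reflect the real outcome.
        record["status"] = SENT
        record["detail"] = None
    return record


history = {}
apply_ses_event(history, {
    "notificationType": "Bounce",
    "mail": {"messageId": "m1"},
    "bounce": {"bounceType": "Transient", "bounceSubType": "MailboxFull"},
})
apply_ses_event(history, {
    "notificationType": "Delivery",
    "mail": {"messageId": "m1"},
})
print(history["m1"]["status"])
```

The sketch applies events in arrival order and does not handle out-of-order delivery; a real implementation would likely compare event timestamps before overwriting a record.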

 

A detailed description of each bounce type is available at: https://docs.aws.amazon.com/ses/latest/dg/notification-contents.html#:~:text=unknown%20user%22%0A%7D-,Bounce%20types,-The%20bounce%20object


Detection

The issue was reported by Stuart from London as feedback at the Trainee review meeting on 7 Nov 2024.

 


5 Whys (or other analysis of Root Cause)

No email address available: (Copied from the related incident)

  • Emails were not sent because no trainee email address was available in TSS.

  • No trainee email address was available in TSS because it had not been sync’d from TIS.

  • Data had not been sync’d from TIS because no full-data request had been made from TSS, and the records in question had not been updated since the profile was created, which would have triggered a normal sync.

  • No full-data request had been made from TSS because the TSS sync service could not deserialize the messages instructing it to make the request.

  • The TSS sync service could not deserialize the messages because a new message attribute was instructing it to deserialize to a non-available class.

Transient Bounce emails:

  • Emails could not be delivered for transient reasons such as General, MailboxFull or ContentRejected.

  • Failed records were saved in the DB when delivery failed, but there was no process to update the status when an email was later resent successfully.

 


Resolution

  • For “no email address available”: tis-trainee-details was configured to omit the JavaType attribute from the messages it sends.

  • For the transient bounce count: a ticket was created to update the resend delivery status in the notification history, so that the DB records and reports are accurate.


Timeline

All times GMT unless otherwise indicated.

  • Nov 7, 2024 10:30 As part of the discussion around https://hee-tis.atlassian.net/browse/TIS21-6673 , it was mentioned that unusually large numbers of emails for trainee onboarding were being recorded as ‘failed’ due to no email address, but that the trainees in question had email addresses in TIS.

  • Nov 7, 2024 ~21:00 All production data redriven to ensure trainees have email addresses and other profile data populated.

  • Nov 12, 2024 ~13:00 Fix to tis-trainee-details to remove the JavaType attribute from profile-created messages.

  • Nov 21, 2024 ~16:30 Data still missing from the trainee profiles. Trainee profile data redriven to ensure trainees' email addresses and other profile data are populated.

  • Nov 27, 2024 Ticket created to update the resend delivery status in the notification history: https://hee-tis.atlassian.net/browse/TIS21-6746


Action Items

Action Items

Owner

 


Fix message attributes for no email available issues

@Reuben Roberts

Done - https://github.com/Health-Education-England/tis-trainee-details/pull/494

Redrive production data to populate trainee email addresses

@Reuben Roberts , @Doris.Wong

Done

Create a ticket to update the resend delivery status in the notification history, so that the DB records and reports are accurate

@Doris.Wong

Done - https://hee-tis.atlassian.net/browse/TIS21-6746



Lessons Learned

  • Make sure the status in the DB is updated so that it remains an accurate record, even when the change is made by a third-party application (here, redelivery by Amazon SES).