2023-10-16 Duplicate notification emails sent
Date | Oct 16, 2023 |
Authors | @Reuben Roberts |
Status | Done |
Summary | The notification service for Form R updates was sending duplicate emails to some doctors. |
Impact | Over the course of 3 days, some doctors received several duplicate confirmation emails within a short period of time (~15 min) after submitting their FormRs. |
Non-technical Description
When a FormR is submitted, an email is triggered notifying the PGDiT that their form has been received. Similar emails are sent if the LO un-submits or deletes the FormR. The queue that handles the list of messages that need to be sent was misconfigured, meaning that in some instances the same message was processed a number of times, resulting in duplicate emails to the PGDiT.
Trigger
The tis-trainee-notifications-prod-form-updated.fifo queue message visibility window was less than the time the tis-trainee-notifications service takes to cache user accounts. As such, the queue would assume the message had failed to be processed, and make it available to be picked up by the service again. This could happen up to 10 times (until the message would be sent to the dead-letter queue, though this was not observed to occur).
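The redelivery behaviour described above can be illustrated with a small simulation. This is a simplified model, not the service's code: it assumes every delivery attempt takes the same time and that a message is only deleted when processing finishes inside the visibility window. The 30-second window and the limit of 10 receives come from this report; the 45-second processing time and the function and class names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DeliveryResult:
    deliveries: int      # how many times the consumer received the message
    dead_lettered: bool  # True if the message ran out of receive attempts


def simulate_deliveries(processing_seconds: float,
                        visibility_timeout_seconds: float,
                        max_receive_count: int = 10) -> DeliveryResult:
    """Model one message on a queue with a visibility window.

    Simplifying assumption: each attempt takes `processing_seconds`, and the
    message is deleted only if that is within the visibility window.
    """
    deliveries = 0
    while deliveries < max_receive_count:
        deliveries += 1
        if processing_seconds <= visibility_timeout_seconds:
            # Processing finished in time, so the message is deleted.
            return DeliveryResult(deliveries, dead_lettered=False)
        # Otherwise the window expired and the message became visible again.
    return DeliveryResult(deliveries, dead_lettered=True)


# Hypothetical 45 s of processing against the 30 s default window.
print(simulate_deliveries(45, 30))
# Same processing time against a longer, 120 s window.
print(simulate_deliveries(45, 120))
```

In this model each delivery corresponds to one email, so a window shorter than the processing time multiplies emails up to the receive limit. In the incident itself the dead-letter queue was not reached, which this constant-duration model does not capture.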
Detection
A survey response complained of duplicate messages.
At most 122 doctors were affected, this being the number who received more than one email (some of these will have legitimately submitted both a FormR Part A and a Part B within a fairly short period of time, triggering two confirmations). 25 doctors were more definitely affected, having received 3 or more emails. The largest number of emails received by an individual doctor was 10, which happened to two PGDiTs.
Resolution
Queue message visibility window was increased.
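As a rough illustration of this kind of change, the sketch below picks a visibility window with headroom over the worst-case processing time and builds the corresponding SQS attribute. The report does not state the new value or how it was applied, so the safety factor, the 45-second figure and the helper names are assumptions; `set_queue_attributes` with the `VisibilityTimeout` attribute is the standard boto3 SQS call.

```python
import math

SQS_MAX_VISIBILITY_TIMEOUT = 12 * 60 * 60  # SQS upper limit of 12 hours, in seconds


def recommended_visibility_timeout(worst_case_processing_seconds: float,
                                   safety_factor: float = 2.0) -> int:
    """Return a window comfortably longer than the slowest expected processing."""
    timeout = math.ceil(worst_case_processing_seconds * safety_factor)
    return min(timeout, SQS_MAX_VISIBILITY_TIMEOUT)


def visibility_attributes(timeout_seconds: int) -> dict:
    """Build the attribute map; SQS expects attribute values as strings."""
    return {"VisibilityTimeout": str(timeout_seconds)}


def apply_visibility_timeout(queue_url: str, timeout_seconds: int) -> None:
    """Apply the window to a queue (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helpers above run without AWS dependencies

    sqs = boto3.client("sqs")
    sqs.set_queue_attributes(QueueUrl=queue_url,
                             Attributes=visibility_attributes(timeout_seconds))


# Hypothetical worst case of 45 s for a cache rebuild.
timeout = recommended_visibility_timeout(45)
print(visibility_attributes(timeout))
```

Sizing the window from a measured worst case, rather than relying on the default, keeps the setting tied to the behaviour that caused the problem.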
Timeline
All times in BST unless indicated
Oct 16, 2023: 15:02 - Notifications service functionality added to respond to FormR updates.
Oct 19, 2023: 18:11 - Survey response mentions duplicate emails.
Oct 19, 2023: 19:30 - Survey response noted by TSS team.
Oct 20, 2023: 08:45 - Fix applied to affected queues (form-updated.fifo and credential-revoked.fifo)
Root Cause(s)
Duplicate emails were sent by the notifications service.
Form update messages were being processed more than once.
The message visibility window on the form-updated queue was set to the default of 30 seconds, but the notification service's periodic caching of user account details takes longer than that.
Messages were being requeued when their visibility window expired, meaning that they would be reprocessed.
Action Items
Action Items | Owner |
Lessons Learned
This issue only manifested on production, because the much larger user pool in that environment increased the caching time.
Users who were already cached did not require the cache to be rebuilt, so they did not experience the duplication, which is why it was not noticed in testing.
Better monitoring / sanity-checking of what emails are being sent might have highlighted the issue earlier.
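One possible shape for such a sanity check is sketched below: it scans a log of sent emails and flags recipients who received a burst of messages within a short window. The log format (recipient, timestamp) and the function name are hypothetical; the 15-minute window and the threshold of 3 mirror the figures in this report.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def find_duplicate_bursts(sent_log, window=timedelta(minutes=15), threshold=3):
    """Return {recipient: largest number of emails within any `window`}
    for recipients whose largest burst is at least `threshold`.

    `sent_log` is an iterable of (recipient, sent_at) tuples.
    """
    by_recipient = defaultdict(list)
    for recipient, sent_at in sent_log:
        by_recipient[recipient].append(sent_at)

    flagged = {}
    for recipient, times in by_recipient.items():
        times.sort()
        start = 0
        busiest = 0
        for end, sent_at in enumerate(times):
            # Advance the left edge until the span fits inside the window.
            while sent_at - times[start] > window:
                start += 1
            busiest = max(busiest, end - start + 1)
        if busiest >= threshold:
            flagged[recipient] = busiest
    return flagged


# Illustrative data only: "a" gets four emails in nine minutes, "b" gets two, hours apart.
base = datetime(2023, 10, 19, 18, 0)
log = [("a", base + timedelta(minutes=m)) for m in (0, 2, 5, 9)]
log += [("b", base), ("b", base + timedelta(hours=2))]
print(find_duplicate_bursts(log))
```

A threshold of 3 avoids flagging doctors who legitimately submit both a Part A and a Part B close together.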
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213