Date	16 23 Oct 2023
Authors Reuben Roberts	Cai Willis Steven Howard
Status	Done
Summary	The notification service for Form R updates was sending duplicate emails to some doctors.
Impact	Over the course of 3 days, some doctors would have received a number of duplicate confirmation emails over a short period of time (~15min) when submitting their FormRs.

Table of Contents

Non-technical Description

When a FormR is submitted, an email is triggered notifying the PGDiT that their form has been received. Similar emails are sent if the LO un-submits or deletes the FormR. The queue that handles the list of messages that need to be sent was misconfigured, meaning that in some instances the same message was processed a number of times, resulting in duplicate emails to the PGDiT.

Trigger

The tis-trainee-notifications-prod-form-updated.fifo queue message visibility window was less than the time the tis-trainee-notifications service takes to cache user accounts. As such, the queue would assume the message had failed to be processed, and make it available to be picked up by the service again. This could happen up to 10 times (until the message would be sent to the dead-letter queue, though this was not observed to occur).

Detection

A survey response complained of duplicate messages.
At most 122 doctors who received more than one email were affected (some legitimately submit both a FormR Part A and a Part B within a fairly short period of time, triggering two confirmations). 25 doctors were more definitely affected, having received 3 or more emails. The largest number of emails received by an individual doctor was 10, which occurred to two PGDiTs.

Resolution

Queue message visibility window was increased.

Jira Legacy

server	System JIRA
serverId	4c843cd5-e5a9-329d-ae88-66091fcfe3c7
key	TIS21-5261

Impact

Doctors were not updated and reval was not available for some regions for the whole day. The updates happened sequentially, as below.
1-1RSG4X0 Yorkshire and The Humber
1-1RSSPZ7 East Midlands
1-1RSSQ05, East of England
1-1RSSQ1B North East
1-1RSSQ2H North West 1200 confirmed docs available
1-1RSSQ5L South London
1-1RSSQ6R Thames Valley
1-1RUZUSF Wessex
1-1RUZUVB South West 1307 reported docs still missing
1-1RUZUYF West Midlands
1-1RUZV1D Kent, Surrey and Sussex
1-1RUZV4H North Central East London
1-1RUZV6H North West London

Table of Contents

Non-technical Description

The overnight job to synchronize connected doctors from the GMC into our system normally runs at midnight until the early morning. Today it did not run until about 10am, which means the various doctor lists in the revalidation app were not updated.

This occurred because our messaging system failed, so the relevant part of our system did not get the message to start the synchronization until far later than it should have

...

Trigger

A large volume of messages in RabbitMQ caused an our of memory error, preventing publishers from posting messages to their queues.

Detection

Some DLQ monitoring alerts were flagged in the morning, and a developer spotted the lack of doctors on the revalidation application

...

Resolution

The sync process had already started up again on its own, so there was no need to actively change anything, although one obsolete queue was purged of a high volume of messages and one service had to be restarted as it had lost connection with RabbitMQ

...

Timeline

All times in BST unless indicated

16 22 Oct 2023: 15:02 - Notifications service functionality added to respond to FormR updates.
19 Oct 2023: 18:11 - Survey response mentions duplicate emails.
19 Oct 2023: 19:30 - Survey response noted by TSS team.
20 Oct 2023: 08:45 - Fix applied to affected queues (form-updated.fifo and credential-revoked.fifo23:56 Restarting Consumer logs appear from recommendation service
22 Oct 2023: 23:56 Suspended all listeners and will no longer accept client connections logs appear from AWS RabbitMQ broker
23 Oct 2023: 00:01memory resource limit alarm set on node and Publishers will be blocked until this alarm clears logs appear from AWS RabbitMQ broker
23 Oct 2023: 00:05 Start message has been sent to start gmc sync log appears in recommendation service, but this message is not delivered until much later
23 Oct 2023: 09:44-ish Messages shovelled from ESR DLQ, possibly freeing up a little space?
23 Oct 2023: 09:52 GMC Client Service receives a message to start the doctor sync, and the sync begins.4
23 Oct 2023: 11:00-ish reval.queue.gmcsync.connection purged of 1.6 million messages
24 Oct 2023: Functionality restored (overnight sync job ran successfully)

Root Cause(s)

...

Duplicate emails were sent by the notifications service.

...

Form update messages were being processed more than once.

...

The message visibility window on the form-updated queue was set to the default of 30 seconds, but the notification service takes longer than that to periodically cache user account details.

...

5 Whys

Why didn’t the GMC Overnight Sync Job Start on time? - Because RabbitMQ ran out of memory
Why did RabbitMQ run out of memory? - Because of millions of messages in the following queues: esr.queue.audit.neo, tis.queue.cdc.created, reval.queue.gmcsync.connection (is this the real reason?)
Why were there millions of messages in these queues? - NeoAudit containers have a habit of dropping connection to RabbitMQ, CDC rabbit router was unable to consume/publish messages from the queue
Why do the containers drop connection to RabbitMQ? - This requires an investigation, it’s a not uncommon issue where RabbitMQ and ECS drop connections without an obvious cause

Causes

RabbitMQ ran out of memory
esr.queue.audit.neo and tis.queue.cdc.created had millions of messages (recorded at about 10:30 am) (Active queues)
reval.queue.gmcsync.connection had 1.6 million messages (Obsolete unused queue)

Action Items

Action Items	Owner

Lessons Learned

...

This issue only manifested on production, because of the much larger user pool in that environment which affected the caching time.

...

Because cached users would not require the cache to be rebuilt, they would not experience the duplication, so it was not noticed in testing.

...

Unbind and delete the queue reval.queue.gmcsync.connection (It’s not currently used by any application), also remove references to reval.queue.masterdoctorview.updated.connection in init.json

Cai Willis

DONE

Investigate why NeoAudit lost connection to RabbitMQ (Ticket)

Cai Willis

Ticket Created:

Jira Legacy

server	System JIRA
serverId	4c843cd5-e5a9-329d-ae88-66091fcfe3c7
key	TIS21-5282

Set up alerts for RabbitMQ low memory (Ticket)

Cai Willis

Ticket Created:

Jira Legacy

server	System JIRA
serverId	4c843cd5-e5a9-329d-ae88-66091fcfe3c7
key	TIS21-5283

(Nice to have) Broader RabbitMQ health check alerting

...

Lessons Learned

Don’t leave obsolete queues lying around - especially when they’re still being published to
There’s no such thing as too much alerting

Version	Old Version 1	New Version Current
Changes made by	catherine.odukale (Unlicensed)	Cai Willis
Saved on	23 Oct 2023	31 Oct 2023

Versions Compared

Key

Non-technical Description

Trigger

Detection

Resolution

Non-technical Description

Trigger

Detection

Resolution

Timeline

Root Cause(s)

5 Whys

Causes

Action Items

Lessons Learned

Lessons Learned

Page Comparison

Versions Compared

Key

Non-technical Description

Trigger

Detection

Resolution

Non-technical Description

Trigger

Detection

Resolution

Timeline

Root Cause(s)

5 Whys

Causes

Action Items

Lessons Learned

Lessons Learned