Done

Date	05 Apr 2024
Authors	Cai Willis, Yafang Deng, Jayanta Saha, Steven Howard
Status	Documenting
Summary	The system is really slow; when users try to select records, they sometimes do not open, and an error has been appearing for a few days.
Impact	Users were unable to retrieve information from the Recommendation and Connection detail lists because the system was slow.

Non-technical Description

The message broker is a service that helps to connect systems without needing them all to be 100% available and responding instantly all of the time.

The broker has several coordinated computers (nodes) to ensure it functions properly. In the weekly maintenance period, there were so many “in-flight” messages that when one of the nodes was replaced and needed to catch up with the other two, it used too much of the memory of the two remaining nodes. That placed the broker into a protective “quarantine” mode, allowing messages to be read but restricting the publishing of messages.

We use the message broker for most of how we process information to and from ESR, also for some of how we pull information from the GMC and for recording when Conditions of Joining are signed by trainees.

...

Trigger

Automated maintenance with large queue

Detection

Alerts in monitoring channel

Resolution

Stopped the synchronisation across nodes in the cluster
Horizontally scaled the consumers of queues with lots of messages
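
The scaling step above can be illustrated with a rough sizing calculation. This is only a sketch: the function name, the per-consumer drain rate and the target drain time are assumptions for illustration, not values taken from this incident or from our scaling scripts.

```python
import math


def desired_consumers(queue_depth: int,
                      drain_rate_per_consumer: float,
                      target_drain_seconds: float,
                      min_consumers: int = 1,
                      max_consumers: int = 10) -> int:
    """Estimate how many consumers are needed to drain a queue in the target time.

    All parameters are hypothetical inputs; real drain rates would have to be
    measured per queue.
    """
    if drain_rate_per_consumer <= 0 or target_drain_seconds <= 0:
        raise ValueError("drain rate and target time must be positive")
    # Messages one consumer can process within the target window.
    capacity_per_consumer = drain_rate_per_consumer * target_drain_seconds
    needed = math.ceil(queue_depth / capacity_per_consumer)
    # Clamp so we never scale to zero or beyond what the platform allows.
    return max(min_consumers, min(max_consumers, needed))
```

For example, under the assumed figures of 500 messages per second per consumer and a 10 minute target, a backlog of 1.5M messages would call for 5 consumers.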

Timeline

...

11 Feb 2024 22:00-23:59 - Message Broker maintenance window

...

11 Feb 2024 22:50 - Broker nodes begin being restarted

...

11 Feb 2024 22:56 - One of the secondary nodes becomes unreachable from the primary

...

11 Feb 2024 23:28 - Primary and 1 secondary enter a “Memory Alarm” state, although memory reported for the broker did not reflect this

...

12 Feb 2024 08:41 - First of 3 failed attempts of a trainee to digitally sign conditions of joining

...

12 Feb 2024 08:45 - Picked up alert and began identifying the issue. We paused some of the feeds in order to free up some

...

12 Feb 2024 12:11 - Cancelled Sync of largest queue

...

12 Feb 2024 12:13 - Other queue finished synchronising

...

The Revalidation system is really slow: when users try to select records, they sometimes do not open and an “oops something went wrong” error message occurs. This has been happening for a few days. A user in a different region states it is clearly a speed / data retrieval issue. Three users from three different regions (North West, South East and South West) raised this issue via the UR Teams Channel.

...

Trigger

The core service was failing, but the cause was unclear.

...

Detection

User raised the issue on Teams

...

Resolution

We restarted the core service and all appeared well soon after.

...

Timeline

05 Apr 2024 10:04 - Message received from a user reporting that the application is slow, followed by error messages
08 Apr 2024 10:06 - First responder advised that the team had worked on this and asked users to report whether performance was still impacted
08 Apr 2024 15:22 - Another user from a different region reported still having the same issue
08 Apr 2024 15:25 - User raised “clearly a speed / data retrieval issue - we are having this problem also, even using a direct URL is incredibly slow”
10 Apr 2024 11:00 - Further investigation was commenced to establish the cause of the issue
10 Apr 2024 12:00 - We restarted the Core service in Production
10 Apr 2024 12:34 - No responses or timeout since

5 Whys (or other analysis of Root Cause)

We received “connection timeout” alerts because the message broker was unavailable.
The broker was in a “Memory Quarantine State”
A node had been started and needed to synchronise over 1.5M messages
Message broker underwent maintenance (restarts/replacement) with too many messages queued up
2 microservices with audit queues had stopped consuming messages.
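
The chain above suggests that a check on queue depth before the maintenance window could have flagged the risk. A minimal sketch of such a check follows; the function name, the queue names in the usage note and the depth limit are hypothetical, and in practice the depths would come from the broker's monitoring rather than a hard-coded dictionary.

```python
def queues_blocking_maintenance(queue_depths: dict[str, int],
                                max_depth: int) -> list[str]:
    """Return names of queues whose depth exceeds max_depth, largest first.

    An empty list means no queue is over the (assumed) safe limit.
    """
    over_limit = [(name, depth) for name, depth in queue_depths.items()
                  if depth > max_depth]
    # Largest backlog first, so the worst offender is dealt with first.
    over_limit.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in over_limit]
```

With illustrative depths such as `{"audit.a": 1_500_000, "audit.b": 200_000, "main": 40}` and a limit of 100,000, the check would return `["audit.a", "audit.b"]`.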

Q. Why were users unable to load doctor details reliably?
A. The application was intermittently slow/timing out when clicking on individual doctors on the list page

Q. Why was the application slow/timing out?
A. There was a (~2 min) delay between the request for the doctor details and the request to retrieve doctor notes (resulting in timeout errors reported in X-ray(?))

Q. Why was there a (~2 min) delay between the request for the doctor details and the request to retrieve doctor notes?

...

Action Items

Action Items	Owner	Comments
Simulate the failure / test pre-emptive alerting	Joseph (Pepe) Kelly	Mini-hack PoC prep
Rearchitect: Splitting the messaging to limit the impact of spread? SQS instead of RabbitMQ; Make neoaudit containers: re-establish connection, prefetch / batch consuming; PoC to show how flow control would be implemented	Mini-hack PoCs & improvements to python scripts: e.g. time-based scaling of audit consumers (based on when messages are usually published)
Alert on a certain queue size
To investigate why core service action was failing	Cai Willis
Monitoring alarm to be set up	Cai Willis
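
As a starting point for the queue-size alert and monitoring alarm listed above, the sketch below fires only when the depth stays above a threshold for several consecutive samples, which avoids alerting on short spikes. The function name, threshold and sample counts are assumptions for illustration, not an agreed alarm definition.

```python
def sustained_breach(samples: list[int], threshold: int, consecutive: int) -> bool:
    """Return True if depth exceeds threshold for `consecutive` samples in a row.

    `samples` is a hypothetical time-ordered list of queue depth readings.
    """
    if consecutive < 1:
        raise ValueError("consecutive must be at least 1")
    run = 0
    for depth in samples:
        # Extend the current run on a breach, otherwise reset it.
        run = run + 1 if depth > threshold else 0
        if run >= consecutive:
            return True
    return False
```

For instance, `sustained_breach([10, 200, 300, 50, 400], threshold=100, consecutive=2)` is true because two adjacent readings exceed the threshold, whereas isolated spikes separated by low readings would not trigger it.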

...

Version	Old Version 1	New Version Current
Changes made by	catherine.odukale (Unlicensed)	Steven Howard
Saved on	10 Apr 2024	15 Apr 2024
