2024-06-17 RabbitMQ ran out of memory

Date

Jun 17, 2024

Authors

@Cai Willis

Status

https://hee-tis.atlassian.net/browse/TIS21-6217

Summary

RabbitMQ ran out of memory, affecting TIS services (particularly Reval)

Impact

Any data updated in GMC the day before has not been, and will not be, reflected in our system until the sync is restored

Non-technical Description

The system we use to send messages between our services ran out of memory. This led to a delay in updates across a number of TIS services, in particular Reval.


Trigger

RabbitMQ ran out of memory


Detection

Inspection of lastUpdatedDate column by developer
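
The manual check described above could, in principle, be expressed as a simple staleness test. The following is a minimal sketch only: the function name `is_stale`, the 24-hour threshold, and the example timestamps are all assumptions for illustration, not values taken from the incident.

```python
from datetime import datetime, timedelta


def is_stale(last_updated: datetime, now: datetime,
             max_age: timedelta = timedelta(hours=24)) -> bool:
    """Return True if the record was last updated more than max_age ago.

    The 24-hour default is an assumed threshold for a nightly sync.
    """
    return now - last_updated > max_age


# Illustrative timestamps only (not taken from the real lastUpdatedDate column).
last_updated = datetime(2024, 6, 15, 23, 0)
checked_at = datetime(2024, 6, 17, 9, 41)
print(is_stale(last_updated, checked_at))
```

A check along these lines would only flag that the data is old; it would not say why the sync did not run.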


Resolution

Restarted Neo4j Consumers and managed syncing of queues until resources were cleared


Timeline

  • Jun 16, 2024 21:23 Recommendations service reports failure to connect to RabbitMQ

  • Jun 17, 2024 09:41 Issue identified on revalidation by developer

  • Jun 17, 2024 ~10:00 Restarted TCS - recreated queues

  • Jun 17, 2024 10:42 Revalidation doctors list restored, users notified

5 Whys (or other analysis of Root Cause)

Why was the doctors list not updated? Because the overnight sync failed to run.
Why did the overnight sync fail to run? Because the message to start the sync job was “stuck”.
Why was the message to start the sync job stuck? Because there was not enough available memory in the RabbitMQ cluster, so no new messages were being consumed.
Why was there not enough memory in the RabbitMQ cluster? Because the neo-audit queue does not get processed quickly enough (1,000,000+ messages in the broker at all times). Other reasons are still to be confirmed.
Why does the neo-audit queue not get processed quickly enough? Because it is running on old infrastructure and only has a couple of very slow instances consuming it.
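
The backlog described above (a queue holding 1,000,000+ messages) is the kind of condition that can be spotted from per-queue statistics. Below is a minimal sketch, assuming queue statistics shaped like the entries returned by the RabbitMQ management API's `/api/queues` endpoint (each with a `name` and a `messages` count). The function name, the 100,000 threshold, the sample numbers, and the `reval.sync.start` queue name are hypothetical; only `neo-audit` comes from this report.

```python
from typing import Any, Iterable, List, Mapping, Tuple


def find_backlogged_queues(queues: Iterable[Mapping[str, Any]],
                           max_messages: int = 100_000) -> List[Tuple[str, int]]:
    """Return (queue name, message count) pairs above max_messages, largest first.

    max_messages is an assumed alert threshold, not a value from the incident.
    """
    backlog = [
        (q["name"], q.get("messages", 0))
        for q in queues
        if q.get("messages", 0) > max_messages
    ]
    return sorted(backlog, key=lambda item: item[1], reverse=True)


# Made-up sample data; "reval.sync.start" is a hypothetical queue name.
sample = [
    {"name": "neo-audit", "messages": 1_200_000},
    {"name": "reval.sync.start", "messages": 1},
]
print(find_backlogged_queues(sample))  # [('neo-audit', 1200000)]
```

In practice the input would come from the broker's management interface rather than a hard-coded list, and the threshold would need tuning against the memory actually available to the cluster.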


Action Items

  • Action item: Complete this ticket. Owner: @Cai Willis. Comments: none.

Lessons Learned

  • Tech improvement tickets are important