2023-09-01 ESR processing affected by TIS unavailability

Date

Sep 4, 2023

Authors

@Yafang Deng @Jayanta Saha @Steven Howard

Status

Done

Summary

 

Impact

Applicants and notifications were exported a few hours later than usual

Non-technical Description

A significant number of errors (573) occurred when we received files from ESR on Friday afternoon. They were caused by an issue with the service that checks login details. We processed the files again to ensure all information had been received.

Trigger

  •  


Detection

We were notified by an alert in the #monitoring-prod Slack channel. By then, 573 messages had accumulated in the DLQ, and we shovelled them to a separate queue.


Resolution

  • Processed the files a second time.


Timeline

BST unless otherwise stated

  • Sep 1, 2023 14:58 Noticed the alert “RabbitMQ PROD too many messages in the DLQ” in the #monitoring-prod channel, then shovelled the messages to another queue, “esr.dlq.2023.09.01”

  • Sep 4, 2023 10:32 Had a huddle and looked into the messages in the queue “esr.dlq.2023.09.01”

  • Sep 4, 2023 12:01 Processed the files from Sep 1 (moved the messages into the normal flow for processing)
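
The two shovel steps in the timeline (parking the dead-lettered messages in a dated queue, then replaying them into the normal flow) can be sketched as RabbitMQ dynamic shovel definitions. This is only an illustrative sketch, not the procedure that was actually followed: the source queue name `esr.dlq`, the destination queue name `esr.queue`, and the `amqp://` URI are assumptions, and the helper functions are hypothetical. Only the dated queue name “esr.dlq.2023.09.01” comes from this incident.

```python
import json
from datetime import date


def dated_holding_queue(day: date, prefix: str = "esr.dlq") -> str:
    """Build a dated holding queue name such as 'esr.dlq.2023.09.01'."""
    return f"{prefix}.{day:%Y.%m.%d}"


def shovel_definition(src_queue: str, dest_queue: str, uri: str = "amqp://") -> dict:
    """Build a dynamic shovel definition that drains src_queue into dest_queue.

    'src-delete-after': 'queue-length' asks the shovel to remove itself once it
    has moved the messages that were in the source queue when it started.
    """
    return {
        "src-protocol": "amqp091",
        "src-uri": uri,
        "src-queue": src_queue,
        "dest-protocol": "amqp091",
        "dest-uri": uri,
        "dest-queue": dest_queue,
        "src-delete-after": "queue-length",
    }


# Step 1 (Sep 1): park the dead-lettered messages in a dated queue.
# "esr.dlq" is an assumed name for the dead letter queue.
holding = dated_holding_queue(date(2023, 9, 1))
park = shovel_definition("esr.dlq", holding)

# Step 2 (Sep 4): replay the parked messages into the normal flow.
# "esr.queue" is an assumed name for the normal processing queue.
replay = shovel_definition(holding, "esr.queue")

print(json.dumps(park, indent=2))
print(json.dumps(replay, indent=2))
```

A definition like this could be supplied as the value of a `shovel` parameter, for example through `rabbitmqctl set_parameter shovel <name> '<json>'` or the management UI, provided the shovel plugin is enabled.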

Root Cause(s)

 

  • Messages of several different types ended up in the dead letter queue. The ones that failed appear to be those that relied on enrichment with information from TCS.

  • TCS was busy but still functional. There were timeouts when TCS requested information about the user making the request.

  • Profile experienced a short spike in CPU usage.

 


Action Items

Action Items

Owner

 


Lessons Learned

  •