2024-03-13 ESR processing affected by TIS unavailability

Date

Mar 13, 2024

Authors

@Steven Howard @Yafang Deng @Jayanta Saha

Status

Done

Summary

https://hee-tis.atlassian.net/browse/TIS21-5828

Impact

Applicant and notification records were exported to ESR a few hours later than usual

Non-technical Description

A significant number of errors occurred when we received files from ESR on Wednesday afternoon. We processed the files again to ensure all of the information had been received before re-enabling the processes that generate the applicant and notification information sent to ESR.

Similar previous incidents

2023-09-01 ESR processing affected by TIS unavailability

2023-08-10 ESR processing affected by TIS unavailability

Trigger

  • Slack alerting
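The alert itself comes from the production monitoring stack, which is outside the scope of this report. Purely as an illustration of the kind of check behind such an alert, the sketch below polls the RabbitMQ management API for a queue's message count and posts to a Slack incoming webhook when the count exceeds a threshold. The host, credentials, queue name, webhook URL and threshold are all placeholders, not the real monitoring configuration.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative DLQ depth check; all names, URLs and thresholds are placeholders. */
public class DlqDepthCheck {

  public static void main(String[] args) throws Exception {
    HttpClient client = HttpClient.newHttpClient();

    // RabbitMQ management API: GET /api/queues/{vhost}/{queue} returns queue stats as JSON.
    String mgmtUrl = "https://rabbitmq.example.com/api/queues/%2F/esr.dlq";
    String auth = Base64.getEncoder().encodeToString("monitor:secret".getBytes());
    HttpRequest queueRequest = HttpRequest.newBuilder(URI.create(mgmtUrl))
        .header("Authorization", "Basic " + auth)
        .GET()
        .build();
    String body = client.send(queueRequest, HttpResponse.BodyHandlers.ofString()).body();

    // Crude extraction of the "messages" count to avoid a JSON dependency in this sketch.
    Matcher m = Pattern.compile("\"messages\"\\s*:\\s*(\\d+)").matcher(body);
    int messages = m.find() ? Integer.parseInt(m.group(1)) : 0;

    int threshold = 100; // placeholder threshold
    if (messages > threshold) {
      // Slack incoming webhook: POST a JSON payload with a "text" field.
      String alert = String.format(
          "{\"text\":\"RabbitMQ PROD too many messages in the DLQ: %d\"}", messages);
      HttpRequest slackRequest = HttpRequest
          .newBuilder(URI.create("https://hooks.slack.com/services/T000/B000/XXXX"))
          .header("Content-Type", "application/json")
          .POST(HttpRequest.BodyPublishers.ofString(alert))
          .build();
      client.send(slackRequest, HttpResponse.BodyHandlers.ofString());
    }
  }
}
```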


Detection

There were 2358 messages in the DLQ by the time we got the notification in the #monitoring-prod Slack channel and shovelled them to a holding queue.
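The shovelling was done with RabbitMQ's own tooling. As a minimal sketch of the pattern only, the following drains a DLQ into a dated holding queue using the RabbitMQ Java client, preserving message properties; the host and queue names are placeholders, and a production shovel would normally use the shovel plugin or publisher confirms rather than this simple loop.

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.GetResponse;

/** Illustrative shovel: drain a DLQ into a dated holding queue. Placeholders throughout. */
public class DlqShovel {

  public static void main(String[] args) throws Exception {
    ConnectionFactory factory = new ConnectionFactory();
    factory.setHost("rabbitmq.example.com"); // placeholder host

    try (Connection connection = factory.newConnection();
         Channel channel = connection.createChannel()) {

      String source = "esr.dlq";              // placeholder DLQ name
      String holding = "esr.dlq.2024.03.13";  // dated holding queue, as in the timeline

      // Durable holding queue so the held messages survive a broker restart.
      channel.queueDeclare(holding, true, false, false, null);

      GetResponse response;
      while ((response = channel.basicGet(source, false)) != null) {
        // Publish to the holding queue via the default exchange, keeping properties intact.
        channel.basicPublish("", holding, response.getProps(), response.getBody());
        // Only acknowledge (remove) the original once the copy has been published.
        channel.basicAck(response.getEnvelope().getDeliveryTag(), false);
      }
    }
  }
}
```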


Resolution

  • Processed the files a second time.
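Re-processing meant moving the held messages back into the normal flow, per the 17:00 timeline entry. As a hedged sketch only: dead-lettered messages carry an x-death header recording the queue they originally came from, which can be used to route each held message back. The broker details are placeholders, and the sketch assumes every held message still carries its x-death header.

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.GetResponse;
import java.util.List;
import java.util.Map;

/** Illustrative re-drive: return held messages to the queue they were dead-lettered from. */
public class DlqRedrive {

  public static void main(String[] args) throws Exception {
    ConnectionFactory factory = new ConnectionFactory();
    factory.setHost("rabbitmq.example.com"); // placeholder host

    try (Connection connection = factory.newConnection();
         Channel channel = connection.createChannel()) {

      String holding = "esr.dlq.2024.03.13"; // holding queue from the timeline

      GetResponse response;
      while ((response = channel.basicGet(holding, false)) != null) {
        // Dead-lettered messages carry an "x-death" header; the first entry records
        // the queue the message was originally rejected or expired from.
        Map<String, Object> headers = response.getProps().getHeaders();
        @SuppressWarnings("unchecked")
        List<Map<String, Object>> deaths = (List<Map<String, Object>>) headers.get("x-death");
        String originalQueue = deaths.get(0).get("queue").toString();

        // Publish back to the original queue via the default exchange, then acknowledge.
        channel.basicPublish("", originalQueue, response.getProps(), response.getBody());
        channel.basicAck(response.getEnvelope().getDeliveryTag(), false);
      }
    }
  }
}
```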


Timeline

BST unless otherwise stated

  • Mar 13, 2024 14:45 Noticed the alert “RabbitMQ PROD too many messages in the DLQ” in the #monitoring-prod channel, then shovelled the messages to a new queue, “esr.dlq.2024.03.13”

  • Mar 13, 2024 14:50 Messages were still arriving in the DLQ: initially an additional 157, followed by a further 589

  • Mar 13, 2024 15:15 Noticed errors in the sentry-esr channel: ResourceAccessException: I/O error on GET request

  • Mar 13, 2024 15:33 Had a huddle and looked into the messages in the “esr.dlq.2024.03.13” queue

  • Mar 13, 2024 15:45 Observed that one of the TCS nodes had deregistered. Further investigation showed TCS had spiked to 100% CPU and memory usage (see the attached graph)

  • Mar 13, 2024 16:00 Agreed to wait until 17:00 and observe the DLQ to check whether any additional messages arrived

  • Mar 13, 2024 17:00 Processed the RMC files from 13 March (moved the messages back into the normal flow for processing)

 

[Attachment: image-20240313-171354.png (TCS CPU and memory usage graph)]

 

Root Cause(s)

 

  • The messages in the dead letter queue were of a variety of types. Those which relied on enrichment with information from TCS appear to be the ones that failed.

  • TCS was busy but still functional. There were timeouts when TCS requested information about the user making the request (a timeout-configuration sketch follows this list).

  • The Profile service was experiencing a short spike in CPU usage.
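The Sentry errors were Spring ResourceAccessExceptions, which RestTemplate raises when an underlying request fails at the I/O level, including connect and read timeouts. For illustration only, the sketch below shows how such a client's timeouts are typically bounded with RestTemplateBuilder; the durations, bean name and class name are illustrative and not the service's actual configuration.

```java
import java.time.Duration;
import org.springframework.boot.web.client.RestTemplateBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.web.client.RestTemplate;

/** Illustrative timeout configuration; the durations are placeholders, not production values. */
@Configuration
public class EnrichmentClientConfig {

  @Bean
  public RestTemplate enrichmentRestTemplate(RestTemplateBuilder builder) {
    // Bound how long a slow downstream service can hold up an enrichment request.
    // When either limit is exceeded, RestTemplate surfaces a ResourceAccessException,
    // which is the error class seen in the Sentry alerts.
    return builder
        .setConnectTimeout(Duration.ofSeconds(2))
        .setReadTimeout(Duration.ofSeconds(10))
        .build();
  }
}
```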

 


Action Items


Lessons Learned

  •