Date

13 Mar 2024

Authors

Steven Howard, Yafang Deng, Jayanta Saha

Status

Done

Summary

https://hee-tis.atlassian.net/browse/TIS21-5828

Impact

Applicant and notification files were exported to ESR a few hours later than usual.

Non-technical Description

A significant number of errors occurred when we received files from ESR on Wednesday afternoon. We processed the files again to ensure all of the information had been received before re-enabling the processes that generate the applicant and notification information sent to ESR.

Similar previous incidences

2023-08-10 ESR processing affected by TIS unavailability

https://hee-tis.atlassian.net/wiki/spaces/NTCS/pages/3824386058/2023-08-10+ESR+processing+affected+by+TIS+unavailability?atlOrigin=eyJpIjoiMTE4MDRlNjg0ZGUyNGZiZWI2OWJmNzE2N2RjYTU5NjciLCJwIjoiY29uZmx1ZW5jZS1jaGF0cy1pbnQifQ

Trigger

  • Slack alerting


Detection

There were already 2,358 messages in the dead letter queue (DLQ) by the time the alert appeared in the #monitoring-prod Slack channel; we then shovelled them to a holding queue.
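
In this context, "shovelling" means draining the messages from the DLQ and republishing them onto a holding queue so they are preserved while the underlying problem is investigated. Below is a minimal sketch using the RabbitMQ Java client; the broker host, the source queue name "esr.dlq" and the class name are placeholder assumptions, not the production configuration.

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.GetResponse;

/**
 * Drains messages from a source queue and republishes them onto a holding
 * queue, acknowledging each original message only after its copy is published.
 */
public class DlqShovel {

    public static void main(String[] args) throws Exception {
        String sourceQueue = "esr.dlq";               // hypothetical source queue name
        String holdingQueue = "esr.dlq.2024.03.13";   // holding queue used on the day

        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");                 // placeholder broker host

        try (Connection connection = factory.newConnection();
             Channel channel = connection.createChannel()) {

            // Ensure the holding queue exists (durable, non-exclusive, non-auto-delete).
            channel.queueDeclare(holdingQueue, true, false, false, null);

            GetResponse response;
            int moved = 0;
            // basicGet returns null once the source queue is empty.
            while ((response = channel.basicGet(sourceQueue, false)) != null) {
                channel.basicPublish("", holdingQueue,
                        response.getProps(), response.getBody());
                channel.basicAck(response.getEnvelope().getDeliveryTag(), false);
                moved++;
            }
            System.out.println("Moved " + moved + " messages to " + holdingQueue);
        }
    }
}

The same approach, with source and destination swapped, moves the messages back into the normal flow once the root cause has been addressed.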


Resolution

  • Processed the files a second time.


Timeline

Times are UK local time (GMT) unless otherwise stated

  • 14:45 Noticed the alert “RabbitMQ PROD too many messages in the DLQ” in the #monitoring-prod channel, then shovelled the messages to another queue, “esr.dlq.2024.03.13”.

  • 14:50 Messages were still arriving on the DLQ: initially an additional 157, followed by a further 589.

  • 15:15 Noticed errors in the sentry-esr channel: ResourceAccessException: I/O error on GET request.

  • 15:33 Had a huddle and looked into the messages in the queue “esr.dlq.2024.03.13”.

  • 15:45 Observed that one of the TCS nodes had deregistered. Further investigation showed TCS had spiked to 100% CPU and memory usage (see the attached image).

  • 16:00 Agreed to wait until 17:00 and observe the DLQ to check whether any additional messages arrived (see the depth-check sketch after this timeline).

  • 17:00 Processed the RMC files from 13th March (moved the messages back into the normal flow for processing).
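
The 16:00 wait-and-observe step amounts to periodically checking the DLQ depth. A minimal sketch of that check with the RabbitMQ Java client follows; the queue name, polling interval, class name and connection details are illustrative assumptions.

import com.rabbitmq.client.AMQP;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

/**
 * Polls a queue's message count so we can see whether new messages are
 * still arriving before the files are reprocessed.
 */
public class DlqDepthCheck {

    public static void main(String[] args) throws Exception {
        String dlq = "esr.dlq";            // hypothetical DLQ name

        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");      // placeholder broker host

        try (Connection connection = factory.newConnection();
             Channel channel = connection.createChannel()) {

            for (int i = 0; i < 6; i++) {
                // Passive declare fails if the queue is missing and reports its current depth.
                AMQP.Queue.DeclareOk ok = channel.queueDeclarePassive(dlq);
                System.out.printf("%s currently holds %d messages%n", dlq, ok.getMessageCount());
                Thread.sleep(10 * 60 * 1000L);   // wait ten minutes between checks
            }
        }
    }
}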

(Attached image image-20240313-171354.png: TCS CPU and memory usage spike)

Root Cause(s)

  • The dead letter queue contained a variety of message types. The messages that failed appear to be those that relied on enrichment with information from TCS.

  • TCS was busy but still functional. There were timeouts when TCS requested information about the user making the request (see the sketch after this list).

  • The Profile service experienced a short spike in CPU usage.
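
For context, ResourceAccessException is what Spring's RestTemplate throws when an I/O problem such as a read timeout occurs, which matches the errors seen in Sentry. The sketch below illustrates an enrichment-style call to TCS with bounded timeouts; the URL, class and method names, and timeout values are illustrative assumptions rather than the actual ESR integration code.

import org.springframework.http.client.SimpleClientHttpRequestFactory;
import org.springframework.web.client.ResourceAccessException;
import org.springframework.web.client.RestTemplate;

/**
 * Illustrates how an enrichment call to TCS with a bounded read timeout
 * surfaces as a ResourceAccessException when TCS is slow to respond.
 */
public class TcsEnrichmentClient {

    private final RestTemplate restTemplate;

    public TcsEnrichmentClient() {
        // Bounded timeouts so a busy TCS fails fast instead of blocking consumers indefinitely.
        SimpleClientHttpRequestFactory requestFactory = new SimpleClientHttpRequestFactory();
        requestFactory.setConnectTimeout(5_000);   // milliseconds
        requestFactory.setReadTimeout(10_000);
        this.restTemplate = new RestTemplate(requestFactory);
    }

    /** Returns the enrichment payload, or null when TCS cannot be reached in time. */
    public String fetchPlacementDetails(String placementId) {
        String url = "https://tcs.example.com/api/placements/" + placementId; // placeholder URL
        try {
            return restTemplate.getForObject(url, String.class);
        } catch (ResourceAccessException e) {
            // I/O error or read timeout: the consumer would reject the message,
            // which is how these messages ended up on the dead letter queue.
            return null;
        }
    }
}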


Action Items

Action Item: Create a ticket to investigate the short-circuit exception for the Profile service.

Owner: Jayanta Saha

Status: Ticket to be created


Lessons Learned
