Date

13 Mar 2024

Authors

Steven Howard, Yafang Deng, Jayanta Saha

Status

Done

Summary

https://hee-tis.atlassian.net/browse/TIS21-5828

Impact

Applicant and notification files were exported to ESR a few hours later than usual.

Non-technical Description

A significant number of errors occurred when we received files from ESR on Wednesday afternoon. We processed the files again to ensure all of the information had been received before re-enabling the processes that generate the applicant and notification information sent to ESR.

Similar previous incidences

2023-08-10 ESR processing affected by TIS unavailability

https://hee-tis.atlassian.net/wiki/spaces/NTCS/pages/3824386058/2023-08-10+ESR+processing+affected+by+TIS+unavailability?atlOrigin=eyJpIjoiMTE4MDRlNjg0ZGUyNGZiZWI2OWJmNzE2N2RjYTU5NjciLCJwIjoiY29uZmx1ZW5jZS1jaGF0cy1pbnQifQ

Trigger

  • Slack alerting


Detection

There were already 2,358 messages in the dead letter queue (DLQ) by the time the alert appeared in the #monitoring-prod Slack channel; we then shovelled them to a holding queue.
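
In this context, "shovelling" means draining the messages from the DLQ and republishing them onto a holding queue so they are preserved while the underlying problem is investigated. Below is a minimal sketch using the RabbitMQ Java client; the broker host, the source queue name "esr.dlq" and the class name are placeholder assumptions, not the production configuration.

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.GetResponse;

/**
 * Drains messages from a source queue and republishes them onto a holding
 * queue, acknowledging each original message only after its copy is published.
 */
public class DlqShovel {

    public static void main(String[] args) throws Exception {
        String sourceQueue = "esr.dlq";               // hypothetical source queue name
        String holdingQueue = "esr.dlq.2024.03.13";   // holding queue used on the day

        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");                 // placeholder broker host

        try (Connection connection = factory.newConnection();
             Channel channel = connection.createChannel()) {

            // Ensure the holding queue exists (durable, non-exclusive, non-auto-delete).
            channel.queueDeclare(holdingQueue, true, false, false, null);

            GetResponse response;
            int moved = 0;
            // basicGet returns null once the source queue is empty.
            while ((response = channel.basicGet(sourceQueue, false)) != null) {
                channel.basicPublish("", holdingQueue,
                        response.getProps(), response.getBody());
                channel.basicAck(response.getEnvelope().getDeliveryTag(), false);
                moved++;
            }
            System.out.println("Moved " + moved + " messages to " + holdingQueue);
        }
    }
}

The same approach, with source and destination swapped, moves the messages back into the normal flow once the root cause has been addressed.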


Resolution

  • Processed the files a second time.


Timeline

Times are UK local time (GMT) unless otherwise stated

  • 14:45 Noticed the alert “RabbitMQ PROD too many messages in the DLQ” in the #monitoring-prod channel, then shovelled the messages to another queue, “esr.dlq.2024.03.13”.

  • 14:50 Messages were still arriving on the DLQ: initially an additional 157, followed by a further 589.

  • 15:15 Noticed errors in the sentry-esr channel: ResourceAccessException: I/O error on GET request.

  • 15:33 Had a huddle and looked into the messages in the queue “esr.dlq.2024.03.13”.

  • 15:45 Observed that one of the TCS nodes had deregistered. Further investigation showed TCS had spiked to 100% CPU and memory usage (see the attached image).

  • 16:00 Agreed to wait until 17:00 and observe the DLQ to check whether any additional messages arrived (see the depth-check sketch after this timeline).

  • 17:00 Processed the RMC files from 13th March (moved the messages back into the normal flow for processing).
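
The 16:00 wait-and-observe step amounts to periodically checking the DLQ depth. A minimal sketch of that check with the RabbitMQ Java client follows; the queue name, polling interval, class name and connection details are illustrative assumptions.

import com.rabbitmq.client.AMQP;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

/**
 * Polls a queue's message count so we can see whether new messages are
 * still arriving before the files are reprocessed.
 */
public class DlqDepthCheck {

    public static void main(String[] args) throws Exception {
        String dlq = "esr.dlq";            // hypothetical DLQ name

        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");      // placeholder broker host

        try (Connection connection = factory.newConnection();
             Channel channel = connection.createChannel()) {

            for (int i = 0; i < 6; i++) {
                // Passive declare fails if the queue is missing and reports its current depth.
                AMQP.Queue.DeclareOk ok = channel.queueDeclarePassive(dlq);
                System.out.printf("%s currently holds %d messages%n", dlq, ok.getMessageCount());
                Thread.sleep(10 * 60 * 1000L);   // wait ten minutes between checks
            }
        }
    }
}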

(Attached image image-20240313-171354.png: TCS CPU and memory usage spike)

Root Cause(s)

  • The dead letter queue contained a variety of message types. The messages that failed appear to be those that relied on enrichment with information from TCS.

  • TCS was busy but still functional. There were timeouts when TCS requested information about the user making the request (see the sketch after this list).

  • The Profile service experienced a short spike in CPU usage.
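
For context, ResourceAccessException is what Spring's RestTemplate throws when an I/O problem such as a read timeout occurs, which matches the errors seen in Sentry. The sketch below illustrates an enrichment-style call to TCS with bounded timeouts; the URL, class and method names, and timeout values are illustrative assumptions rather than the actual ESR integration code.

import org.springframework.http.client.SimpleClientHttpRequestFactory;
import org.springframework.web.client.ResourceAccessException;
import org.springframework.web.client.RestTemplate;

/**
 * Illustrates how an enrichment call to TCS with a bounded read timeout
 * surfaces as a ResourceAccessException when TCS is slow to respond.
 */
public class TcsEnrichmentClient {

    private final RestTemplate restTemplate;

    public TcsEnrichmentClient() {
        // Bounded timeouts so a busy TCS fails fast instead of blocking consumers indefinitely.
        SimpleClientHttpRequestFactory requestFactory = new SimpleClientHttpRequestFactory();
        requestFactory.setConnectTimeout(5_000);   // milliseconds
        requestFactory.setReadTimeout(10_000);
        this.restTemplate = new RestTemplate(requestFactory);
    }

    /** Returns the enrichment payload, or null when TCS cannot be reached in time. */
    public String fetchPlacementDetails(String placementId) {
        String url = "https://tcs.example.com/api/placements/" + placementId; // placeholder URL
        try {
            return restTemplate.getForObject(url, String.class);
        } catch (ResourceAccessException e) {
            // I/O error or read timeout: the consumer would reject the message,
            // which is how these messages ended up on the dead letter queue.
            return null;
        }
    }
}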


Action Items

Action Item: Create a ticket to investigate the short-circuit exception for the Profile service.

Owner: Jayanta Saha

Status: Ticket to be created


Lessons Learned
