Date	10 Aug 2023
Authors	Joseph (Pepe) Kelly Yafang Deng Jayanta Saha
Status	Documenting
Summary
Impact	Applicants and notifications were exported a few hours later than usual

Non-technical Description

A significant number of errors occurred when we received files from ESR on Thursday afternoon. This was caused by an issue with the service that checks login details. We processed the files again to ensure all information had been received before re-enabling the processes that generate the applicant and notification information that goes to ESR.

Trigger

Slack alerting

...

Detection

By the time we received the notification in the #monitoring-prod Slack channel and shovelled the messages, 2274 messages had accumulated in the DLQ.

...

Resolution

Processed the files a second time.

...

Timeline

BST unless otherwise stated

10 Aug 2023 14:34 Noticed the alert “RabbitMQ PROD too many messages in the DLQ” in the #monitoring-prod channel, then shovelled the messages to another queue, “esr.dlq.2023.08.10”
11 Aug 2023 10:32 Had a huddle and looked into the messages in queue “esr.dlq.2023.08.10”
11 Aug 2023 ~14:32 (13:32 UTC) CPU for profile spiked
11 Aug 2023 ~14:30 After confirming that we have never generated two applicant records for any placement against the same position, manually triggered file processing from 10th Aug.
11 Aug 2023 17:45 Processed files from 11th Aug (moved messages into the normal flow for processing)

Root Cause(s)

The messages in the dead letter queue covered a variety of message types. The ones that failed appear to be those that relied on enrichment with information from TCS.
TCS was busy but still functional. There were timeouts when TCS requested information about the user making the request.
The profile service experienced a short spike in CPU usage.

...

Action Items

Action Item	Owner	Status
Extend logging in the profile service	Joseph (Pepe) Kelly	Done. There is a significant amount of additional logging, but we should be able to extract logs over a few days to identify relative use across tasks.
Reduce the demand from TCS by trusting cached responses for longer	Joseph (Pepe) Kelly	Done.
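
The caching action above can be illustrated with a minimal sketch. It assumes the cached responses are the per-user lookups that TCS makes, and that "trusting them for longer" means raising a time-to-live. `TtlCache`, `loader` and `ttl_seconds` are hypothetical names for illustration, not the actual TCS or profile code.

```python
import time
from typing import Callable, Dict, Generic, Hashable, Tuple, TypeVar

V = TypeVar("V")


class TtlCache(Generic[V]):
    """Reuses a loaded value per key until it is ttl_seconds old."""

    def __init__(
        self,
        loader: Callable[[Hashable], V],          # hypothetical remote lookup
        ttl_seconds: float,                       # how long a response is trusted
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[float, V]] = {}

    def get(self, key: Hashable) -> V:
        now = self._clock()
        entry = self._entries.get(key)
        # Serve the cached value while it is still younger than the TTL.
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        # Otherwise call the remote service once and remember the result.
        value = self._loader(key)
        self._entries[key] = (now, value)
        return value
```

With a longer `ttl_seconds`, repeated requests for the same user within that window make no further remote calls, at the cost of serving older user details for longer.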

...
