
Date

22

Authors

Yafang Deng, Jayanta Saha, John Simmons, James Harris, catherine.odukale

Status

Done

Summary

Jira: TIS21-4692, TIS21-4707

Impact

Updates to TIS and TSS with trainee and position information from ESR were delayed, and TIS was delayed in exporting outbound files to ESR. In addition, users saw only one trainee in the revalidation search lists for most of the day.


Non-technical Description

An issue occurred with an overnight task which meant users were only seeing one trainee in the revalidation search lists. The task was continuously restarting and retrying until we were able to give it additional resources to work with. It was then able to run and populate the search lists for recommendations and connections.

Trigger

  • As part of the overnight sync job, on calling the GMC’s SOAP endpoint GetDoctorsForDB, our gmc-client-service experienced an out-of-memory error and crashed.

  • The message that triggers the sync job remained queued, and presumably kept re-triggering the error every time ECS spun up a new task (see the sketch below this list).
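The restart loop described above is consistent with how RabbitMQ handles unacknowledged messages: if the consumer dies before acknowledging the sync-trigger message, the broker re-queues it and delivers it again to the next consumer that connects. A minimal sketch of that consume/acknowledge pattern, assuming the standard RabbitMQ Java client and a hypothetical queue name (the real queue name is not recorded here):

    import com.rabbitmq.client.Channel;
    import com.rabbitmq.client.Connection;
    import com.rabbitmq.client.ConnectionFactory;

    public class SyncTriggerConsumer {

        public static void main(String[] args) throws Exception {
            ConnectionFactory factory = new ConnectionFactory();
            factory.setHost("localhost"); // assumption: broker host

            Connection connection = factory.newConnection();
            Channel channel = connection.createChannel();

            // Hypothetical name for the queue carrying the sync-trigger message.
            String queue = "reval.queue.gmcsync.start";

            // autoAck=false: the message only leaves the queue once basicAck runs.
            // If the JVM dies (e.g. an OutOfMemoryError) before that point,
            // RabbitMQ re-queues the message and redelivers it to the next ECS
            // task that starts, which matches the crash/restart loop seen here.
            channel.basicConsume(queue, false, (consumerTag, delivery) -> {
                runGmcSync(); // placeholder for the work that ran out of memory
                channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false);
            }, consumerTag -> { });
        }

        private static void runGmcSync() {
            // placeholder
        }
    }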

Detection

  • Slack monitoring channel

    • Also reported & impact confirmed by users

Resolution

...

The TIS-ESR interface relies on certain files being sent and received at specific times. We observed an issue yesterday that stopped some of these files from being processed when they should have been. In order not to miss the next time frame for these files, we needed to ensure they were processed before the next day’s inbound files arrived.

The services that store information failed and a number of files were not processed. The built-in alerting notified the team and, after verifying the status of a number of failed individual transactions, we resolved the immediate problem and resent the instructions to process the files listed below (a sketch of the re-upload step follows the list).

in/DE_EMD_RMC_20230626_00003445.DAT
in/DE_EOE_RMC_20230626_00003645.DAT
in/DE_KSS_APC_20230625_00012753.DAT
in/DE_KSS_RMC_20230626_00003635.DAT
in/DE_LDN_APC_20230625_00012751.DAT
in/DE_LDN_APC_20230626_00012764.DAT
in/DE_LDN_DCC_20230626_00012719.DAT
in/DE_LDN_RMC_20230626_00004033.DAT
in/DE_MER_RMC_20230626_00003613.DAT
in/DE_NTH_APC_20230625_00012755.DAT
in/DE_NTH_RMC_20230626_00003858.DAT
in/DE_NWN_RMC_20230626_00003613.DAT
in/DE_OXF_APC_20230625_00012758.DAT
in/DE_OXF_RMC_20230626_00003606.DAT
in/DE_PEN_RMC_20230626_00001740.DAT
in/DE_SEV_RMC_20230626_00001488.DAT
in/DE_WES_APC_20230625_00012757.DAT
in/DE_WES_RMC_20230626_00003921.DAT
in/DE_WMD_RMC_20230626_00003753.DAT
in/DE_YHD_RMC_20230626_00003844.DAT
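
Reprocessing these files amounted to downloading each object and uploading it again under the same key so that the bucket notification fired a second time. A minimal sketch of that step, assuming the AWS SDK for Java v2 and a placeholder bucket name (the real bucket name is not recorded here):

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;

    import software.amazon.awssdk.core.sync.RequestBody;
    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.GetObjectRequest;
    import software.amazon.awssdk.services.s3.model.PutObjectRequest;

    public class ReprocessEsrFiles {

        public static void main(String[] args) throws Exception {
            String bucket = "esr-inbound-bucket"; // assumption: placeholder bucket name
            List<String> keys = List.of(
                    "in/DE_EMD_RMC_20230626_00003445.DAT",
                    "in/DE_EOE_RMC_20230626_00003645.DAT"
                    // ...and the rest of the keys listed above
            );

            try (S3Client s3 = S3Client.create()) {
                for (String key : keys) {
                    Path tmp = Files.createTempFile("esr-", ".DAT");
                    Files.delete(tmp); // getObject refuses to overwrite an existing file

                    // Download the original object...
                    s3.getObject(GetObjectRequest.builder().bucket(bucket).key(key).build(), tmp);

                    // ...and put it back under the same key, which raises a fresh
                    // object-created event and re-triggers the inbound processing.
                    s3.putObject(PutObjectRequest.builder().bucket(bucket).key(key).build(),
                            RequestBody.fromFile(tmp));

                    Files.delete(tmp);
                }
            }
        }
    }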

...

Trigger

The memory of one RabbitMQ node reached the high watermark and all incoming traffic (including ESR, Reval, TIS, etc.) was blocked.
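
When a node crosses its memory high watermark, RabbitMQ raises a resource alarm and blocks connections that publish, which is why every producer stalled at once. As a minimal sketch, assuming the RabbitMQ Java client, a publishing service can register a blocked-connection listener so the alarm shows up in its own logs rather than as a silent stall:

    import com.rabbitmq.client.Connection;
    import com.rabbitmq.client.ConnectionFactory;

    public class BlockedConnectionLogging {

        public static Connection openConnection(String host) throws Exception {
            ConnectionFactory factory = new ConnectionFactory();
            factory.setHost(host); // assumption: broker host supplied by the caller

            Connection connection = factory.newConnection();

            // RabbitMQ notifies publishers when a resource alarm (such as the
            // memory high watermark) blocks the connection, and again when it clears.
            connection.addBlockedListener(
                    reason -> System.err.println("RabbitMQ blocked publishing: " + reason),
                    () -> System.out.println("RabbitMQ unblocked publishing"));

            return connection;
        }
    }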

...

Detection

We got an alert on the ESR-DATA-EXPORTER service in the #sentry-esr channel, and in the #monitoring-esr channel we didn’t receive any notifications for ESR processing in the morning.

...

Resolution

  1. We purged the queue esr.queue.audit.neo to release RabbitMQ memory, and when the memory dropped below the watermark it started to accept incoming traffic again (a sketch of the purge follows this list).

  2. We downloaded all the files that arrived in the ESR S3 bucket on 26 Jun, then uploaded them again so they were processed manually.
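
The purge in step 1 can be done from the management UI or rabbitmqctl; a minimal programmatic sketch, assuming the RabbitMQ Java client and a placeholder broker host, looks like this:

    import com.rabbitmq.client.AMQP;
    import com.rabbitmq.client.Channel;
    import com.rabbitmq.client.Connection;
    import com.rabbitmq.client.ConnectionFactory;

    public class PurgeAuditQueue {

        public static void main(String[] args) throws Exception {
            ConnectionFactory factory = new ConnectionFactory();
            factory.setHost("localhost"); // assumption: broker host

            try (Connection connection = factory.newConnection();
                 Channel channel = connection.createChannel()) {

                // Drop every ready message from the backlogged audit queue so the
                // node's memory use falls back below the high watermark.
                AMQP.Queue.PurgeOk ok = channel.queuePurge("esr.queue.audit.neo");
                System.out.println("Purged " + ok.getMessageCount() + " messages");
            }
        }
    }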

...

Timeline

All times in BST unless indicated otherwise

  • 22 00:05 : gmc-client-service crashes attempting to run the overnight sync job due to a lack of memory

  • 01:07 - 08:47 : The monitoring channel showed the task was stopping and being replaced.

  • 08:53 : User reported (on Teams) that the revalidation module was showing only one person under notice

  • 09:30 : Stopped the 2-hourly checks of submitted recommendations; shortly after, temporarily stopped the service to avoid unhelpful logging

  • 09:41 : Moved sync start messages to new queues for debugging

  • 09:43 : Found logging suggesting the incident started at 00:05, around the time the GMC sync job starts

  • ~ 9:45 : Stopped gmc-client task on prod

  • ~10:00 : Restarted gmc-client task on prod, observed the same debug logs (which later appeared to be irrelevant); the task stopped again

  • 10:15 : Changed Log level for gmc-client (set to debug) and pushed to preprod

  • 11:30 : Added JAVA_TOOL_OPTIONS to the task definition, then increased memory from 512M to 2G (see the sketch after this timeline). As part of deploying this change, the production issue also became an issue for our preprod environment

  • ~ 11:30 Triggered GMC sync again on preprod. Failed due to memory error when making SOAP request to GMC

  • ~12:15 Triggered GMC sync again on preprod after increasing memory allocation, this time it worked

  • ~12:20 Identified separate issue with preprod regarding missing queues, reran jenkins build to restore them

  • ~12:20 Triggered GMC sync again on prod after increasing memory allocation, this time it worked

  • ~12:40 : GMC sync appeared healthy on prod and doctors were appearing in connections

  • 13:09 : Noticed the RabbitMQ exceptions and that its memory had reached the high watermark. Then found there were more than 4 million messages in the queue esr.queue.audit.neo

  • 15:06 : Purged esr.queue.audit.neo, and then the memory dropped below the watermark and RabbitMQ started to accept incoming traffic again

  • 10:26 : Checked that 20 files in total had arrived in the S3 bucket on 25 Jun, and only 2 of them (DE_LDN_APC_20230626_00012764 & DE_LDN_DCC_20230626_00012719) were processed in the afternoon of 26 Jun. Also purged the queues esr.dlq.2022-08-19-jay and esr.dlq.2022.08.09

  • 10:40-11:00 : Manually downloaded the files and uploaded them again to trigger ESR processing
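
The 11:30 change combined two things: JAVA_TOOL_OPTIONS to control how much of the container's memory the JVM may use, and a larger ECS memory allocation (512M to 2G). A minimal sketch of verifying what a task actually received at startup; the percentage-based flag is an illustration, not necessarily the exact option used on the day:

    public class MemoryCheck {

        public static void main(String[] args) {
            // Illustrative only: a percentage-based flag such as
            //   JAVA_TOOL_OPTIONS=-XX:MaxRAMPercentage=75.0
            // keeps the heap proportional to the ECS task's memory limit.
            Runtime rt = Runtime.getRuntime();
            long maxHeapMb = rt.maxMemory() / (1024 * 1024);
            int processors = rt.availableProcessors();

            // Logging these at startup makes it obvious when a task has too little
            // memory to hold a large SOAP response such as GetDoctorsForDB.
            System.out.println("Max heap (MB): " + maxHeapMb);
            System.out.println("Available processors: " + processors);
        }
    }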

5 Whys (or other analysis of Root Cause)

  1. Why were no doctors showing in the revalidation recommendations and connections summary lists for most of the day? - Because the GMC overnight sync had failed

  2. Why had the GMC overnight sync job failed? - Because the gmc-client-service kept crashing and restarting the sync process

  3. Why did the gmc-client-service keep crashing? - Because it experienced an out-of-memory error every time it received a response from GMC, and because the sync-trigger message remained queued it presumably kept crashing again each time a replacement task started up

  4. Why was the gmc-client-service experiencing an out-of-memory error every time it received a response from GMC? - Currently unknown

...

  1. Why were ESR files not processed on the morning of 26 Jun? - Because RabbitMQ was not accepting any incoming traffic and was not able to process transactions.

  2. Why was RabbitMQ not able to accept incoming traffic and process transactions? - Because there was not enough memory.

  3. Why was there not enough memory in RabbitMQ? - Because it was taken up by the esr.queue.audit.neo queue, which contained more than 4 million messages.

  4. Why were there more than 4 million messages in the esr.queue.audit.neo queue? - Because the message consumer could not read messages while the database was unavailable.


...

Action Items

  • Reproduce the error on preprod by spinning up a task definition with less memory?
    Owner: Cai Willis
    Comments: There’s a few minutes’ lag between calling the sync endpoint and the sync message showing up in Rabbit; be patient and don’t trigger it multiple times (big grin)

  • Dynamic modification of task definition: memory & CPU?
    Comments: 💲 💲 💲 💲 💲 💲

  • Small tasks/tidy up:
    • Reset cron schedules
    • Make new (log level) parameters environment-specific?

  • Can we improve the speed of the overnight sync job (particularly the CDC process from MongoDB via the MasterDoctorIndex to recommendations)?
    Owner: Joseph (Pepe) Kelly

  • Investigate why there are so many messages in the esr.queue.audit.neo queue when ESR files are being processed, and who consumes this queue. Monitor the queue for a period of time to find out whether the number of messages is still increasing rapidly (see the sketch after this list).
    Owner: Yafang Deng, Jayanta Saha
    Comments: Jira: TIS21-4720

  • Investigate why we still have incoming messages in esr.dlq.2022-08-19-jay and esr.dlq.2022.08.09
    Owner: Jayanta Saha
    Comments: This has led to missing another issue which appears to have increased since the start of June

  • Following the above, test whether instance size or green/blue impacts the number of error messages produced?
    Owner: Joseph (Pepe) Kelly
    Comments: No increase in error messages on stage with instance size made to match prod

  • Clean up queues which aren’t needed right now, even if likely to be added in the near future?
    Comments: Jira: TIS21-3271, TIS21-4747
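
For the queue-monitoring action above, one lightweight option is a scheduled check that reads the queue depth without consuming anything and flags a growing backlog. A minimal sketch, assuming the RabbitMQ Java client, a placeholder broker host and an illustrative threshold:

    import com.rabbitmq.client.AMQP;
    import com.rabbitmq.client.Channel;
    import com.rabbitmq.client.Connection;
    import com.rabbitmq.client.ConnectionFactory;

    public class AuditQueueDepthCheck {

        public static void main(String[] args) throws Exception {
            ConnectionFactory factory = new ConnectionFactory();
            factory.setHost("localhost"); // assumption: broker host

            try (Connection connection = factory.newConnection();
                 Channel channel = connection.createChannel()) {

                // Passive declare fails if the queue does not exist and otherwise
                // returns its current ready-message count without touching it.
                AMQP.Queue.DeclareOk ok = channel.queueDeclarePassive("esr.queue.audit.neo");
                int depth = ok.getMessageCount();

                int threshold = 100_000; // assumption: illustrative alert threshold
                if (depth > threshold) {
                    System.err.println("esr.queue.audit.neo backlog is " + depth
                            + " messages; check its consumer and the database it writes to");
                } else {
                    System.out.println("esr.queue.audit.neo depth: " + depth);
                }
            }
        }
    }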

...

Lessons Learned

  • Learn more about ESR.