2023-06-26 ESR Inbound files were not processed

Date

Jun 26, 2023

Authors

@Yafang Deng @Jayanta Saha @John Simmons (Deactivated) @James Harris @catherine.odukale (Unlicensed)

Status

Done

Summary

https://hee-tis.atlassian.net/browse/TIS21-4707

Impact

Updates to TIS and TSS with trainee and position information from ESR were delayed, and TIS was delayed in exporting outbound files to ESR.

Non-technical Description

The TIS-ESR interface works with certain files being sent and received at specific times. We observed an issue yesterday that stopped some of these files from being processed when they should have been. In order not to miss the next time frame for these files, we needed to ensure they were processed before the next day’s inbound files.

The services that store information failed and a number of files were not processed. The built-in alerting notified the team and, after verifying the status of a number of failed individual transactions, we resolved the immediate problem and resent the instructions to process the files listed below.

in/DE_EMD_RMC_20230626_00003445.DAT
in/DE_EOE_RMC_20230626_00003645.DAT
in/DE_KSS_APC_20230625_00012753.DAT
in/DE_KSS_RMC_20230626_00003635.DAT
in/DE_LDN_APC_20230625_00012751.DAT
in/DE_LDN_APC_20230626_00012764.DAT
in/DE_LDN_DCC_20230626_00012719.DAT
in/DE_LDN_RMC_20230626_00004033.DAT
in/DE_MER_RMC_20230626_00003613.DAT
in/DE_NTH_APC_20230625_00012755.DAT
in/DE_NTH_RMC_20230626_00003858.DAT
in/DE_NWN_RMC_20230626_00003613.DAT
in/DE_OXF_APC_20230625_00012758.DAT
in/DE_OXF_RMC_20230626_00003606.DAT
in/DE_PEN_RMC_20230626_00001740.DAT
in/DE_SEV_RMC_20230626_00001488.DAT
in/DE_WES_APC_20230625_00012757.DAT
in/DE_WES_RMC_20230626_00003921.DAT
in/DE_WMD_RMC_20230626_00003753.DAT
in/DE_YHD_RMC_20230626_00003844.DAT


Trigger

The memory of one RabbitMQ node reached the high watermark, and all incoming traffic (including ESR, Reval, TIS, etc.) was blocked.
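
For background, a RabbitMQ node raises a memory alarm once its memory use crosses the configured high watermark, and it blocks all publishing connections until usage falls back below it. A minimal sketch of checking the alarm state through the management HTTP API (assuming the management plugin is enabled; the host and credentials below are placeholders):

    import requests

    RABBITMQ_API = "http://localhost:15672/api"  # placeholder host
    AUTH = ("guest", "guest")                    # placeholder credentials

    # GET /api/nodes returns per-node stats, including the memory alarm flag.
    for node in requests.get(f"{RABBITMQ_API}/nodes", auth=AUTH, timeout=10).json():
        pct = node["mem_used"] / node["mem_limit"]
        # mem_alarm=True means the node is blocking all publishers.
        print(f"{node['name']}: {pct:.0%} of high watermark, mem_alarm={node['mem_alarm']}")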


Detection

We got an alert on the ESR-DATA-EXPORTER service in the #sentry-esr channel, and in the #monitoring-esr channel we didn’t receive any of the usual notifications for ESR processing in the morning.


Resolution

  1. We purged the queue esr.queue.audit.neo to release RabbitMQ memory; once the memory dropped below the watermark, RabbitMQ started to accept incoming traffic again (a purge sketch follows this list).

  2. We downloaded all the files that arrived in the ESR S3 bucket on 26 Jun, then re-uploaded them so they were processed.
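
The purge in step 1 can also be scripted against the RabbitMQ management HTTP API rather than done through the UI. A minimal sketch, assuming the default vhost and placeholder host/credentials:

    import requests

    RABBITMQ_API = "http://localhost:15672/api"  # placeholder host
    AUTH = ("guest", "guest")                    # placeholder credentials
    VHOST = "%2F"                                # default vhost "/", URL-encoded

    # DELETE /api/queues/{vhost}/{queue}/contents drops every ready message.
    resp = requests.delete(
        f"{RABBITMQ_API}/queues/{VHOST}/esr.queue.audit.neo/contents",
        auth=AUTH, timeout=10)
    resp.raise_for_status()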


Timeline

All times in BST unless indicated

  • Jun 26, 2023 13:09 Noticed the RabbitMQ exceptions and that its memory had reached the high watermark. Then found there were more than 4 million messages in the queue esr.queue.audit.neo.

  • Jun 26, 2023 15:06 Purged esr.queue.audit.neo, and then the memory dropped below the watermark and RabbitMQ started to accept incoming traffic again.

  • Jun 27, 2023 10:26 Checked that 20 files in total had arrived in the S3 bucket on 26 Jun, and that only 2 of them (DE_LDN_APC_20230626_00012764 & DE_LDN_DCC_20230626_00012719) had been processed, on the afternoon of 26 Jun. Also purged the queues esr.dlq.2022-08-19-jay and esr.dlq.2022.08.09.

  • Jun 27, 2023 10:40-11:00 Manually downloaded the files and uploaded them again to trigger ESR processing (a scripted equivalent is sketched below).
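
The manual download/re-upload could be scripted along these lines. A sketch assuming boto3 credentials are already configured, a hypothetical bucket name, and that the bucket’s ObjectCreated notification is what triggers ESR processing:

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "esr-inbound"  # hypothetical bucket name

    # Download each unprocessed file, then upload it to the same key so the
    # ObjectCreated notification fires again and processing is re-triggered.
    keys = ["in/DE_EMD_RMC_20230626_00003445.DAT"]  # ...plus the other 19 files
    for key in keys:
        local = key.rsplit("/", 1)[-1]
        s3.download_file(BUCKET, key, local)
        s3.upload_file(local, BUCKET, key)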

5 Whys (or other analysis of Root Cause)

  1. Why were ESR files not processed on the morning of 26 Jun? - Because RabbitMQ was not accepting any incoming traffic and was not able to process transactions.

  2. Why was RabbitMQ not able to accept incoming traffic and process transactions? - Because there was not enough memory.

  3. Why was there not enough memory in RabbitMQ? - Because it was occupied by the esr.queue.audit.neo queue, which held more than 4 million messages and ate up the resources.

  4. Why were there more than 4 million messages in the esr.queue.audit.neo queue? - Because the message consumer could not read messages while the database was unavailable.


Action Items

  1. Investigate why there are so many messages in the esr.queue.audit.neo queue when ESR files are being processed, and who consumes this queue. Monitor the queue for a period of time to find out whether the number of messages is still increasing rapidly (a queue-depth sketch follows this list).
     Owner: @Yafang Deng @Jayanta Saha
     Comments: https://hee-tis.atlassian.net/browse/TIS21-4720

  2. Investigate why we still have incoming messages in esr.dlq.2022-08-19-jay and esr.dlq.2022.08.09.
     Owner: @Jayanta Saha
     Comments: This has led to missing another issue, which appears to have increased since the start of June.

  3. Following the above, test whether instance size or green/blue deployment impacts the number of error messages produced.
     Owner: @Joseph (Pepe) Kelly
     Comments: No increase in error messages on stage with the instance size made to match prod.

  4. Clean up queues which aren’t needed right now, even if they are likely to be added back in the near future.
     Comments: https://hee-tis.atlassian.net/browse/TIS21-4747
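
For the first action item, the queue monitoring could be a periodic check along these lines; a minimal sketch against the management HTTP API, with placeholder host, credentials, and alert threshold:

    import requests

    RABBITMQ_API = "http://localhost:15672/api"  # placeholder host
    AUTH = ("guest", "guest")                    # placeholder credentials
    THRESHOLD = 100_000                          # placeholder alert threshold

    # GET /api/queues/{vhost}/{queue} includes the current message count.
    q = requests.get(f"{RABBITMQ_API}/queues/%2F/esr.queue.audit.neo",
                     auth=AUTH, timeout=10).json()
    if q["messages"] > THRESHOLD:
        print(f"ALERT: esr.queue.audit.neo backlog is {q['messages']} messages")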


Lessons Learned

  • Learn more about ESR.