Date	01 Aug 2022
Authors
Status	Documenting
Summary	Database used fro exchanging information with ESR failed
Impact	Delay updating TIS & TSS with trainee and position information from ESR. If Friday’s file for Severn area contains information conflicting with Saturday/Sunday’s files, the older information would have been used.

Non-technical Description

ESR had another period of failing to send files on the day they were generated, this meant a greater number of files, generated between Friday 29th July and Monday 1st August were all sent in a short space of time. This is usually handled by application but this time, the database stopped responding.

The services that store information failed and a number of files were not processed. The built in alerting notified the team and after verifying the status of a number of failed individual transactions, we resolved the immediate problem and resent the instructions to process the files listed below.

A miscommunication meant that the file for Severn which was expected on Friday was processed after the files expected on Saturday & Sunday. If files from Saturday/Sunday contain information for trainees and positions which were also in the file from Friday, the earlier updates will have been processed incorrectly as more recent updates.

Trigger

Exceptions reported via Slack

Detection

Sentry alerting

Resolution

Force stopped the database server and restarted it, then requested processing of a number of files

Timeline

BST unless otherwise stated

2022-08-01 16:11 ESR processing failed messages start appearing on Slack #monitoring-esr channel
2022-08-01 16:17 Notifications picked up the team
2022-08-01 16:30ish ESR processes on Prod blue and green stopped
2022-08-01 16:32ish Prod MongoDB server stopped
2022-08-01 18:24 Prod MongoDB server started
2022-08-01 20:43 All ESR processes restarted in defined order
2022-08-01 20:36-21:21 Failed and missed RMC files processed in order defined below

Root Cause(s)

Action Items

Action Items	Owner
Look at list of CSV files that were received (as per esr.queue.csv.invalid and any others subsequent to stopping the ESR services that would have not been processed at all because the services were stopped)	Jayanta Saha Reuben Roberts Joseph (Pepe) Kelly
Review RabbitMQ esr.dlq.all messages [to identify any issues (such as?)]	Yafang Deng
Generate some Neo4J queries to check if a given {message id} was processed despite being in the DLQ	Joseph (Pepe) Kelly
Once turned-on MongoDB - check which files exported to ESR today, that they were processed before 16:11 failure	Reuben Roberts
Manually resend rabbit messages for reprocessing (excluding any found in neo4j that were processed despite being in the DLQ)	Joseph (Pepe) Kelly Whilst verifying sucessful replay of first found that it was aready queued. Message IDs: 1db2f10c-2400-4a16-83bb-440d9962091e REPLAYED but UNNEEDED ~~97f63a59-e285-4d30-9f35-539618b257f1~~ ALREADY QUEUED ~~6db130ee-7e8b-4422-8719-142c964b8840~~ ALREADY QUEUED ~~d787bb7a-8a14-4321-becc-f4f1b9e3d35c~~ DUPLICATE ~~02679ce8-cca3-4ef7-87e0-b40994596cd7~~ ALREADY QUEUED ~~103edc7c-c684-4d07-945a-2a38f7802575~~ DUPLICATE
Restart ESR services (instructions for sequencing for this are here: https://hee-nhs-tis.slack.com/archives/C01C7V5GT43/p1612957581030400 )
Consider: for the position, placement and post queues, it is possible that create+delete messages will be processed in the incorrect order. Is there a way to check this?
Work out how to retrigger file processing (as per list of CSV files to be reprocessed that will be found in S3 bucket and others that were not processed at all and were received after approx. 16:11) DE_SEV_RMC_20220730_00001157.DAT (16:10:36) DE_SEV_RMC_20220731_00001158.DAT (16:11:00) DE_WES_RMC_20220729_00003589.DAT DE_WES_RMC_20220730_00003590.DAT DE_WES_RMC_20220731_00003591.DAT (16:12:40) DE_WMD_RMC_20220729_00003421.DAT DE_WMD_RMC_20220730_00003422.DAT DE_WMD_RMC_20220731_00003423.DAT DE_YHD_RMC_20220729_00003512.DAT DE_YHD_RMC_20220730_00003513.DAT DE_YHD_RMC_20220731_00003514.DAT DE_SEV_RMC_20220729_00001156.DAT DE_EMD_RMC_20220801_00003116.DAT …checked through deneries to find the additional RMC files… DE_EOE_RMC_20220801_00003316.DAT DE_KSS_RMC_20220801_00003306.DAT DE_LDN_RMC_20220801_00003704.DAT DE_MER_RMC_20220801_00003284.DAT DE_NTH_RMC_20220801_00003529.DAT DE_NWN_RMC_20220801_00003284.DAT DE_OXF_RMC_20220801_00003277.DAT DE_PEN_RMC_20220801_00001411.DAT DE_SEV_RMC_20220801_00001159.DAT DE_WES_RMC_20220801_00003592.DAT DE_WMD_RMC_20220801_00003424.DAT DE_YHD_RMC_20220801_00003515.DAT	DONE Joseph (Pepe) Kelly
Later: Fix restart config for stage mongo & mongo_exporter services

Non-technical Description

Trigger

Detection

Resolution

Timeline

Root Cause(s)

Action Items

Lessons Learned

0 Comments

2022-08-01 ESR processing failures

Non-technical Description

Trigger

Detection

Resolution

Timeline

Root Cause(s)

Action Items

Lessons Learned

0 Comments