...

Action Items (1 Aug 2022 incident)

Owner

Look at list of CSV files that were received (as per esr.queue.csv.invalid and any others subsequent to stopping the ESR services that would have not been processed at all because the services were stopped)

Jayanta Saha Reuben Roberts Joseph (Pepe) Kelly

Review RabbitMQ esr.dlq.all messages [to identify any issues (such as?)]

Yafang Deng

Generate some Neo4J queries to check if a given {message id} was processed despite being in the DLQ

Joseph (Pepe) Kelly

Once turned-on MongoDB - check which files exported to ESR today, that they were processed before 16:11 failure

DONE Reuben Roberts

Manually resend rabbit messages for reprocessing (excluding any found in neo4j that were processed despite being in the DLQ)

Joseph (Pepe) Kelly Whilst verifying successful replay of first found that it was already queued.

Message IDs:

1db2f10c-2400-4a16-83bb-440d9962091e REPLAYED but UNNEEDED
~~97f63a59-e285-4d30-9f35-539618b257f1~~ ALREADY QUEUED
~~6db130ee-7e8b-4422-8719-142c964b8840~~ ALREADY QUEUED
~~d787bb7a-8a14-4321-becc-f4f1b9e3d35c~~ DUPLICATE
~~02679ce8-cca3-4ef7-87e0-b40994596cd7~~ ALREADY QUEUED
~~103edc7c-c684-4d07-945a-2a38f7802575~~ DUPLICATE

Restart ESR services (instructions for sequencing for this are here: https://hee-nhs-tis.slack.com/archives/C01C7V5GT43/p1612957581030400 )

Consider: for the position, placement and post queues, it is possible that create+delete messages will be processed in the incorrect order. Is there a way to check this?

Work out how to retrigger file processing (as per list of CSV files to be reprocessed that will be found in S3 bucket and others that were not processed at all and were received after approx. 16:11)

DE_SEV_RMC_20220730_00001157.DAT (16:10:36)

DE_SEV_RMC_20220731_00001158.DAT (16:11:00)

DE_WES_RMC_20220729_00003589.DAT

DE_WES_RMC_20220730_00003590.DAT

DE_WES_RMC_20220731_00003591.DAT (16:12:40)

DE_WMD_RMC_20220729_00003421.DAT

DE_WMD_RMC_20220730_00003422.DAT

DE_WMD_RMC_20220731_00003423.DAT

DE_YHD_RMC_20220729_00003512.DAT

DE_YHD_RMC_20220730_00003513.DAT

DE_YHD_RMC_20220731_00003514.DAT

DE_SEV_RMC_20220729_00001156.DAT

DE_EMD_RMC_20220801_00003116.DAT

…checked through deneries to find the additional RMC files…

DE_EOE_RMC_20220801_00003316.DAT
DE_KSS_RMC_20220801_00003306.DAT
DE_LDN_RMC_20220801_00003704.DAT
DE_MER_RMC_20220801_00003284.DAT
DE_NTH_RMC_20220801_00003529.DAT
DE_NWN_RMC_20220801_00003284.DAT
DE_OXF_RMC_20220801_00003277.DAT
DE_PEN_RMC_20220801_00001411.DAT
DE_SEV_RMC_20220801_00001159.DAT
DE_WES_RMC_20220801_00003592.DAT
DE_WMD_RMC_20220801_00003424.DAT
DE_YHD_RMC_20220801_00003515.DAT

DONE Joseph (Pepe) Kelly

Fix restart config for stage mongo & mongo_exporter services

Modify S3 File trigger to put file notification in a queue (rather than

Jira Legacy

server	System JIRA
serverId	4c843cd5-e5a9-329d-ae88-66091fcfe3c7
key	TIS21-146

)

Post-incident action items:

Look back over similar past incidents to see if there are other actions we should consider

Jayanta Saha

Ticket-up an action for adding VM memory alerting to slack

DONE Reuben Roberts

Jira Legacy

server	System JIRA
serverId	4c843cd5-e5a9-329d-ae88-66091fcfe3c7
key	TIS21-3298

...

Lessons Learned

Consider script to restart container if memory usage exceeds threshold
Since we are moving to a managed MongoDB Atlas service, the cost-benefit of further analysis of this issue is somewhat limited. However, this may be reviewed if:
- The frequency of incidents increases, or the move to MongoDB Atlas is delayed
- Particular developer interest in the issue (as personal development, which may include additional Mongo training)

Versions Compared

Old Version 23

New Version Current

Key

Lessons Learned

Page Comparison

Versions Compared

Old Version 23

New Version Current

Key

Lessons Learned