Documenting

Date	01 Aug 2022 and 04 Aug 2022
Authors	Joseph (Pepe) Kelly Reuben Roberts Jayanta Saha Yafang Deng
Status	Resolved
Summary	Database used for exchanging information with ESR failed
Impact	Delayed updates to TIS & TSS with trainee and position information from ESR. If Friday’s file for Severn area contains information conflicting with Saturday/Sunday’s files, the older information would have been used.

Non-technical Description

ESR had another period of failing to send files on the day they were generated, this meant a greater number of files, generated between Friday 29th July and Monday 1st August were all sent in a short space of time. This is usually handled by application but this time, the database stopped responding.

...

A miscommunication meant that the file for Severn which was expected on Friday was processed after the files expected on Saturday & Sunday. If files from Saturday/Sunday contain information for trainees and positions which were also in the file from Friday, the earlier updates will have been processed incorrectly as more recent updates.

...

Trigger

Exceptions reported via Slack

...

Detection

Sentry alerting

...

Resolution

Force stopped the database server and restarted it, then requested processing of a number of files

...

Timeline

BST unless otherwise stated

2022-08-01 16:11 ESR processing failed messages start appearing on Slack #monitoring-esr channel
2022-08-01 16:17 Notifications picked up the team
2022-08-01 16:30ish ESR processes on Prod blue and green stopped
2022-08-01 16:32ish Prod MongoDB server stopped
2022-08-01 18:24 Prod MongoDB server started
2022-08-01 20:43 All ESR processes restarted in defined order
2022-08-01 20:36-21:21 Failed and missed RMC files processed in order defined below

...

Root Cause(s)

Unusually high load from ESR?
2
3
4
5Exceptions reported via Slack from Sentry because the applications weren’t able to reach Mongo
The VM memory was maxed out
This seemed to cause database instability and connection issues

Selected log entries (note, this is an edited listing, it does not include all log entries):

For the incident on 1 Aug:

mongo1:
2022-08-01T15:12:43.591+0000 I REPL [replexec-73] Starting an election, since we've seen no PRIMARY in the past 10000ms
2022-08-01T15:12:43.596+0000 I CONNPOOL [RS] Ending idle connection to host mongo3:[some port 2] because the pool meets constraints; 1 connections to that host remain open
2022-08-01T15:12:43.596+0000 I REPL [replication-270] Restarting oplog query due to error: NetworkInterfaceExceededTimeLimit: error in fetcher batch callback :: caused by :: timed out. Last fetched optime (with hash): { ts: Timestamp(1659366606, 48), t: 171 }[1224256196200902242]. Restarts remaining: 1
2022-08-01T15:12:43.615+0000 I ASIO [Replication] Connecting to mongo2:[some port 1]
2022-08-01T15:12:43.615+0000 I ASIO [Replication] Connecting to mongo3:[some port 2]
2022-08-01T15:13:19.880+0000 I CONNPOOL [RS] Ending connection to host mongo3:[some port 2] due to bad connection status; 0 connections to that host remain open
2022-08-01T15:13:58.849+0000 I ASIO [RS] Connecting to mongo3:[some port 2]
2022-08-01T15:14:43.243+0000 I ASIO [Replication] Connecting to mongo2:[some port 1]
2022-08-01T15:14:43.290+0000 I NETWORK [conn65570] received client metadata from [some ip]:[some port] conn65570: { driver: { name: "mongo-java-driver|sync|spring-boot", version: "4.2.3" }, os: { type: "Linux", name: "Linux", architecture: "amd64", version: "4.15.0-189-generic" }, platform: "Java/Oracle Corporation/11.0.14.1+1" }
2022-08-01T15:14:43.310+0000 I ASIO [Replication] Connecting to mongo3:[some port 2]
2022-08-01T15:15:27.419+0000 I REPL [replexec-75] Starting an election, since we've seen no PRIMARY in the past 10000ms
2022-08-01T15:20:42.616+0000 I ASIO [Replication] Connecting to mongo3:[some port 2]
2022-08-01T15:20:47.287+0000 W NETWORK [listener] Error accepting new connection SocketException: remote_endpoint: Transport endpoint is not connected
2022-08-01T15:22:10.320+0000 I ASIO [Replication] Connecting to mongo2:[some port 1]
2022-08-01T15:22:12.704+0000 I NETWORK [conn65466] Error sending response to client: SocketException: Broken pipe. Ending connection from [some ip]:[some port] (connection id: 65466)
2022-08-01T15:22:18.606+0000 I REPL [replexec-76] Member mongo2:[some port 1] is now in state RS_DOWN
2022-08-01T15:24:43.435+0000 I NETWORK [LogicalSessionCacheRefresh] DBClientConnection failed to receive message from mongo3:[some port 2] - HostUnreachable: Connection closed by peer
2022-08-01T15:25:18.605+0000 I NETWORK [ReplicaSetMonitor-TaskExecutor] Ending connection to host mongo1:[some port 0](with timeout of 5 seconds) due to bad connection status; 0 connections to that host remain open
2022-08-01T15:25:18.618+0000 I REPL [replexec-77] Member mongo3:[some port 2] is now in state RS_DOWN
2022-08-01T15:30:17.402+0000 I NETWORK [ReplicaSetMonitor-TaskExecutor] Cannot reach any nodes for set rs0. Please check network connectivity and the status of the set. This has happened for 1 checks in a row.
2022-08-01T15:34:52.549+0000 I ASIO [Replication] Connecting to mongo2:[some port 1]
2022-08-01T15:34:52.549+0000 I ASIO [Replication] Connecting to mongo2:[some port 1]
2022-08-01T15:34:52.571+0000 I REPL [replexec-88] Not starting an election, since we are not electable due to: Not standing for election because I cannot see a majority (mask 0x1)
2022-08-01T15:34:52.577+0000 I NETWORK [conn65613] Error sending response to client: SocketException: Broken pipe. Ending connection from [some ip]:[some port] (connection id: 65613)
...
2022-08-01T17:24:49.905+0000 I CONTROL [initandlisten] MongoDB starting : pid=1 port=[some port 0] dbpath=/data/db 64-bit host=mongo1
2022-08-01T17:24:51.160+0000 I ASIO [Replication] Connecting to mongo2:[some port 1]
2022-08-01T17:24:51.160+0000 I ASIO [Replication] Connecting to mongo3:[some port 2]
2022-08-01T17:24:51.162+0000 I REPL [replexec-1] Member mongo2:[some port 1] is now in state SECONDARY
2022-08-01T17:24:51.162+0000 I REPL [replexec-2] Member mongo3:[some port 2] is now in state SECONDARY
mongo2:
2022-08-01T15:12:43.589+0000 I REPL [replexec-51] Starting an election, since we've seen no PRIMARY in the past 10000ms
2022-08-01T15:13:54.605+0000 I REPL [replexec-52] VoteRequester(term 171 dry run) failed to receive response from mongo1:[some port 0]: NetworkInterfaceExceededTimeLimit: timed out
2022-08-01T15:14:28.439+0000 I CONNPOOL [Replication] Ending connection to host mongo1:[some port 0] due to bad connection status; 1 connections to that host remain open
2022-08-01T15:15:46.952+0000 I ASIO [Replication] Failed to connect to mongo1:[some port 0] - HostUnreachable: Error connecting to mongo1:[some port 0] :: caused by :: Could not find address for mongo1:[some port 0]: SocketException: Host not found (non-authoritative), try again later
2022-08-01T15:24:53.709+0000 I ASIO [Replication] Failed to connect to mongo3:[some port 2] - HostUnreachable: Error connecting to mongo3:[some port 2] :: caused by :: Could not find address for mongo3:[some port 2]: SocketException: Host not found (non-authoritative), try again later
...
2022-08-01T17:24:48.787+0000 I CONTROL [initandlisten] MongoDB starting : pid=1 port=[some port 1] dbpath=/data/db 64-bit host=mongo2
mongo3:
2022-08-01T15:11:37.270+0000 I STORAGE [startPeriodicThreadToAbortExpiredTransactions] Aborting transaction with txnNumber 2113 on session with lsid dcb1e926-7dd3-4c5a-ba85-4d7ee3302948 because it has been running for longer than 'transactionLifetimeLimitSeconds'
2022-08-01T15:12:43.589+0000 I REPL [replexec-16299] Member mongo1:[some port 0] is now in state RS_DOWN
2022-08-01T15:12:43.589+0000 I REPL [replexec-16299] Member mongo2:[some port 1] is now in state RS_DOWN
2022-08-01T15:12:43.589+0000 I REPL [replexec-16299] can't see a majority of the set, relinquishing primary
2022-08-01T15:14:02.432+0000 I REPL [replexec-16303] Member mongo2:[some port 1] is now in state SECONDARY
2022-08-01T15:14:12.839+0000 I CONNPOOL [Replication] Dropping all pooled connections to mongo1:[some port 0] due to HostUnreachable: Timed out refreshing host
2022-08-01T15:16:50.174+0000 I NETWORK [conn62836] Error sending response to client: SocketException: Broken pipe. Ending connection from [some ip]:[some port] (connection id: 62836)
2022-08-01T15:25:18.648+0000 I CONNPOOL [Replication] Dropping all pooled connections to mongo1:[some port 0] due to HostUnreachable: Error connecting to mongo1:[some port 0] :: caused by :: Could not find address for mongo1:[some port 0]: SocketException: Host not found (non-authoritative), try again later
...
2022-08-01T17:24:48.780+0000 I CONTROL [initandlisten] MongoDB starting : pid=1 port=[some port 2] dbpath=/data/db 64-bit host=mongo3

For the incident on 4 Aug:

mongo1:
14:01:28 - network connection between mongo instances becomes unstable
2022-08-04T14:01:28.324+0000 I REPL [replexec-69] Starting an election, since we've seen no PRIMARY in the past 10000ms
2022-08-04T14:01:28.348+0000 I REPL [replication-201] Restarting oplog query due to error: NetworkInterfaceExceededTimeLimit: error in fetcher batch callback :: caused by :: timed out. Last fetched optime (with hash): { ts: Timestamp(1659621668, 3), t: 172 }[-1408329284315383846]. Restarts remaining: 1
2022-08-04T14:02:24.372+0000 I REPL [replexec-64] Member mongo3:[some port 2] is now in state RS_DOWN
2022-08-04T14:03:30.912+0000 I REPL [replexec-68] Member mongo2:[some port 1] is now in state RS_DOWN
14:01:28 - 14:16:01 connection errors continue
2022-08-04T14:11:01.773+0000 I REPL_HB [replexec-2] Error in heartbeat (requestId: 5) to mongo2:[some port 1], response status: HostUnreachable: Error connecting to mongo2:[some port 1] ([some ip 0]:[some port 1]) :: caused by :: Connection refused
14:16:01 UTC successfully reconnected to all instances
2022-08-04T14:16:01.767+0000 I NETWORK [ReplicaSetMonitor-TaskExecutor] Successfully connected to mongo1:[some port 0] (1 connections now open to mongo1:[some port 0] with a 5 second timeout)
mongo2:
2022-08-04T14:01:28.288+0000 I STORAGE [startPeriodicThreadToAbortExpiredTransactions] Aborting transaction with txnNumber 8048 on session with lsid 283bef84-bc35-4e89-9008-c9fa35c12c11 because it has been running for longer than 'transactionLifetimeLimitSeconds'
2022-08-04T14:01:28.327+0000 I REPL [replexec-2187] Member mongo1:[some port 0] is now in state RS_DOWN
2022-08-04T14:01:28.344+0000 I REPL [replexec-2189] Member mongo1:[some port 0] is now in state SECONDARY
2022-08-04T14:02:24.357+0000 I REPL_HB [replexec-2187] Error in heartbeat (requestId: 247044) to mongo1:[some port 0], response status: NetworkInterfaceExceededTimeLimit: Remote command timed out while waiting to get a connection from the pool, took 11979ms, timeout was set to 10000ms
2022-08-04T14:02:24.357+0000 I REPL [replexec-2187] Member mongo1:[some port 0] is now in state RS_DOWN
2022-08-04T14:02:24.402+0000 I REPL [replexec-2187] Member mongo3:[some port 2] is now in state RS_DOWN
2022-08-04T14:02:43.812+0000 I ASIO [Replication] Connecting to mongo1:[some port 0]
2022-08-04T14:02:44.635+0000 I ASIO [Replication] Connecting to mongo3:[some port 2]
2022-08-04T14:03:38.368+0000 I ASIO [Replication] Failed to connect to mongo1:[some port 0] - HostUnreachable: Error connecting to mongo1:[some port 0] :: caused by :: Could not find address for mongo1:[some port 0]: SocketException: Host not found (non-authoritative), try again later
2022-08-04T14:11:00.333+0000 I CONTROL [initandlisten] MongoDB starting : pid=1 port=[some port 1] dbpath=/data/db 64-bit host=mongo2
2022-08-04T14:11:01.862+0000 I ASIO [Replication] Failed to connect to mongo3:[some port 2] - HostUnreachable: Error connecting to mongo3:[some port 2] ([some ip 1]:[some port 2]) :: caused by :: Connection refused
2022-08-04T14:11:01.863+0000 I REPL [replexec-1] Member mongo3:[some port 2] is now in state RS_DOWN
2022-08-04T14:11:02.363+0000 I ASIO [Replication] Connecting to mongo3:[some port 2]
2022-08-04T14:11:02.376+0000 I REPL [replexec-2] Member mongo3:[some port 2] is now in state SECONDARY
2022-08-04T14:11:12.535+0000 I CONNPOOL [Replication] Ending connection to host mongo1:[some port 0] due to bad connection status; 0 connections to that host remain open
2022-08-04T14:11:12.863+0000 I ASIO [Replication] Connecting to mongo1:[some port 0]
2022-08-04T14:11:12.876+0000 I REPL [replexec-0] Member mongo3:[some port 2] is now in state PRIMARY
2022-08-04T14:16:01.866+0000 I NETWORK [LogicalSessionCacheRefresh] Starting new replica set monitor for rs0/mongo1:[some port 0],mongo2:[some port 1],mongo3:[some port 2]
mongo3:
2022-08-04T14:01:28.326+0000 I REPL [replexec-25] Starting an election, since we've seen no PRIMARY in the past 10000ms
2022-08-04T14:01:28.348+0000 I REPL [replication-187] Restarting oplog query due to error: NetworkInterfaceExceededTimeLimit: error in fetcher batch callback :: caused by :: timed out. Last fetched optime (with hash): { ts: Timestamp(1659621668, 3), t: 172 }[-1408329284315383846]. Restarts remaining: 1
2022-08-04T14:01:28.929+0000 I CONNPOOL [RS] Ending connection to host mongo2:[some port 1] due to bad connection status; 1 connections to that host remain open
2022-08-04T14:02:24.299+0000 I ASIO [RS] Failed to connect to mongo2:[some port 1] - HostUnreachable: Error connecting to mongo2:[some port 1] :: caused by :: Could not find address for mongo2:[some port 1]: SocketException: Host not found (non-authoritative), try again later
2022-08-04T14:02:24.324+0000 I CONNPOOL [Replication] Dropping all pooled connections to mongo2:[some port 1] due to HostUnreachable: Error connecting to mongo2:[some port 1] :: caused by :: Could not find address for mongo2:[some port 1]: SocketException: Host not found (non-authoritative), try again later
2022-08-04T14:02:24.399+0000 I CONNPOOL [Replication] Ending connection to host mongo1:[some port 0] due to bad connection status; 1 connections to that host remain open
2022-08-04T14:02:24.399+0000 I ASIO [Replication] Connecting to mongo2:[some port 1]
2022-08-04T14:02:24.399+0000 I ASIO [Replication] Connecting to mongo1:[some port 0]
2022-08-04T14:02:45.568+0000 I REPL_HB [replexec-24] Error in heartbeat (requestId: 4419235) to mongo2:[some port 1], response status: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
2022-08-04T14:02:46.204+0000 I COMMAND [ftdc] serverStatus was very slow: { after basic: 2231, after asserts: 6557, after backgroundFlushing: 9147, after connections: 11942, after dur: 12535, after electionMetrics: 12535, after extra_info: 12595, after globalLock: 12625, after locks: 12625, after logicalSessionRecordCache: 12668, after network: 12668, after opLatencies: 12670, after opReadConcernCounters: 12687, after opcounters: 12687, after opcountersRepl: 12687, after oplogTruncation: 12687, after repl: 12701, after security: 12717, after storageEngine: 12717, after tcmalloc: 12917, after transactions: 13018, after transportSecurity: 13018, after wiredTiger: 13641, at end: 14765 }
2022-08-04T14:02:58.646+0000 I REPL [replexec-24] Member mongo2:[some port 1] is now in state RS_DOWN
2022-08-04T14:04:04.110+0000 I ASIO [Replication] Failed to connect to mongo1:[some port 0] - HostUnreachable: Error connecting to mongo1:[some port 0] :: caused by :: Could not find address for mongo1:[some port 0]: SocketException: Host not found (non-authoritative), try again later
2022-08-04T14:04:04.669+0000 I REPL [replexec-26] Member mongo1:[some port 0] is now in state RS_DOWN

Grafana performance metrics for the Mongo cluster:
2022-08-01 failureincident

2022-08-04 failureincident

...

Action Items

Action Items (1 Aug 2022 incident)

Owner

Look at list of CSV files that were received (as per esr.queue.csv.invalid and any others subsequent to stopping the ESR services that would have not been processed at all because the services were stopped)

Jayanta Saha Reuben Roberts Joseph (Pepe) Kelly

Review RabbitMQ esr.dlq.all messages [to identify any issues (such as?)]

Yafang Deng

Generate some Neo4J queries to check if a given {message id} was processed despite being in the DLQ

Joseph (Pepe) Kelly

Once turned-on MongoDB - check which files exported to ESR today, that they were processed before 16:11 failure

DONE Reuben Roberts

Manually resend rabbit messages for reprocessing (excluding any found in neo4j that were processed despite being in the DLQ)

Joseph (Pepe) Kelly Whilst verifying successful replay of first found that it was already queued.

Message IDs:

1db2f10c-2400-4a16-83bb-440d9962091e REPLAYED but UNNEEDED
~~97f63a59-e285-4d30-9f35-539618b257f1~~ ALREADY QUEUED
~~6db130ee-7e8b-4422-8719-142c964b8840~~ ALREADY QUEUED
~~d787bb7a-8a14-4321-becc-f4f1b9e3d35c~~ DUPLICATE
~~02679ce8-cca3-4ef7-87e0-b40994596cd7~~ ALREADY QUEUED
~~103edc7c-c684-4d07-945a-2a38f7802575~~ DUPLICATE

Restart ESR services (instructions for sequencing for this are here: https://hee-nhs-tis.slack.com/archives/C01C7V5GT43/p1612957581030400 )

Consider: for the position, placement and post queues, it is possible that create+delete messages will be processed in the incorrect order. Is there a way to check this?

Work out how to retrigger file processing (as per list of CSV files to be reprocessed that will be found in S3 bucket and others that were not processed at all and were received after approx. 16:11)

DE_SEV_RMC_20220730_00001157.DAT (16:10:36)

DE_SEV_RMC_20220731_00001158.DAT (16:11:00)

DE_WES_RMC_20220729_00003589.DAT

DE_WES_RMC_20220730_00003590.DAT

DE_WES_RMC_20220731_00003591.DAT (16:12:40)

DE_WMD_RMC_20220729_00003421.DAT

DE_WMD_RMC_20220730_00003422.DAT

DE_WMD_RMC_20220731_00003423.DAT

DE_YHD_RMC_20220729_00003512.DAT

DE_YHD_RMC_20220730_00003513.DAT

DE_YHD_RMC_20220731_00003514.DAT

DE_SEV_RMC_20220729_00001156.DAT

DE_EMD_RMC_20220801_00003116.DAT

…checked through deneries to find the additional RMC files…

DE_EOE_RMC_20220801_00003316.DAT
DE_KSS_RMC_20220801_00003306.DAT
DE_LDN_RMC_20220801_00003704.DAT
DE_MER_RMC_20220801_00003284.DAT
DE_NTH_RMC_20220801_00003529.DAT
DE_NWN_RMC_20220801_00003284.DAT
DE_OXF_RMC_20220801_00003277.DAT
DE_PEN_RMC_20220801_00001411.DAT
DE_SEV_RMC_20220801_00001159.DAT
DE_WES_RMC_20220801_00003592.DAT
DE_WMD_RMC_20220801_00003424.DAT
DE_YHD_RMC_20220801_00003515.DAT

DONE Joseph (Pepe) Kelly

Fix restart config for stage mongo & mongo_exporter services

Modify S3 File trigger to put file notification in a queue (rather than

Jira Legacy

server	System JIRA
serverId	4c843cd5-e5a9-329d-ae88-66091fcfe3c7
key	TIS21-146

)

Post-incident action items:

Look back over similar past incidents to see if there are other actions we should consider

Jayanta Saha

Ticket-up an action for adding VM memory alerting to slack

Reuben Roberts

...

Lessons Learned

Consider script to restart container if memory usage exceeds threshold
Since we are moving to a managed MongoDB Atlas service, the cost-benefit of further analysis of this issue is somewhat limited. However, this may be reviewed if:
- The frequency of incidents increases, or the move to MongoDB Atlas is delayed
- Particular developer interest in the issue (as personal development, which may include additional Mongo training)

Versions Compared

Old Version 22

New Version 23

Key

Non-technical Description

Trigger

Detection

Resolution

Timeline

Root Cause(s)

Action Items

Lessons Learned

Page Comparison

Versions Compared

Old Version 22

New Version 23

Key

Non-technical Description

Trigger

Detection

Resolution

Timeline

Root Cause(s)

Action Items

Lessons Learned