Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

  • 2022-08-01 16:11 ESR processing failed messages start appearing on Slack #monitoring-esr channel

  • 2022-08-01 16:17 Notifications picked up the team

  • 2022-08-01 16:30ish ESR processes on Prod blue and green stopped

  • 2022-08-01 16:32ish Prod MongoDB server stopped

  • 2022-08-01 18:24 Prod MongoDB server started

  • 2022-08-01 20:43 All ESR processes restarted in defined order

  • 2022-08-01 20:36-21:21 Failed and missed RMC files processed in order defined below

...

Root Cause(s)

  • Database instabilityUnusually high load from ESR?

  • 2

  • 3

  • 4

  • 5
    Selected log entries (note, this is an edited listing, it does not include all log entries):

    mongo1:

    2022-08-01T15:12:43.591+0000 I REPL [replexec-73] Starting an election, since we've seen no PRIMARY in the past 10000ms
    2022-08-01T15:12:43.596+0000 I CONNPOOL [RS] Ending idle connection to host mongo3:[some port 2] because the pool meets constraints; 1 connections to that host remain open
    2022-08-01T15:12:43.596+0000 I REPL [replication-270] Restarting oplog query due to error: NetworkInterfaceExceededTimeLimit: error in fetcher batch callback :: caused by :: timed out. Last fetched optime (with hash): { ts: Timestamp(1659366606, 48), t: 171 }[1224256196200902242]. Restarts remaining: 1
    2022-08-01T15:12:43.615+0000 I ASIO [Replication] Connecting to mongo2:[some port 1]
    2022-08-01T15:12:43.615+0000 I ASIO [Replication] Connecting to mongo3:[some port 2]
    2022-08-01T15:13:19.880+0000 I CONNPOOL [RS] Ending connection to host mongo3:[some port 2] due to bad connection status; 0 connections to that host remain open
    2022-08-01T15:13:58.849+0000 I ASIO [RS] Connecting to mongo3:[some port 2]
    2022-08-01T15:14:43.243+0000 I ASIO [Replication] Connecting to mongo2:[some port 1]
    2022-08-01T15:14:43.290+0000 I NETWORK [conn65570] received client metadata from [some ip]:[some port] conn65570: { driver: { name: "mongo-java-driver|sync|spring-boot", version: "4.2.3" }, os: { type: "Linux", name: "Linux", architecture: "amd64", version: "4.15.0-189-generic" }, platform: "Java/Oracle Corporation/11.0.14.1+1" }
    2022-08-01T15:14:43.310+0000 I ASIO [Replication] Connecting to mongo3:[some port 2]
    2022-08-01T15:15:27.419+0000 I REPL [replexec-75] Starting an election, since we've seen no PRIMARY in the past 10000ms
    2022-08-01T15:20:42.616+0000 I ASIO [Replication] Connecting to mongo3:[some port 2]
    2022-08-01T15:20:47.287+0000 W NETWORK [listener] Error accepting new connection SocketException: remote_endpoint: Transport endpoint is not connected
    2022-08-01T15:22:10.320+0000 I ASIO [Replication] Connecting to mongo2:[some port 1]
    2022-08-01T15:22:12.704+0000 I NETWORK [conn65466] Error sending response to client: SocketException: Broken pipe. Ending connection from [some ip]:[some port] (connection id: 65466)
    2022-08-01T15:22:18.606+0000 I REPL [replexec-76] Member mongo2:[some port 1] is now in state RS_DOWN
    2022-08-01T15:24:43.435+0000 I NETWORK [LogicalSessionCacheRefresh] DBClientConnection failed to receive message from mongo3:[some port 2] - HostUnreachable: Connection closed by peer
    2022-08-01T15:25:18.605+0000 I NETWORK [ReplicaSetMonitor-TaskExecutor] Ending connection to host mongo1:[some port 0](with timeout of 5 seconds) due to bad connection status; 0 connections to that host remain open
    2022-08-01T15:25:18.618+0000 I REPL [replexec-77] Member mongo3:[some port 2] is now in state RS_DOWN
    2022-08-01T15:30:17.402+0000 I NETWORK [ReplicaSetMonitor-TaskExecutor] Cannot reach any nodes for set rs0. Please check network connectivity and the status of the set. This has happened for 1 checks in a row.
    2022-08-01T15:34:52.549+0000 I ASIO [Replication] Connecting to mongo2:[some port 1]
    2022-08-01T15:34:52.549+0000 I ASIO [Replication] Connecting to mongo2:[some port 1]
    2022-08-01T15:34:52.571+0000 I REPL [replexec-88] Not starting an election, since we are not electable due to: Not standing for election because I cannot see a majority (mask 0x1)
    2022-08-01T15:34:52.577+0000 I NETWORK [conn65613] Error sending response to client: SocketException: Broken pipe. Ending connection from [some ip]:[some port] (connection id: 65613)
    ...
    2022-08-01T17:24:49.905+0000 I CONTROL [initandlisten] MongoDB starting : pid=1 port=[some port 0] dbpath=/data/db 64-bit host=mongo1
    2022-08-01T17:24:51.160+0000 I ASIO [Replication] Connecting to mongo2:[some port 1]
    2022-08-01T17:24:51.160+0000 I ASIO [Replication] Connecting to mongo3:[some port 2]
    2022-08-01T17:24:51.162+0000 I REPL [replexec-1] Member mongo2:[some port 1] is now in state SECONDARY
    2022-08-01T17:24:51.162+0000 I REPL [replexec-2] Member mongo3:[some port 2] is now in state SECONDARY

    mongo2:

    2022-08-01T15:12:43.589+0000 I REPL [replexec-51] Starting an election, since we've seen no PRIMARY in the past 10000ms
    2022-08-01T15:13:54.605+0000 I REPL [replexec-52] VoteRequester(term 171 dry run) failed to receive response from mongo1:[some port 0]: NetworkInterfaceExceededTimeLimit: timed out
    2022-08-01T15:14:28.439+0000 I CONNPOOL [Replication] Ending connection to host mongo1:[some port 0] due to bad connection status; 1 connections to that host remain open
    2022-08-01T15:15:46.952+0000 I ASIO [Replication] Failed to connect to mongo1:[some port 0] - HostUnreachable: Error connecting to mongo1:[some port 0] :: caused by :: Could not find address for mongo1:[some port 0]: SocketException: Host not found (non-authoritative), try again later
    2022-08-01T15:24:53.709+0000 I ASIO [Replication] Failed to connect to mongo3:[some port 2] - HostUnreachable: Error connecting to mongo3:[some port 2] :: caused by :: Could not find address for mongo3:[some port 2]: SocketException: Host not found (non-authoritative), try again later
    ...
    2022-08-01T17:24:48.787+0000 I CONTROL [initandlisten] MongoDB starting : pid=1 port=[some port 1] dbpath=/data/db 64-bit host=mongo2

    mongo3:

    2022-08-01T15:11:37.270+0000 I STORAGE [startPeriodicThreadToAbortExpiredTransactions] Aborting transaction with txnNumber 2113 on session with lsid dcb1e926-7dd3-4c5a-ba85-4d7ee3302948 because it has been running for longer than 'transactionLifetimeLimitSeconds'
    2022-08-01T15:12:43.589+0000 I REPL [replexec-16299] Member mongo1:[some port 0] is now in state RS_DOWN
    2022-08-01T15:12:43.589+0000 I REPL [replexec-16299] Member mongo2:[some port 1] is now in state RS_DOWN
    2022-08-01T15:12:43.589+0000 I REPL [replexec-16299] can't see a majority of the set, relinquishing primary
    2022-08-01T15:14:02.432+0000 I REPL [replexec-16303] Member mongo2:[some port 1] is now in state SECONDARY
    2022-08-01T15:14:12.839+0000 I CONNPOOL [Replication] Dropping all pooled connections to mongo1:[some port 0] due to HostUnreachable: Timed out refreshing host
    2022-08-01T15:16:50.174+0000 I NETWORK [conn62836] Error sending response to client: SocketException: Broken pipe. Ending connection from [some ip]:[some port] (connection id: 62836)
    2022-08-01T15:25:18.648+0000 I CONNPOOL [Replication] Dropping all pooled connections to mongo1:[some port 0] due to HostUnreachable: Error connecting to mongo1:[some port 0] :: caused by :: Could not find address for mongo1:[some port 0]: SocketException:
    TODOHost not found (non-authoritative), try again later
    ...
    2022-08-01T17:24:48.780+0000 I CONTROL [initandlisten] MongoDB starting : pid=1 port=[some port 2] dbpath=/data/db 64-bit host=mongo3

    mongo1:

    14:01:28 - network connection between mongo instances becomes unstable
    2022-08-04T14:01:28.324+0000 I REPL [replexec-69] Starting an election, since we've seen no PRIMARY in the past 10000ms
    2022-08-04T14:01:28.348+0000 I REPL [replication-201] Restarting oplog query due to error: NetworkInterfaceExceededTimeLimit: error in fetcher batch callback :: caused by :: timed out. Last fetched optime (with hash): { ts: Timestamp(1659621668, 3), t: 172 }[-1408329284315383846]. Restarts remaining: 1
    2022-08-04T14:02:24.372+0000 I REPL [replexec-64] Member mongo3:[some port 2] is now in state RS_DOWN
    2022-08-04T14:03:30.912+0000 I REPL [replexec-68] Member mongo2:[some port 1] is now in state RS_DOWN

    14:01:28 - 14:16:01 connection errors continue
    2022-08-04T14:11:01.773+0000 I REPL_HB [replexec-2] Error in heartbeat (requestId: 5) to mongo2:[some port 1], response status: HostUnreachable: Error connecting to mongo2:[some port 1] ([some ip 0]:[some port 1]) :: caused by :: Connection refused

    14:16:01 UTC successfully reconnected to all instances
    2022-08-04T14:16:01.767+0000 I NETWORK [ReplicaSetMonitor-TaskExecutor] Successfully connected to mongo1:[some port 0] (1 connections now open to mongo1:[some port 0] with a 5 second timeout)

    mongo2:

    2022-08-04T14:01:28.288+0000 I STORAGE [startPeriodicThreadToAbortExpiredTransactions] Aborting transaction with txnNumber 8048 on session with lsid 283bef84-bc35-4e89-9008-c9fa35c12c11 because it has been running for longer than 'transactionLifetimeLimitSeconds'
    2022-08-04T14:01:28.327+0000 I REPL [replexec-2187] Member mongo1:[some port 0] is now in state RS_DOWN
    2022-08-04T14:01:28.344+0000 I REPL [replexec-2189] Member mongo1:[some port 0] is now in state SECONDARY
    2022-08-04T14:02:24.357+0000 I REPL_HB [replexec-2187] Error in heartbeat (requestId: 247044) to mongo1:[some port 0], response status: NetworkInterfaceExceededTimeLimit: Remote command timed out while waiting to get a connection from the pool, took 11979ms, timeout was set to 10000ms
    2022-08-04T14:02:24.357+0000 I REPL [replexec-2187] Member mongo1:[some port 0] is now in state RS_DOWN
    2022-08-04T14:02:24.402+0000 I REPL [replexec-2187] Member mongo3:[some port 2] is now in state RS_DOWN
    2022-08-04T14:02:43.812+0000 I ASIO [Replication] Connecting to mongo1:[some port 0]
    2022-08-04T14:02:44.635+0000 I ASIO [Replication] Connecting to mongo3:[some port 2]
    2022-08-04T14:03:38.368+0000 I ASIO [Replication] Failed to connect to mongo1:[some port 0] - HostUnreachable: Error connecting to mongo1:[some port 0] :: caused by :: Could not find address for mongo1:[some port 0]: SocketException: Host not found (non-authoritative), try again later
    2022-08-04T14:11:00.333+0000 I CONTROL [initandlisten] MongoDB starting : pid=1 port=[some port 1] dbpath=/data/db 64-bit host=mongo2
    2022-08-04T14:11:01.862+0000 I ASIO [Replication] Failed to connect to mongo3:[some port 2] - HostUnreachable: Error connecting to mongo3:[some port 2] ([some ip 1]:[some port 2]) :: caused by :: Connection refused
    2022-08-04T14:11:01.863+0000 I REPL [replexec-1] Member mongo3:[some port 2] is now in state RS_DOWN
    2022-08-04T14:11:02.363+0000 I ASIO [Replication] Connecting to mongo3:[some port 2]
    2022-08-04T14:11:02.376+0000 I REPL [replexec-2] Member mongo3:[some port 2] is now in state SECONDARY
    2022-08-04T14:11:12.535+0000 I CONNPOOL [Replication] Ending connection to host mongo1:[some port 0] due to bad connection status; 0 connections to that host remain open
    2022-08-04T14:11:12.863+0000 I ASIO [Replication] Connecting to mongo1:[some port 0]
    2022-08-04T14:11:12.876+0000 I REPL [replexec-0] Member mongo3:[some port 2] is now in state PRIMARY
    2022-08-04T14:16:01.866+0000 I NETWORK [LogicalSessionCacheRefresh] Starting new replica set monitor for rs0/mongo1:[some port 0],mongo2:[some port 1],mongo3:[some port 2]

    mongo3:

    2022-08-04T14:01:28.326+0000 I REPL [replexec-25] Starting an election, since we've seen no PRIMARY in the past 10000ms
    2022-08-04T14:01:28.348+0000 I REPL [replication-187] Restarting oplog query due to error: NetworkInterfaceExceededTimeLimit: error in fetcher batch callback :: caused by :: timed out. Last fetched optime (with hash): { ts: Timestamp(1659621668, 3), t: 172 }[-1408329284315383846]. Restarts remaining: 1
    2022-08-04T14:01:28.929+0000 I CONNPOOL [RS] Ending connection to host mongo2:[some port 1] due to bad connection status; 1 connections to that host remain open
    2022-08-04T14:02:24.299+0000 I ASIO [RS] Failed to connect to mongo2:[some port 1] - HostUnreachable: Error connecting to mongo2:[some port 1] :: caused by :: Could not find address for mongo2:[some port 1]: SocketException: Host not found (non-authoritative), try again later
    2022-08-04T14:02:24.324+0000 I CONNPOOL [Replication] Dropping all pooled connections to mongo2:[some port 1] due to HostUnreachable: Error connecting to mongo2:[some port 1] :: caused by :: Could not find address for mongo2:[some port 1]: SocketException: Host not found (non-authoritative), try again later
    2022-08-04T14:02:24.399+0000 I CONNPOOL [Replication] Ending connection to host mongo1:[some port 0] due to bad connection status; 1 connections to that host remain open
    2022-08-04T14:02:24.399+0000 I ASIO [Replication] Connecting to mongo2:[some port 1]
    2022-08-04T14:02:24.399+0000 I ASIO [Replication] Connecting to mongo1:[some port 0]
    2022-08-04T14:02:45.568+0000 I REPL_HB [replexec-24] Error in heartbeat (requestId: 4419235) to mongo2:[some port 1], response status: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
    2022-08-04T14:02:46.204+0000 I COMMAND [ftdc] serverStatus was very slow: { after basic: 2231, after asserts: 6557, after backgroundFlushing: 9147, after connections: 11942, after dur: 12535, after electionMetrics: 12535, after extra_info: 12595, after globalLock: 12625, after locks: 12625, after logicalSessionRecordCache: 12668, after network: 12668, after opLatencies: 12670, after opReadConcernCounters: 12687, after opcounters: 12687, after opcountersRepl: 12687, after oplogTruncation: 12687, after repl: 12701, after security: 12717, after storageEngine: 12717, after tcmalloc: 12917, after transactions: 13018, after transportSecurity: 13018, after wiredTiger: 13641, at end: 14765 }
    2022-08-04T14:02:58.646+0000 I REPL [replexec-24] Member mongo2:[some port 1] is now in state RS_DOWN
    2022-08-04T14:04:04.110+0000 I ASIO [Replication] Failed to connect to mongo1:[some port 0] - HostUnreachable: Error connecting to mongo1:[some port 0] :: caused by :: Could not find address for mongo1:[some port 0]: SocketException: Host not found (non-authoritative), try again later
    2022-08-04T14:04:04.669+0000 I REPL [replexec-26] Member mongo1:[some port 0] is now in state RS_DOWN





  • Grafana performance metrics for the Mongo cluster:

    2022-08-01 failure

    2022-08-04 failure

...