...
Messages in Slack sentry-esr channel.
Mongo logs show database failure:
2022-05-05T14:03:25.503+0000 I REPL [replication-162] Restarting oplog query due to error: NetworkInterfaceExceededTimeLimit: error in fetcher batch callback :: caused by :: timed out. Last fetched optime (with hash): { ts: Timestamp(1651759205, 13), t: 162 }[3526795812898854415]. Restarts remaining: 1
2022-05-05T14:03:50.274+0000 I CONNPOOL [Replication] Dropping all pooled connections to mongo2:27012 due to HostUnreachable: Timed out refreshing host
2022-05-05T14:03:51.319+0000 I NETWORK [conn39060] received client metadata from 10.170.0.138:51128 conn39060: { driver: { name: "mongo-java-driver|sync|spring-boot", version: "4.2.3" }, os: { type: "Linux", name: "Linux", architecture: "amd64", version: "4.4.0-210-generic" }, platform: "Java/Oracle Corporation/11.0.14.1+1" }
2022-05-05T14:04:03.382+0000 I COMMAND [LogicalSessionCacheReap] command config.system.sessions command: listIndexes { listIndexes: "system.sessions", cursor: {}, $db: "config" } numYields:0 reslen:449 locks:{ Global: { acquireCount: { r: 1 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } storage:{} protocol:op_msg 120920ms
2022-05-05T14:04:50.212+0000 I NETWORK [ReplicaSetMonitor-TaskExecutor] DBClientConnection failed to receive message from mongo3:27013 - NetworkTimeout: Socket operation timed out
2022-05-05T14:04:54.858+0000 I NETWORK [LogicalSessionCacheRefresh] DBClientConnection failed to receive message from mongo2:27012 - NetworkTimeout: Socket operation timed out
2022-05-05T14:08:07.298+0000 I REPL [replication-162] Scheduled new oplog query Fetcher source: mongo3:27013 database: local query: { find: "oplog.rs", filter: { ts: { $gte: Timestamp(1651759205, 13) } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 2000, batchSize: 13981010, term: 162, readConcern: { afterClusterTime: Timestamp(0, 1) } } query metadata: { $replData: 1, $oplogQueryData: 1, $readPreference: { mode: "secondaryPreferred" } } active: 1 findNetworkTimeout: 7000ms getMoreNetworkTimeout: 10000ms shutting down?: 0 first: 1 firstCommandScheduler: RemoteCommandRetryScheduler request: RemoteCommand 3440356 -- target:mongo3:27013 db:local cmd:{ find: "oplog.rs", filter: { ts: { $gte: Timestamp(1651759205, 13) } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 2000, batchSize: 13981010, term: 162, readConcern: { afterClusterTime: Timestamp(0, 1) } } active: 1 callbackHandle.valid: 1 callbackHandle.cancelled: 0 attempt: 1 retryPolicy: {type: "NoRetryPolicy"}
2022-05-05T14:08:07.299+0000 I NETWORK [conn38] Error sending response to client: SocketException: Broken pipe. Ending connection from 10.170.0.138:37994 (connection id: 38)
2022-05-05T14:08:07.299+0000 I NETWORK [conn35] Error sending response to client: SocketException: Broken pipe. Ending connection from 10.170.0.138:53380 (connection id: 35)
2022-05-05T14:08:07.299+0000 I NETWORK [conn6] Error sending response to client: SocketException: Broken pipe. Ending connection from 10.170.0.138:56572 (connection id: 6)
2022-05-05T14:08:07.299+0000 I NETWORK [conn43] Error sending response to client: SocketException: Broken pipe. Ending connection from 10.170.0.138:38022 (connection id: 43)
2022-05-05T14:08:07.299+0000 I NETWORK [conn44] Error sending response to client: SocketException: Broken pipe. Ending connection from 10.170.0.137:47460 (connection id: 44)
2022-05-05T14:08:07.299+0000 I ASIO [RS] Connecting to mongo3:27013
2022-05-05T14:08:07.305+0000 I REPL [replexec-316] Member mongo2:27012 is now in state RS_DOWN
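When triaging an excerpt like the one above, it can help to split each line into its fields and count entries per component. The following is a minimal sketch, assuming the plain-text "timestamp severity COMPONENT [context] message" layout seen in these lines; the names LOG_LINE, parse_line and summarize are hypothetical and not part of any MongoDB tooling.

```python
import re
from collections import Counter

# Assumed layout, taken from the excerpt: timestamp, one-letter severity,
# upper-case component, bracketed context, then the free-text message.
LOG_LINE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<severity>[A-Z])\s+(?P<component>[A-Z]+)\s+"
    r"\[(?P<context>[^\]]+)\]\s+(?P<message>.*)$"
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def summarize(lines):
    """Count parsed log entries per component (REPL, NETWORK, ...)."""
    counts = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry is not None:
            counts[entry["component"]] += 1
    return counts


# Three lines copied from the excerpt above.
sample = [
    "2022-05-05T14:03:50.274+0000 I CONNPOOL [Replication] Dropping all pooled connections to mongo2:27012 due to HostUnreachable: Timed out refreshing host",
    "2022-05-05T14:08:07.299+0000 I ASIO [RS] Connecting to mongo3:27013",
    "2022-05-05T14:08:07.305+0000 I REPL [replexec-316] Member mongo2:27012 is now in state RS_DOWN",
]

if __name__ == "__main__":
    for component, count in sorted(summarize(sample).items()):
        print(component, count)
```

Lines that do not match the assumed layout (for example, wrapped continuation lines) are skipped rather than guessed at.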
...
Resolution
...
Timeline
BST unless otherwise stated
14:03 - Mongo logs on prod-mongodb-replicaset (10.170.0.151) report failures of the Mongo replicas.
14:10 - HttpServerErrorException$InternalServerError reported by ESR-NotificationGenerator in Slack sentry-esr channel.
15:10 - DataAccessResourceFailureException reported.
19:43 - Failures reported by ESR-AuditService, ESR-NotificationGenerator, ESR-Reconciliation, ESR-AppRecordGenerator and ESR-InboundDataWriter in Slack sentry-esr channel.
...
Root Cause(s)
...
Action Items
...