Date | |
Authors | |
Status | Documenting |
Summary | |
Impact | Some trainee placements may not have been exported to ESR |
Non-technical Description
Mongo is the database that the ESR services use to store data coming into TIS from ESR and to keep a record of the notifications that TIS sends to ESR. When that database fails, the ESR services cannot function. Trainee data is not lost, but communication between TIS and ESR is disrupted. The database failure happened one day after a large number of trainee transfer events, which would be expected to generate many notifications to keep ESR up to date with trainees' placements. Some of these notifications may not have been sent because of the database failure.
Trigger
Detection
Messages in Slack sentry-esr channel.
Mongo logs show database failure:
2022-05-05T14:03:25.503+0000 I REPL [replication-162] Restarting oplog query due to error: NetworkInterfaceExceededTimeLimit: error in fetcher batch callback :: caused by :: timed out. Last fetched optime (with hash): { ts: Timestamp(1651759205, 13), t: 162 }[3526795812898854415]. Restarts remaining: 1
2022-05-05T14:03:50.274+0000 I CONNPOOL [Replication] Dropping all pooled connections to mongo2:27012 due to HostUnreachable: Timed out refreshing host
2022-05-05T14:03:51.319+0000 I NETWORK [conn39060] received client metadata from 10.170.0.138:51128 conn39060: { driver: { name: "mongo-java-driver|sync|spring-boot", version: "4.2.3" }, os: { type: "Linux", name: "Linux", architecture: "amd64", version: "4.4.0-210-generic" }, platform: "Java/Oracle Corporation/11.0.14.1+1" }
2022-05-05T14:04:03.382+0000 I COMMAND [LogicalSessionCacheReap] command config.system.sessions command: listIndexes { listIndexes: "system.sessions", cursor: {}, $db: "config" } numYields:0 reslen:449 locks:{ Global: { acquireCount: { r: 1 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } storage:{} protocol:op_msg 120920ms
2022-05-05T14:04:50.212+0000 I NETWORK [ReplicaSetMonitor-TaskExecutor] DBClientConnection failed to receive message from mongo3:27013 - NetworkTimeout: Socket operation timed out
2022-05-05T14:04:54.858+0000 I NETWORK [LogicalSessionCacheRefresh] DBClientConnection failed to receive message from mongo2:27012 - NetworkTimeout: Socket operation timed out
2022-05-05T14:08:07.298+0000 I REPL [replication-162] Scheduled new oplog query Fetcher source: mongo3:27013 database: local query: { find: "oplog.rs", filter: { ts: { $gte: Timestamp(1651759205, 13) } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 2000, batchSize: 13981010, term: 162, readConcern: { afterClusterTime: Timestamp(0, 1) } } query metadata: { $replData: 1, $oplogQueryData: 1, $readPreference: { mode: "secondaryPreferred" } } active: 1 findNetworkTimeout: 7000ms getMoreNetworkTimeout: 10000ms shutting down?: 0 first: 1 firstCommandScheduler: RemoteCommandRetryScheduler request: RemoteCommand 3440356 -- target:mongo3:27013 db:local cmd:{ find: "oplog.rs", filter: { ts: { $gte: Timestamp(1651759205, 13) } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 2000, batchSize: 13981010, term: 162, readConcern: { afterClusterTime: Timestamp(0, 1) } } active: 1 callbackHandle.valid: 1 callbackHandle.cancelled: 0 attempt: 1 retryPolicy: {type: "NoRetryPolicy"}
2022-05-05T14:08:07.299+0000 I NETWORK [conn38] Error sending response to client: SocketException: Broken pipe. Ending connection from 10.170.0.138:37994 (connection id: 38)
2022-05-05T14:08:07.299+0000 I NETWORK [conn35] Error sending response to client: SocketException: Broken pipe. Ending connection from 10.170.0.138:53380 (connection id: 35)
2022-05-05T14:08:07.299+0000 I NETWORK [conn6] Error sending response to client: SocketException: Broken pipe. Ending connection from 10.170.0.138:56572 (connection id: 6)
2022-05-05T14:08:07.299+0000 I NETWORK [conn43] Error sending response to client: SocketException: Broken pipe. Ending connection from 10.170.0.138:38022 (connection id: 43)
2022-05-05T14:08:07.299+0000 I NETWORK [conn44] Error sending response to client: SocketException: Broken pipe. Ending connection from 10.170.0.137:47460 (connection id: 44)
2022-05-05T14:08:07.299+0000 I ASIO [RS] Connecting to mongo3:27013
2022-05-05T14:08:07.305+0000 I REPL [replexec-316] Member mongo2:27012 is now in state RS_DOWN
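The log excerpt above can be scanned for failure indicators with a short script. The following is a minimal sketch, not part of the TIS tooling: the function names (`parse_line`, `find_failures`, `to_bst`) and the `FAILURE_MARKERS` list are hypothetical, and the pattern assumes the plain-text line layout seen in the excerpt (timestamp, severity, component, bracketed context, message). Because the log timestamps are in UTC (+0000), the sketch also includes a helper that shifts them to BST (UTC+1) for comparison with the timeline.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed line layout, based on the excerpt above:
# <ISO timestamp> <severity> <component> [<context>] <message>
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}[+-]\d{4})\s+"
    r"(?P<severity>[A-Z])\s+"
    r"(?P<component>[A-Z]+)\s+"
    r"\[(?P<context>[^\]]+)\]\s+"
    r"(?P<message>.*)$"
)

# Illustrative list of substrings taken from the excerpt; extend as needed.
FAILURE_MARKERS = (
    "RS_DOWN",
    "HostUnreachable",
    "NetworkTimeout",
    "NetworkInterfaceExceededTimeLimit",
    "Broken pipe",
)

# BST is UTC+1; a fixed offset avoids depending on a timezone database.
BST = timezone(timedelta(hours=1), "BST")


def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    entry["ts"] = datetime.strptime(entry["ts"], "%Y-%m-%dT%H:%M:%S.%f%z")
    return entry


def find_failures(lines):
    """Return (timestamp, component, markers) for lines containing a marker."""
    hits = []
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue  # skip lines that do not follow the assumed layout
        markers = [m for m in FAILURE_MARKERS if m in entry["message"]]
        if markers:
            hits.append((entry["ts"], entry["component"], markers))
    return hits


def to_bst(ts):
    """Convert a timezone-aware timestamp to BST (UTC+1)."""
    return ts.astimezone(BST)


if __name__ == "__main__":
    sample = [
        "2022-05-05T14:08:07.299+0000 I ASIO [RS] Connecting to mongo3:27013",
        "2022-05-05T14:08:07.305+0000 I REPL [replexec-316] Member mongo2:27012 is now in state RS_DOWN",
    ]
    for ts, component, markers in find_failures(sample):
        print(to_bst(ts).isoformat(), component, markers)
```

Run against a full log file, the same functions would give a rough first view of when each replica set member became unreachable; the marker list is a starting point only and would miss failure modes not seen in this incident.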
Resolution
Timeline
BST unless otherwise stated
14:03 (UTC, per the log timestamps) - Mongo logs on prod-mongodb-replicaset (10.170.0.151) report a Mongo replica failure.
14:10 - HttpServerErrorException$InternalServerError reported by ESR-NotificationGenerator in Slack sentry-esr channel.
15:10 - 19:43 Failures reported by ESR-AuditService, ESR-NotificationGenerator, ESR-Reconciliation, ESR-AppRecordGenerator and ESR-InboundDataWriter in Slack sentry-esr channel.
Root Cause(s)
Action Items
Action Items | Owner |
---|---|