Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Date

Authors

Joseph (Pepe) Kelly Reuben Roberts

Status

DocumentingDone

Summary

Impact

Some trainee placements may not have been exported to ESR

...

Mongo is the database that is used by the TIS ESR services to store trainee data coming into TIS from ESR, and to keep record of the notifications of trainee data changes that TIS in turn sends to ESR. When that database fails, the ESR services cannot function. Trainee data is not lost, but the communication between TIS and ESR is disrupted. Since in this case the database failure happened one day after a large number of trainee transfer events, a large volume of notifications would be expected to be sent to ESR to keep ESR updated with trainees placements. Some of these may not have been sent due to the database failure.

...

Trigger

  • Accumulated Server load

Detection

  • Messages in Slack sentrymonitoring-esrprod channel.

  • Mongo logs show connectivity failure between replicas:
    Mongo1:
    2022-05-05T14:03:25.503+0000 I REPL [replication-162] Restarting oplog query due to error: NetworkInterfaceExceededTimeLimit: error in fetcher batch callback :: caused by :: timed out. Last fetched optime (with hash): { ts: Timestamp(1651759205, 13), t: 162 }[3526795812898854415]. Restarts remaining: 1
    2022-05-05T14:03:50.274+0000 I CONNPOOL [Replication] Dropping all pooled connections to mongo2:27012 due to HostUnreachable: Timed out refreshing host
    2022-05-05T14:03:51.319+0000 I NETWORK [conn39060] received client metadata from 10.170.0.138appserver:51128 conn39060: { driver: { name: "mongo-java-driver|sync|spring-boot", version: "4.2.3" }, os: { type: "Linux", name: "Linux", architecture: "amd64", version: "4.4.0-210-generic" }, platform: "Java/Oracle Corporation/11.0.14.1+1" }
    2022-05-05T14:04:03.382+0000 I COMMAND [LogicalSessionCacheReap] command config.system.sessions command: listIndexes { listIndexes: "system.sessions", cursor: {}, $db: "config" } numYields:0 reslen:449 locks:{ Global: { acquireCount: { r: 1 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } storage:{} protocol:op_msg 120920ms
    2022-05-05T14:04:50.212+0000 I NETWORK [ReplicaSetMonitor-TaskExecutor] DBClientConnection failed to receive message from mongo3:27013 - NetworkTimeout: Socket operation timed out
    2022-05-05T14:04:54.858+0000 I NETWORK [LogicalSessionCacheRefresh] DBClientConnection failed to receive message from mongo2:27012 - NetworkTimeout: Socket operation timed out
    2022-05-05T14:08:07.298+0000 I REPL [replication-162] Scheduled new oplog query Fetcher source: mongo3:27013 database: local query: { find: "http://oplog.rs", filter: { ts: { $gte: Timestamp(1651759205, 13) } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 2000, batchSize: 13981010, term: 162, readConcern: { afterClusterTime: Timestamp(0, 1) } } query metadata: { $replData: 1, $oplogQueryData: 1, $readPreference: { mode: "secondaryPreferred" } } active: 1 findNetworkTimeout: 7000ms getMoreNetworkTimeout: 10000ms shutting down?: 0 first: 1 firstCommandScheduler: RemoteCommandRetryScheduler request: RemoteCommand 3440356 -- target:mongo3:27013 db:local cmd:{ find: "http://oplog.rs", filter: { ts: { $gte: Timestamp(1651759205, 13) } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 2000, batchSize: 13981010, term: 162, readConcern: { afterClusterTime: Timestamp(0, 1) } } active: 1 callbackHandle.valid: 1 callbackHandle.cancelled: 0 attempt: 1 retryPolicy: {type: "NoRetryPolicy"}
    2022-05-05T14:08:07.299+0000 I NETWORK [conn38] Error sending response to client: SocketException: Broken pipe. Ending connection from 10.170.0.138appserver:37994 (connection id: 38)
    2022-05-05T14:08:07.299+0000 I NETWORK [conn35] Error sending response to client: SocketException: Broken pipe. Ending connection from 10.170.0.138appserver:53380 (connection id: 35)
    2022-05-05T14:08:07.299+0000 I NETWORK [conn6] Error sending response to client: SocketException: Broken pipe. Ending connection from 10.170.0.138appserver:56572 (connection id: 6)
    2022-05-05T14:08:07.299+0000 I NETWORK [conn43] Error sending response to client: SocketException: Broken pipe. Ending connection from 10.170.0.138appserver:38022 (connection id: 43)
    2022-05-05T14:08:07.299+0000 I NETWORK [conn44] Error sending response to client: SocketException: Broken pipe. Ending connection from 10.170.0.137appserver:47460 (connection id: 44)
    2022-05-05T14:08:07.299+0000 I ASIO [RS] Connecting to mongo3:27013
    2022-05-05T14:08:07.305+0000 I REPL [replexec-316] Member mongo2:27012 is now in state RS_DOWN
    Mongo2:
    2022-05-05T14:02:37.564+0000 I CONNPOOL [RS] Ending connection to host mongo3:27013 due to bad connection status; 0 connections to that host remain open
    2022-05-05T14:02:40.254+0000 I REPL_HB [replexec-30] Error in heartbeat (requestId: 3439616) to mongo1:27011, response status: NetworkInterfaceExceededTimeLimit: Remote command timed out while waiting to get a connection from the pool, took 71075ms, timeout was set to 10000ms
    2022-05-05T14:02:41.527+0000 I CONNPOOL [Replication] Ending connection to host mongo1:27011 due to bad connection status; 0 connections to that host remain open
    2022-05-05T14:02:57.263+0000 I COMMAND [PeriodicTaskRunner] task: DBConnectionPool-cleaner took: 1225ms
    2022-05-05T14:03:03.499+0000 I NETWORK [listener] connection accepted from 10.170.0.138appserver:35184 #37606 (44 connections now open)
    2022-05-05T14:03:17.640+0000 I COMMAND [LogicalSessionCacheReap] command config.system.sessions command: listIndexes { listIndexes: "system.sessions", cursor: {}, $db: "config" } numYields:0 reslen:449 locks:{ Global: { acquireCount: { r: 1 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } storage:{} protocol:op_msg 74513ms
    2022-05-05T14:03:39.033+0000 I NETWORK [conn37605] received client metadata from 10.170.0.137appserver:38808 conn37605: { driver: { name: "mongo-java-driver|sync|spring-boot", version: "4.0.5" }, os: { type: "Linux", name: "Linux", architecture: "amd64", version: "4.4.0-210-generic" }, platform: "Java/Oracle Corporation/11.0.11+9" }
    2022-05-05T14:05:29.136+0000 I NETWORK [ReplicaSetMonitor-TaskExecutor] DBClientConnection failed to receive message from mongo2:27012 - NetworkTimeout: Socket operation timed out
    2022-05-05T14:05:38.889+0000 I NETWORK [LogicalSessionCacheRefresh] DBClientConnection failed to receive message from mongo3:27013 - NetworkTimeout: Socket operation timed out
    2022-05-05T14:05:49.315+0000 I REPL [replication-151] Scheduled new oplog query Fetcher source: mongo3:27013 database: local query: { find: "http://oplog.rs", filter: { ts: { $gte: Timestamp(1651759205, 13) } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 2000, batchSize: 13981010, term: 162, readConcern: { afterClusterTime: Timestamp(0, 1) } } query metadata: { $replData: 1, $oplogQueryData: 1, $readPreference: { mode: "secondaryPreferred" } } active: 1 findNetworkTimeout: 7000ms getMoreNetworkTimeout: 10000ms shutting down?: 0 first: 1 firstCommandScheduler: RemoteCommandRetryScheduler request: RemoteCommand 3439619 -- target:mongo3:27013 db:local cmd:{ find: "http://oplog.rs", filter: { ts: { $gte: Timestamp(1651759205, 13) } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 2000, batchSize: 13981010, term: 162, readConcern: { afterClusterTime: Timestamp(0, 1) } } active: 1 callbackHandle.valid: 1 callbackHandle.cancelled: 0 attempt: 1 retryPolicy: {type: "NoRetryPolicy"}
    2022-05-05T14:08:07.298+0000 I ASIO [RS] Connecting to mongo3:27013
    2022-05-05T14:08:07.304+0000 I REPL [replexec-30] Member mongo1:27011 is now in state RS_DOWN

...

Resolution

  • Restarted server, checked for applicant records that might not have been exported

...

Timeline

BST UTC unless otherwise stated

  • 14:03 - Mongo logs on prod-mongodb-replicaset (10.170.0.151) report mongo replica failure, or connectivity failure between them.

  • 14:10 - HttpServerErrorException$InternalServerError reported by ESR-NotificationGenerator in Slack sentry-esr channel.

  • 14:17 - Mongo recovery logs problems with finding a sync source to repair its data(??)and re-establishment of sync between replicas.

  • 14:17 - Mongo3 instability / 'busy-ness' continues as per e.g.
    2022-05-05T15:37:26.017+0000 I COMMAND [conn209] command esrinbounddatawriter.person_mapping command: find { find: "person_mapping", filter: { assignmentId: 21695773, positionId: 5090293 }, limit: 2, $db: "esrinbounddatawriter", $clusterTime: { clusterTime: Timestamp(1651765045, 24), signature: { hash: BinData(0, 7CEAE9D1B5F05579EDB760935C42FFA0BCAAC9B7), keyId: 7050087017069150210 } }, lsid: { id: UUID("a314d8e2-52be-46ad-938a-91ff1b21ae
    ef") } } planSummary: COLLSCAN keysExamined:0 docsExamined:257493 cursorExhausted:1 numYields:2011 nreturned:1 reslen:411 locks:{ Global: { acquireCount: { r: 2012 } }, Database: { acquireCount: { r: 2012 } }, Collection: { acquireCount: { r: 2012 } } } storage:{} protocol:op_msg 115ms
    2022-05-05T15:37:26.394+0000 I TXN [conn236] transaction parameters:{ lsid: { id: UUID("0d220ebd-f2ce-4bd8-bcce-414c08a4716b"), uid: BinData(0, 3911A6AADDA0C6EB92636A3942D0CDDA85E6EB84D7936794A99B8EE3518382CD) }, txnNumber: 61, autocommit: false, readConcern: { level: "snapshot" } }, readTimestamp:Timestamp(1651765045, 25), keysExamined:0 docsExamined:318159 terminationCause:committed timeActiveMicros:120840 timeInactiveMicros:8250 nu
    mYields:0 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { w: 1 } }, Collection: { acquireCount: { w: 1 } } } storage:{} 129ms

  • 18:47 - Mongo3 stabilises? The logging goes back to looking ‘normal’.

  • 15:10 - 19:43 Failures reported by ESR-AuditService, ESR-NotificationGenerator, ESR-Reconciliation, ESR-AppRecordGenerator and ESR-InboundDataWriter in Slack sentry-esr channel.

  • 09:30 - 15:30 Found and forced recalculation of applicant records to be exported.

...

Root Cause(s)

  • Server became unresponsive.

...

Action Items

Action Items

Owner

DONE: Check app records for ESR 7700 (not recorded as exported in TIS) → 3408 (which have been reconciled at some point) → 1,304 (Approved, deduplicated) → 11 (real person records requiring generation with a “no-change” update from TIS)

Joseph (Pepe) Kelly

...

Lessons Learned

  • We didn’t pick up on the significantly busier time of year / discuss any actions after Wed 4th May.