2021-08-17 ESR Integration database unavailable

Date

Aug 17, 2021

Authors

@Joseph (Pepe) Kelly

Status

Documenting

Summary

The database used for holding information sent to & received from ESR was unavailable. This meant data was unable to be processed.

https://hee-tis.atlassian.net/browse/TIS21-1980

Impact

1 trust had a delay of several days for the creation of 1 new post

Non-technical Description

The database got very busy and wasn’t available to process changes from TIS. A notification for a new post was not sent out to trusts. We also did not update other notifications. Other “updates” failed but these did not contain any changes relevant to Notifications or Applicants to be sent to ESR. We have replayed a message manually to generate the necessary notification for the creation of post NWN/REM21/006/HT/001.


Trigger

 

Cluster became unhealthy. Contributing factors under review.


Detection

  • Application Error Alerting to slack

 


Resolution

  • Hard Stopped and started the VM

  • Checking Dead Letter Queue and replaying


Timeline

  • Aug 17, 2021 - 15:52 - First alert messages received

  • Aug 17, 2021 - 18:01 - Server restarted

  • Aug 17, 2021 - Aug 19, 2021 - Confirmed there were no Applicants or future notifications affected by the outage

  • Aug 20, 2021 - 12:20 - Created the “New Post” (Type 5) notification

 


Root Cause(s)

  • Slack Message triggered by exceptions in all ESR services.

  • The ESR services were unable to connect to the database.

  • Heartbeat checks between replicaSet nodes failed.

  • The VM was non-responsive.

  • The following MongoDB (mongo1) log samples give an indication of the instability of the cluster at that time:
    {"log":"2021-08-17T14:52:50.374+0000 I REPL_HB [replexec-67] Error in heartbeat (requestId: 3436593) to mongo3:27013, response status: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit\n","stream":"stdout","time":"2021-08-17T14:52:50.582068394Z"}
    {"log":"2021-08-17T14:52:50.598+0000 I REPL [replexec-67] Member mongo3:27013 is now in state RS_DOWN\n","stream":"stdout","time":"2021-08-17T14:52:50.598267401Z"}

    {"log":"2021-08-17T14:53:14.795+0000 I REPL_HB [replexec-67] Error in heartbeat (requestId: 3436596) to mongo3:27013, response status: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit\n","stream":"stdout","time":"2021-08-17T14:53:16.436295842Z"}

    {"log":"2021-08-17T14:53:17.723+0000 I REPL_HB [replexec-68] Error in heartbeat (requestId: 3436597) to mongo2:27012, response status: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit\n","stream":"stdout","time":"2021-08-17T14:53:19.63301921Z"}

    {"log":"2021-08-17T14:53:19.615+0000 I REPL [replexec-67] Not starting an election, since we are not electable due to: Not standing for election because I cannot see a majority (mask 0x1)\n","stream":"stdout","time":"2021-08-17T14:53:19.681137738Z"}

    {"log":"2021-08-17T14:54:36.354+0000 I REPL [replexec-68] VoteRequester(term 101 dry run) received a no vote from mongo2:27012 with reason "candidate's term (101) is lower than mine (102)"; response message: { term: 102, voteGranted: false, reason: "candidate's term (101) is lower than mine (102)", ok: 1.0, operationTime: Timestamp(1629211887, 1), $clusterTime: { clusterTime: Timestamp(1629211887, 1), signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } } }\n","stream":"stdout","time":"2021-08-17T14:54:37.533307206Z"}
    {"log":"2021-08-17T14:54:36.354+0000 I REPL [replexec-68] not running for primary, we have been superseded already during dry run. original term: 101, current term: 102\n","stream":"stdout","time":"2021-08-17T14:54:37.533313897Z"}

    {"log":"2021-08-17T14:55:02.747+0000 I CONNPOOL [RS] Ending connection to host mongo2:27012 due to bad connection status; 1 connections to that host remain open\n","stream":"stdout","time":"2021-08-17T14:55:09.266545288Z"}

    {"log":"2021-08-17T14:56:03.032+0000 I REPL [SyncSourceFeedback] SyncSourceFeedback error sending update to mongo2:27012: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit\n","stream":"stdout","time":"2021-08-17T14:56:06.838424722Z"}
    {"log":"2021-08-17T14:56:03.100+0000 I ASIO [RS] Failed to connect to mongo2:27012 - HostUnreachable: Error connecting to mongo2:27012 :: caused by :: Could not find address for mongo2:27012: SocketException: Host not found (non-authoritative), try again later\n","stream":"stdout","time":"2021-08-17T14:56:06.838430607Z"}


Action Items

Action Items

Owner

Action Items

Owner

Add more resilience because moving to a managed service was more complex than anticipated

https://hee-tis.atlassian.net/browse/TIS21-489

Review Applicants & Notifications that might need to be generated

@Joseph (Pepe) Kelly Most dropped events were for placements that have started and so were disregarded as they would not have produced applicant records eligible for sending to ESR. We also found:

  • 1 record that had recovered, creating the necessary APP record

  • 1 record that a notification (to send Feb '22) that was

  • New post hasn’t been notified to trusts: MER/REM21/006/HT/001. We will ask the regional lead if Trusts can be contacted manually

  • There were additional unconfigured attempts for some notifications, e.g. correlation id: 6e004633-6e34-426b-94b3-69b2adfbb51c

Improve alerting from Mongo nodes?

 

Create some information on replaying messages

@Joseph (Pepe) Kelly


Lessons Learned

  • Replaying messages manually is possible.

  • Don’t put off tech improvement just because it will be obsolete “later”.