Date	17 Aug 2021
Authors	Joseph (Pepe) Kelly
Status	Documenting
Summary	The database used for holding information sent to & received from ESR was unavailable. This meant data could not be processed. https://hee-tis.atlassian.net/browse/TIS21-1980
Impact	1 trust had a delay of several days for the creation of 1 new post


Non-technical Description

The database got very busy and wasn’t available to process changes from TIS. A notification for a new post was not sent out to trusts. We also did not update other notifications. Other “updates” failed but these did not contain any changes relevant to Notifications or Applicants to be sent to ESR. We have replayed a message manually to generate the necessary notification for the creation of post NWN/REM21/006/HT/001.

...

Trigger

Under review

...

Detection

Application error alerting to Slack

...

Resolution

Hard stopped and started the VM
Checked the Dead Letter Queue and replayed messages
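
The replay step can be sketched roughly as below. This is a minimal in-memory illustration only, not the tooling used during the incident: the queue is modelled with a `deque`, and the function name `replay_dead_letters`, the message fields and the message type `post.created` are all hypothetical. A real replay would read from the message broker's dead letter queue instead.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

Message = Dict[str, str]


def replay_dead_letters(
    dead_letters: Deque[Message],
    handler: Callable[[Message], None],
    should_replay: Callable[[Message], bool],
) -> List[Message]:
    """Replay selected dead-lettered messages and return those that succeeded.

    Messages that are not selected, or that fail again, stay on the queue.
    """
    replayed: List[Message] = []
    remaining: Deque[Message] = deque()
    while dead_letters:
        message = dead_letters.popleft()
        if not should_replay(message):
            remaining.append(message)  # not relevant, leave for later review
            continue
        try:
            handler(message)
        except Exception:
            remaining.append(message)  # still failing, keep it dead-lettered
        else:
            replayed.append(message)
    dead_letters.extend(remaining)
    return replayed


# Hypothetical example data: only the "post.created" message is replayed.
dlq: Deque[Message] = deque(
    [
        {"correlation_id": "a", "type": "post.created"},
        {"correlation_id": "b", "type": "placement.updated"},
    ]
)
processed: List[Message] = []
replayed = replay_dead_letters(
    dlq, processed.append, lambda m: m["type"] == "post.created"
)
```

Filtering before replaying reflects the review described in this report, where most dropped events were disregarded and only the message needed for the new post notification was replayed.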

...

Timeline

17 Aug 2021 - 15:52 - First alert messages received
17 Aug 2021 - 18:01 - Server restarted
17 Aug 2021 - 19 Aug 2021 - Confirmed there were no Applicants or future notifications affected by the outage
20 Aug 2021 - 12:20 - Created the “New Post” (Type 5) notification

...

Root Cause(s)

Slack Message triggered by exceptions in all ESR services.
The ESR services were unable to connect to the database.
Heartbeat checks between replica set nodes failed.
The VM was non-responsive.
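
As an illustration of how failed heartbeats between nodes might be surfaced earlier, the sketch below inspects a document shaped like the output of MongoDB's `replSetGetStatus` command and lists members that look unhealthy. It is an assumption-laden example, not the alerting in place for these services: the function name `unhealthy_members`, the host names and the sample data are hypothetical. With PyMongo, the status document could be fetched with `client.admin.command("replSetGetStatus")`.

```python
from typing import Any, Dict, List

# Member states treated as healthy in this sketch (an assumption).
HEALTHY_STATES = {"PRIMARY", "SECONDARY", "ARBITER"}


def unhealthy_members(status: Dict[str, Any]) -> List[str]:
    """Return the names of replica set members that look unhealthy.

    A member is flagged if its `health` is not 1 or its `stateStr`
    is not one of the states listed in HEALTHY_STATES.
    """
    problems: List[str] = []
    for member in status.get("members", []):
        healthy = (
            member.get("health") == 1
            and member.get("stateStr") in HEALTHY_STATES
        )
        if not healthy:
            problems.append(member.get("name", "<unknown>"))
    return problems


# Hypothetical status document with one unreachable node.
sample_status = {
    "set": "rs0",
    "members": [
        {"name": "mongo1:27017", "health": 1, "stateStr": "PRIMARY"},
        {"name": "mongo2:27017", "health": 1, "stateStr": "SECONDARY"},
        {"name": "mongo3:27017", "health": 0, "stateStr": "(not reachable/healthy)"},
    ],
}
alerts = unhealthy_members(sample_status)
```

A check like this could run on a schedule and post to the existing Slack alerting channel when the returned list is not empty.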

...

Action Items

Action Items	Owner
Add more resilience because moving to a managed service was more complex than anticipated	https://hee-tis.atlassian.net/browse/TIS21-489
Review Applicants & Notifications that might need to be generated	Joseph (Pepe) Kelly
Most dropped events were for placements that have started and so were disregarded, as they would not have produced applicant records eligible for sending to ESR. We also found:
1 record that had recovered, creating the necessary APP record
1 record with a notification (to send Feb '22) for a new post that hasn’t been notified to trusts: `MER/REM21/006/HT/001`. We will ask the regional lead if Trusts can be contacted manually
There were additional unconfigured attempts for some notifications, e.g. correlation id: `6e004633-6e34-426b-94b3-69b2adfbb51c`
Improve alerting from Mongo nodes?
Create some information on replaying messages	Joseph (Pepe) Kelly

...

Lessons Learned

Replaying messages manually is possible.
Don’t put off tech improvement just because it will be obsolete “later”.
