Date | |
Authors | |
Status | Documenting |
Summary | The database that holds information sent to and received from ESR was unavailable, so data could not be processed. |
Impact | One trust experienced a delay of several days in the creation of one new post. |
Non-technical Description
The database became very busy and was not available to process changes from TIS. As a result, a notification for a new post was not sent out to trusts, and other notifications were not updated. Other “updates” also failed, but these did not contain any changes relevant to Notifications or Applicants to be sent to ESR. We manually replayed a message to generate the necessary notification for the creation of post NWN/REM21/006/HT/001.
Trigger
Under review
Detection
Application error alerting to Slack
Resolution
Hard-stopped and restarted the VM
Checked the Dead Letter Queue and replayed messages
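The dead letter review can be sketched roughly as below. This is only an illustration: the `DeadLetter` fields, the event type names, and the `needs_replay` rule are assumptions made for the example, not the actual TIS/ESR message schema or tooling. The rule mirrors the review finding that events for placements which had already started were disregarded, while the new post notification was replayed.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class DeadLetter:
    """Hypothetical shape of a dead-lettered event (not the real schema)."""
    event_type: str       # assumed names, e.g. "post.created", "placement.updated"
    reference: str        # identifier of the post or placement
    effective_from: date  # placement start date, or post creation date


def needs_replay(message: DeadLetter, today: date) -> bool:
    """Assumed rule: always replay new posts; replay placement events
    only when the placement has not started yet."""
    if message.event_type == "post.created":
        return True
    return message.effective_from > today


def replay(
    dead_letters: Iterable[DeadLetter],
    publish: Callable[[DeadLetter], None],
    today: date,
) -> List[DeadLetter]:
    """Republish the messages that still matter and return the skipped ones
    so they can be reviewed by hand."""
    skipped: List[DeadLetter] = []
    for message in dead_letters:
        if needs_replay(message, today):
            publish(message)
        else:
            skipped.append(message)
    return skipped
```

In practice `publish` would be whatever sends a message back to the original queue; passing it in as a callable keeps the filtering logic testable without a broker.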
Timeline
- 15:52 - First alert messages received
- 18:01 - Server restarted
- - Confirmed there were no Applicants or future notifications affected by the outage
- 12:20 - Created the “New Post” (Type 5) notification
Root Cause(s)
A Slack message was triggered by exceptions in all ESR services.
The ESR services were unable to connect to the database.
Heartbeat checks between replicaSet nodes failed.
The VM was non-responsive.
Action Items
Action Items | Owner |
---|---|
Add more resilience because moving to a managed service was more complex than anticipated | |
Review Applicants & Notifications that might need to be generated | Joseph (Pepe) Kelly Most dropped events were for placements that have started and so were disregarded as they would not have produced applicant records eligible for sending to ESR. We also found:
|
Improve alerting from Mongo nodes? | |
Create documentation on replaying messages | |
Lessons Learned
Replaying messages manually is possible.
Don’t put off technical improvements just because they will be obsolete “later”.