Date | |
Authors | |
Status | Documenting |
Summary | The store of information that helps ESR & TIS integrate became unavailable |
Impact | The exchange of information between TIS & ESR was disrupted. We will make some checks for missing or erroneous data as a result of the outage. |
Non-technical Description
There is a virtual computer (server) that runs the database. This stopped responding to requests and was flagged by our monitoring systems as having disappeared (similar to having crashed). We restarted the server and checked indicators that normal service had resumed.
We will make some checks for data that may have been ‘dropped’ or not handled correctly in the ~15 minutes that the database was unavailable.
Trigger
.
Detection
Monitoring alert in Slack.
Manual attempts to connect to the server.
Resolution
Restarted the server.
TODO: Check audit information/DLQ for failed messages.
Timeline
BST unless otherwise stated
-
12:29:48.244 - Connection between cluster nodes failed
12:40:38.822 to -
12:31 BST - Alert message in Slack
-
15:22 - Resubmitted Dead Letter messages for new posts: KSS/FRIM/ITP/800/30162 and KSS/FRIM/ITP/800/30163
-
-
-
Root Cause(s)
Action Items
Action Items | Owner |
---|---|
| |
Lessons Learned
Group review for RCA and identifying action items from the root causes is very useful.