| Date | 26 Jan |
| --- | --- |
| Authors | |
| Status | Documenting |
| Summary | The store of information that helps ESR & TIS integrate (the database) became unavailable. |
| Impact | The exchange of information between TIS & ESR was disrupted. We will check for missing or erroneous data resulting from the outage. |
...
There is a virtual computer (server) that runs the database. This stopped responding to requests and was flagged by our monitoring systems as having disappeared (similar to having crashed). We restarted the server and checked indicators that normal service had resumed.
We made some checks for data that may have been ‘dropped’ or not handled correctly in the ~15 minutes that the database was unavailable. We resubmitted messages for 2 “New post” notifications, which were then sent to ESR on 23rd April.
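For context, a minimal sketch of what the dead-letter resubmission step can look like. The report does not record which message broker is in use, so this assumes an SQS-style dead-letter queue; the queue URLs and function name are placeholders for illustration, not the actual integration code.

```python
import boto3

# Hypothetical queue URLs -- the real broker and queue names are not
# recorded in this report; SQS is an assumption for illustration.
DLQ_URL = "https://sqs.eu-west-2.amazonaws.com/000000000000/esr-new-post-dlq"
MAIN_QUEUE_URL = "https://sqs.eu-west-2.amazonaws.com/000000000000/esr-new-post"

sqs = boto3.client("sqs")


def resubmit_dead_letters(max_messages: int = 10) -> int:
    """Move messages from the dead-letter queue back onto the main queue."""
    moved = 0
    while moved < max_messages:
        resp = sqs.receive_message(
            QueueUrl=DLQ_URL,
            MaxNumberOfMessages=min(max_messages - moved, 10),
            WaitTimeSeconds=1,
        )
        messages = resp.get("Messages", [])
        if not messages:
            break
        for msg in messages:
            # Re-send the original body first, and only delete it from the
            # DLQ after the send succeeds, so no message is lost mid-move.
            sqs.send_message(QueueUrl=MAIN_QUEUE_URL, MessageBody=msg["Body"])
            sqs.delete_message(QueueUrl=DLQ_URL, ReceiptHandle=msg["ReceiptHandle"])
            moved += 1
    return moved


if __name__ == "__main__":
    print(f"Resubmitted {resubmit_dead_letters()} message(s)")
```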
...
Trigger
...
Detection
- Monitoring alert in Slack.
- Attempts to manually connect to the server (a sketch of this kind of check follows).
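A minimal sketch of a manual connectivity check like the one above, assuming a plain TCP probe; the host and port are placeholders, as the real database endpoint and engine are not recorded in this report.

```python
import socket
import sys

# Placeholder endpoint -- adjust for the actual database server.
DB_HOST = "db.example.internal"
DB_PORT = 5432  # assumed Postgres-style port; the engine isn't stated


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError as exc:
        print(f"Connection failed: {exc}", file=sys.stderr)
        return False


if __name__ == "__main__":
    sys.exit(0 if can_connect(DB_HOST, DB_PORT) else 1)
```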
...
All times are BST unless otherwise stated.
- 12:28:40 - Disk IOPS spike to 328 ops/sec
- 12:29:48.244 - Connection between cluster nodes failed
- 12:29:55.763 - Container healthcheck failure
- 12:31 - Alert message in Slack
- 12:38.822 to 12:40:40.159 - Service recovering to a functional state
- 15:22 - Resubmitted Dead Letter messages for new posts: KSS/FRIM/ITP/800/30162 and KSS/FRIM/ITP/800/30163
...
Root Cause(s)
Nothing new was discovered beyond previous outages: memory usage and disk reads spiked.
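For reference, a sketch of how the disk-read spike could be pulled from metrics after the fact. This assumes the server is an AWS EC2 instance reporting to CloudWatch, which the report does not confirm; the instance ID is a placeholder.

```python
from datetime import datetime, timedelta, timezone

import boto3

# Hypothetical instance ID -- the hosting platform isn't stated in this
# report; EC2/CloudWatch is an assumption for illustration.
INSTANCE_ID = "i-0123456789abcdef0"

cloudwatch = boto3.client("cloudwatch")


def read_ops_around_incident(minutes: int = 30):
    """Fetch per-minute DiskReadOps for the window around the outage."""
    end = datetime.now(timezone.utc)
    start = end - timedelta(minutes=minutes)
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="DiskReadOps",
        Dimensions=[{"Name": "InstanceId", "Value": INSTANCE_ID}],
        StartTime=start,
        EndTime=end,
        Period=60,
        Statistics=["Sum"],
    )
    return sorted(resp["Datapoints"], key=lambda d: d["Timestamp"])


if __name__ == "__main__":
    for point in read_ops_around_incident():
        print(point["Timestamp"], point["Sum"])
```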
...
Action Items
| Action Items | Owner |
| --- | --- |
| | |
...
Lessons Learned
Group review for RCA and identifying action items from the root causes is very useful. Despite a planned move to a DBaaS platform, there are still improvements that might be valuable in the weeks/months before that is in place.