Date	22 Apr 2022
Authors	Joseph (Pepe) Kelly
Status	Documenting
Summary	The store of information that helps ESR & TIS integrate became unavailable
Impact	The exchange of information between TIS & ESR was disrupted. We will make some checks for missing or erroneous data as a result of the outage.

There is a virtual computer (server) that runs the database. This stopped responding to requests and was flagged by our monitoring systems as having disappeared (similar to having crashed). We restarted the server and checked indicators that normal service had resumed.

We made some checks for data that may have been ‘dropped’ or not handled correctly in the ~15 minutes that the database was unavailable. We resubmitted messages for 2 “New post” notifications, which were then sent to ESR on 23rd April.
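The kind of check described above can be sketched as a filter over message timestamps. This is a minimal illustration, not the script that was actually used: the record layout (`id`, `received_at`), the sample records, and the 2-minute safety margin are all assumptions, and the window bounds are simply taken from the timeline in this report.

```python
from datetime import datetime, timedelta

# Illustrative window taken from the timeline in this report (BST, naive datetimes).
OUTAGE_START = datetime(2022, 4, 22, 12, 28, 40)
OUTAGE_END = datetime(2022, 4, 22, 12, 40, 41)


def messages_to_review(messages, start, end, margin=timedelta(minutes=2)):
    """Return messages whose timestamp falls inside the padded outage window."""
    lower, upper = start - margin, end + margin
    return [m for m in messages if lower <= m["received_at"] <= upper]


# Hypothetical records; the field names are assumptions, not the real message schema.
sample = [
    {"id": "msg-1", "received_at": datetime(2022, 4, 22, 12, 20, 0)},
    {"id": "msg-2", "received_at": datetime(2022, 4, 22, 12, 30, 15)},
    {"id": "msg-3", "received_at": datetime(2022, 4, 22, 12, 42, 0)},
    {"id": "msg-4", "received_at": datetime(2022, 4, 22, 13, 5, 0)},
]

flagged = [m["id"] for m in messages_to_review(sample, OUTAGE_START, OUTAGE_END)]
print(flagged)
```

Padding the window on both sides is a deliberate choice: messages that arrived just before the failure or just after recovery may also have been handled incorrectly, so it is safer to review slightly too many than too few.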

...

Trigger

...

Detection

Monitoring alert in Slack.
Attempts to manually connect to the server
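A manual connection attempt like the one listed above can be approximated with a plain TCP check. This is only a sketch: the host name and port in the usage comment are hypothetical placeholders, and a successful TCP connection shows that the server is reachable, not that the database itself is healthy.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Example usage (hypothetical host and port, shown as a comment so that
# nothing is contacted when the file is merely imported):
#   can_connect("db.example.internal", 5432)
```

A short timeout keeps the check quick when the server has disappeared entirely, which is the situation described in this incident.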

...

BST unless otherwise stated

22 Apr 2022 12:28:40 - Disk IOPS spike to 328 ops/sec
22 Apr 2022 12:29:48.244 - Connection between cluster nodes failed
22 Apr 2022 12:29:55.763 - container healthcheck failure
22 Apr 2022 12:31 BST - Alert message in Slack
22 Apr 2022 12:38.822 to 12:40:40.159 - Service recovering to a functional state
22 Apr 2022 15:22 - Resubmitted Dead Letter messages for new posts: KSS/FRIM/ITP/800/30162 and KSS/FRIM/ITP/800/30163

...

Root Cause(s)

Nothing more discovered beyond previous outages.
Memory and disk reads spiked.

...

Action Items

Action Items	Owner
Enable further query logging	Joseph (Pepe) Kelly

...

Lessons Learned

Group review for RCA and identifying action items from the root causes is very useful.
Despite a planned move to a DBaaS platform, there are still things that might be valuable in the weeks/months before that is in place.
