2022-04-22 ESR integration database went down

Date

Apr 22, 2022

Authors

@Joseph (Pepe) Kelly

Status

Documenting

Summary

The data store that supports the integration between ESR & TIS became unavailable.

Impact

The exchange of information between TIS & ESR was disrupted. We checked for data that was missing or erroneous as a result of the outage.

Non-technical Description

There is a virtual computer (server) that runs the database. This stopped responding to requests and was flagged by our monitoring systems as having disappeared (similar to having crashed). We restarted the server and checked indicators that normal service had resumed.

We checked for data that may have been ‘dropped’ or handled incorrectly during the ~15 minutes that the database was unavailable. We resubmitted messages for two “New post” notifications, which were then sent to ESR on 23rd April.

 


Trigger

  •  

Detection

  • Monitoring alert in Slack.

  • Attempts to manually connect to the server
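
The manual connection attempt can be approximated with a simple TCP reachability check. This is a minimal sketch only: the function name `can_connect`, the host name and the port number are hypothetical placeholders and are not taken from the incident. It shows whether the server accepts connections at all, not whether the database behind it is healthy.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


# Example call (hypothetical host and port, replace with the real server's values):
# can_connect("db.example.internal", 9999)
```

A result of False matches what was seen during the outage: the server did not respond to connection requests.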


Resolution

  • Restarted the server

  • TODO: Check audit information/DLQ for failed messages.


Timeline

BST unless otherwise stated

  • Apr 22, 2022 12:28:40 - Disk IOPS spiked to 328 ops/sec

  • Apr 22, 2022 12:29:48.244 - Connection between cluster nodes failed

  • Apr 22, 2022 12:29:55.763 - Container healthcheck failure

  • Apr 22, 2022 12:31 - Alert message in Slack

  • Apr 22, 2022 12:38.822 to 12:40:40.159 - Service recovering to a functional state

  • Apr 22, 2022 15:22 - Resubmitted Dead Letter messages for new posts: KSS/FRIM/ITP/800/30162 and KSS/FRIM/ITP/800/30163
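
The intervals between the timeline events can be worked out from the recorded timestamps. The sketch below is illustrative only: the helper `elapsed_seconds` and the variable names are made up for this example, and the Slack alert is assumed to have arrived at exactly 12:31:00 because the timeline records it to the minute only.

```python
from datetime import datetime

FMT = "%H:%M:%S.%f"


def elapsed_seconds(start: str, end: str) -> float:
    """Return the seconds between two same-day HH:MM:SS.mmm timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


# Timestamps copied from the timeline (BST, Apr 22, 2022).
node_connection_failed = "12:29:48.244"
healthcheck_failed = "12:29:55.763"
slack_alert = "12:31:00.000"  # assumption: the timeline gives 12:31 only
service_functional = "12:40:40.159"

outage = elapsed_seconds(node_connection_failed, service_functional)
alert_delay = elapsed_seconds(healthcheck_failed, slack_alert)

print(f"Node connection failure to functional service: {outage:.3f} s ({outage / 60:.1f} min)")
print(f"Healthcheck failure to Slack alert: {alert_delay:.3f} s")
```

Under these assumptions the gap between the node connection failure and the return to a functional state is roughly 11 minutes, and the alert reached Slack about a minute after the healthcheck failed.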

 


Root Cause(s)

  • Nothing new was discovered beyond what was found in previous outages.

  • Memory usage and disk reads spiked.


Action Items

  • Enable further query logging. Owner: @Joseph (Pepe) Kelly


Lessons Learned

  • Although a move to a DBaaS platform is planned, improvements such as further query logging may still be valuable in the weeks or months before that move is complete.