
Date

Authors

Joseph (Pepe) Kelly

Status

Documenting

Summary

The store of information that helps ESR & TIS integrate became unavailable.

Impact

The exchange of information between TIS & ESR was disrupted. We will check for data that is missing or erroneous as a result of the outage.

Non-technical Description

The database runs on a virtual computer (server). The server stopped responding to requests, and our monitoring systems flagged it as having disappeared (similar to a crash). We restarted the server and checked the indicators to confirm that normal service had resumed.

We will make some checks for data that may have been ‘dropped’ or not handled correctly in the ~15 minutes that the database was unavailable.


Trigger

  • .

Detection

  • Monitoring alert in Slack.

  • Attempts to manually connect to the server


Resolution

  • Restarted the server

  • TODO: Check audit information/DLQ for failed messages.
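
As a rough illustration of the audit check noted above, the sketch below selects records whose timestamp falls inside the outage window so they can be reviewed by hand. It is only a sketch under stated assumptions: the record layout (`id`, `timestamp`), the function name `records_in_window`, the sample records and the placeholder date are all hypothetical. The window start (12:29:48) and length (~15 minutes) are taken from this report.

```python
from datetime import datetime, timedelta


def records_in_window(records, start, duration):
    """Return the records whose timestamp lies in [start, start + duration]."""
    end = start + duration
    return [r for r in records if start <= r["timestamp"] <= end]


# Placeholder date: the incident date is not recorded in this report.
day = datetime(2000, 1, 1)

# Window start from the timeline; length from the "~15 minutes" estimate.
outage_start = day.replace(hour=12, minute=29, second=48)
outage_length = timedelta(minutes=15)

# Hypothetical audit records, for illustration only.
sample = [
    {"id": "msg-1", "timestamp": day.replace(hour=12, minute=15)},
    {"id": "msg-2", "timestamp": day.replace(hour=12, minute=35)},
    {"id": "msg-3", "timestamp": day.replace(hour=13, minute=5)},
]

to_review = records_in_window(sample, outage_start, outage_length)
print([r["id"] for r in to_review])
```

Widening `outage_length` slightly is a cheap way to catch messages that were in flight just before or after the outage.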


Timeline

BST unless otherwise stated

  • -

  • 12:29:48.244 - Connection between cluster nodes failed

  • 12:40:38.822 to -

  • 12:31 BST - Alert message in Slack

  • -

  • 15:22 - Resubmitted Dead Letter messages for new posts: KSS/FRIM/ITP/800/30162 and KSS/FRIM/ITP/800/30163

  • -

  • -

  • -


Root Cause(s)


Action Items

Action Items

Owner

:


Lessons Learned

  • A group review to carry out the RCA and identify action items from the root causes is very useful.
