Date

26 Jan

Authors

Joseph (Pepe) Kelly

Status

Documenting

Summary

The store of information that helps ESR & TIS integrate became unavailable.

Impact

The exchange of information between TIS & ESR was disrupted. We will make some checks for missing or erroneous data as a result of the outage.

...

There is a virtual computer (server) that runs the database. This stopped responding to requests and was flagged by our monitoring systems as having disappeared (similar to having crashed). We restarted the server and checked indicators that normal service had resumed.

We made some checks for data that may have been ‘dropped’ or not handled correctly in the ~15 minutes that the database was unavailable. We resubmitted the messages for 2 “New post” notifications, which were then sent to ESR on 23rd April.

...

Trigger

...

Detection

  • Monitoring alert in Slack.

  • Attempts to manually connect to the server

...

BST unless otherwise stated

  • 12:28:40 - Disk IOPS spiked to 328 ops/sec

  • 12:29:48.244 - Connection between cluster nodes failed

  • 12:29:55.763 - container healthcheck failure

  • 12:31 BST - Alert message in Slack

  • 12:38.822 to 12:40:40.159 - Service recovering to a functional state

  • 15:22 - Resubmitted Dead Letter messages for new posts: KSS/FRIM/ITP/800/30162 and KSS/FRIM/ITP/800/30163
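The intervals between the timeline entries above can be worked out with a short script. This is a minimal sketch, assuming all timestamps fall on the same day and are in the same time zone; the helper names `parse_time` and `elapsed` are illustrative and not part of any TIS tooling.

```python
from datetime import datetime, timedelta


def parse_time(stamp: str) -> datetime:
    """Parse an HH:MM:SS timestamp, with optional fractional seconds."""
    fmt = "%H:%M:%S.%f" if "." in stamp else "%H:%M:%S"
    return datetime.strptime(stamp, fmt)


def elapsed(start: str, end: str) -> timedelta:
    """Return the time between two same-day timestamps."""
    return parse_time(end) - parse_time(start)


# Timestamps copied from the timeline above.
iops_spike = "12:28:40"
node_failure = "12:29:48.244"
recovered = "12:40:40.159"

# Time from the cluster node connection failure to the end of recovery.
outage = elapsed(node_failure, recovered)
# Time from the first sign of trouble (disk IOPS spike) to the end of recovery.
since_spike = elapsed(iops_spike, recovered)

print(f"Node failure to recovery: {outage.total_seconds():.3f} s")
print(f"IOPS spike to recovery:   {since_spike.total_seconds():.3f} s")
```

With these inputs the gap from the node connection failure to the end of recovery is just under 11 minutes, and just over 12 minutes when measured from the disk IOPS spike.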

...

Root Cause(s)

  • Nothing more discovered beyond previous outages.

  • Memory and disk reads spiked.

...

Action Items

  • Enable further query logging (Owner: Joseph (Pepe) Kelly)

...

Lessons Learned

  • Group review for RCA and identifying action items from the root causes is very useful.

  • Despite a planned move to a DBaaS platform, there are still actions that might be valuable in the weeks/months before that is in place.