Date

Authors

Status

Documenting

Summary

The database used for exchanging information with ESR failed

Impact

Files with information from ESR weren’t processed for several hours

Non-technical Description

ESR had another period of failing to send files on the day they were generated. As a result, a larger than usual number of files, generated between Friday 29th July and Monday 1st August, were all sent in a short space of time.

The application usually handles this, but this time the database stopped responding. The services that store information failed and a number of files were not processed. The built-in alerting notified the team and, after verifying the status of a number of failed individual transactions, we resolved the immediate problem and resent the instructions to process the files listed below.

...

Trigger

  • Exceptions reported via Slack

...

  • Sentry alerting

...

Resolution

  • Force-stopped the database server, restarted it, then requested processing of a number of files

...

Timeline

BST unless otherwise stated

  • 2022-08-01 16:11 "ESR processing failed" messages start appearing in the Slack #monitoring-esr channel

  • 2022-08-01 16:30ish ESR processes on Prod blue and green stopped

  • 2022-08-01 16:32ish Prod MongoDB server stopped

  • 2022-08-01 18:24 Prod MongoDB server started

  • 2022-08-01 20:43 All ESR processes restarted in defined order

  • 2022-08-01 20:36-21:21 Failed and missed RMC files processed in order defined below

...
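As a rough cross-check of the "several hours" stated under Impact, the durations implied by the timeline above can be worked out as in the sketch below. The timestamps are copied from the timeline entries; the 16:32 entry is marked "ish" in the timeline, so the database downtime figure is approximate. The variable and function names are illustrative only and do not come from any system involved in the incident.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# Timestamps copied from the timeline (BST).
first_failure = datetime.strptime("2022-08-01 16:11", FMT)      # first Slack failure message
db_stopped = datetime.strptime("2022-08-01 16:32", FMT)         # approximate ("16:32ish")
db_started = datetime.strptime("2022-08-01 18:24", FMT)         # Prod MongoDB server started
reprocessing_done = datetime.strptime("2022-08-01 21:21", FMT)  # end of file reprocessing window


def minutes_between(start: datetime, end: datetime) -> int:
    """Return the whole number of minutes from start to end."""
    return int((end - start).total_seconds() // 60)


# Approximate time the Prod MongoDB server was stopped.
db_downtime_min = minutes_between(db_stopped, db_started)

# Time from the first failure message to the end of reprocessing.
incident_span_min = minutes_between(first_failure, reprocessing_done)

print(f"Database stopped for about {db_downtime_min} minutes")
print(f"First failure to end of reprocessing: {incident_span_min} minutes")
```

On these figures the database was stopped for a little under two hours, and just over five hours passed between the first failure message and the end of reprocessing.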

Root Cause(s)

...

Action Items

...