Date

Authors

Liban Hirey (Unlicensed), Reuben Roberts, Marcello Fabbri (Unlicensed)

Status

Resolved - conducting RCA

Summary

Started receiving error messages in the #sentry-esr Slack channel

Impact

ESR stopped working; no files were processed from that point.

...

  • Fatal failure in the ESR vhost (which enables shared hosting of services - ESR, like TIS, is made up of many separate services) caused ESR processes to fail. Deleting and recreating the vhost, and then re-configuring and re-running everything resolved the problem.

...

Trigger

  • vhost ‘went down’ (something corrupted it, so it couldn’t restart properly).

...

...

Resolution

  • tried restarting the broker.

  • restarted vhost.

  • between those two events, the CDC went down.

  • initial errors, above, were resolved, but a new set of errors then appeared: SocketException (4) and CannotCreateTransactionException (2), followed by AmqpIOException (1), another SocketException (1), and finally an IllegalStateException (1) and an AmqpRejectAndDontRequeueException (1).

  • decided to recreate the vhost by deleting the old one and creating a new one (couldn’t rename the corrupted one to “vhost_old” or anything, so had to delete it in order to create a new one with the same name as the original).

  • changed the CDC config to point to the new vhost

  • felt there was a risk that we would start pushing messages to the wrong place

  • decided to reset Maxwell to 06:25 this morning and replay. This ensures we don't miss processing changes, but introduces the smaller risk of creating duplicates for changes made between 06:25 and when the vhost went down (07:36 - when AmazonMQ started erroring?).

  • stopped and restarted CDC (several times)

  • problem classed as resolved
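The duplicate risk from the Maxwell replay can be illustrated with a minimal sketch. It assumes each replayed change carries a stable key and a time of day, and that we have some record of the keys already processed before the vhost went down. The function name, the event shape (`key`, `timestamp`) and the `processed_keys` set are all hypothetical and are not taken from the ESR services or from Maxwell itself.

```python
from datetime import time


def split_replayed_events(events, processed_keys, replay_start, outage_start):
    """Partition replayed events into ones to process and likely duplicates.

    An event counts as a likely duplicate only if it falls inside the
    overlap window [replay_start, outage_start) AND its key was already
    processed before the outage. Everything else is passed through.
    """
    to_process, duplicates = [], []
    for event in events:
        in_overlap = replay_start <= event["timestamp"] < outage_start
        if in_overlap and event["key"] in processed_keys:
            duplicates.append(event)
        else:
            to_process.append(event)
    return to_process, duplicates


# Illustrative data only: the keys are made up; the window uses the
# 06:25 reset point and the 07:36 time noted in the resolution steps.
events = [
    {"key": "a", "timestamp": time(6, 30)},
    {"key": "b", "timestamp": time(7, 0)},
    {"key": "c", "timestamp": time(8, 0)},
]
to_process, duplicates = split_replayed_events(
    events, processed_keys={"a"},
    replay_start=time(6, 25), outage_start=time(7, 36),
)
```

In this sketch only event "a" is flagged, because it sits inside the overlap window and was already processed; "b" and "c" are passed through for processing.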

...

Timeline

  • 07:26 - 406 Channel shutdown errors on #sentry-esr.

  • 09:11 - began investigating seriousness of these alerts.

  • 12:35 - identified vhost as the blocker to getting ESR up and running again.

  • 12:58 - reviewed Confluence docs.

  • 13:06 - fire-fire called.

  • 13:09 - 2hr fire-fire call with Liban Hirey (Unlicensed) driving.

  • 14:40 - standard schedule of ESR file processing resumed successfully.

Root Cause(s)

  • TBC.

...

Action Items

  • Open a ticket with AWS to see if anything happened on their side. Owner: Liban Hirey (Unlicensed)

  • Check files in S3. Owner: Marcello Fabbri (Unlicensed)

  • Check applicants and notifications created today and match them up against the transactions. Owner: Reuben Roberts

  • More? Owner: Who?
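The check of applicants and notifications against transactions could be approached along these lines. This is a rough sketch only: it assumes both sides can be exported as lists of a shared identifier, which may not match how the ESR data is actually stored, and the `reconcile` function and its field names are hypothetical.

```python
from collections import Counter


def reconcile(transaction_ids, created_ids):
    """Compare expected transaction IDs with the IDs of records created.

    Returns IDs that have no created record (missing), IDs created more
    than once (duplicated, e.g. by a replay), and created IDs with no
    matching transaction (unmatched).
    """
    counts = Counter(created_ids)
    expected = set(transaction_ids)
    created = set(counts)
    return {
        "missing": sorted(expected - created),
        "duplicated": sorted(i for i, n in counts.items() if n > 1),
        "unmatched": sorted(created - expected),
    }


# Made-up identifiers, purely to show the shape of the result.
report = reconcile(["t1", "t2", "t3"], ["t1", "t1", "t4"])
```

Any non-empty "missing" or "duplicated" list would point at changes lost during the outage or created twice by the replay.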

...

Lessons Learned

  • Sentry alerts were the only detection we had.

  • Still not really enough knowledge / confidence in the wider team with the ESR services.

  • The whole team were, however, quick to muck in and try to logically work through the fire-fire, despite the absence of Joseph (Pepe) Kelly, Andy Dingley and John Simmons (Deactivated) (and others).

  • Team found it difficult to determine the implications of the failure, and therefore the priority of the response.

  • We have two tech leads that flagrantly flout the rules of booked annual leave! Many thanks, guys.

  • Do we feel that this was an edge case one-off failure, or is it an indication that we need to proactively encourage the team to become more familiar with the ESR services?