Date: 2021-03-22

Authors: Liban Hirey (Unlicensed), Reuben Roberts, Marcello Fabbri (Unlicensed)

Status: Resolved - conducting RCA

Summary: Started receiving error messages in the #sentry-esr Slack channel

Impact: ESR stopped working; no files would be processed from that point

Non-technical Description

  • Changes to TIS data that need to be relayed to ESR are published to a message queue, which is in turn consumed by the ESR exporter to generate data-change notification files for ESR to process. A fatal failure in the ESR vhost (a virtual host enables shared hosting of services on one message broker - ESR, like TIS, is made up of many separate services) caused this sequence to fail. The problem was resolved by deleting and recreating the vhost and its message-queue configuration, and then recreating the lost data-change messages by reprocessing the TIS data changes from just before the error occurred. A minimal sketch of the publish/consume flow is shown below.
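
For context only, the flow that failed looks roughly like the sketch below, written against the RabbitMQ Java client that appears in the error logs later in this report. The host, queue name and message body are illustrative placeholders, not the real TIS/ESR configuration.

    import com.rabbitmq.client.Channel;
    import com.rabbitmq.client.Connection;
    import com.rabbitmq.client.ConnectionFactory;
    import com.rabbitmq.client.DeliverCallback;

    import java.nio.charset.StandardCharsets;

    public class DataChangeFlowSketch {

        // Placeholder name, not the real TIS/ESR queue.
        private static final String QUEUE = "esr.data.changes";

        public static void main(String[] args) throws Exception {
            ConnectionFactory factory = new ConnectionFactory();
            factory.setHost("localhost");
            factory.setVirtualHost("/"); // the vhost that became inaccessible in this incident

            try (Connection connection = factory.newConnection();
                 Channel channel = connection.createChannel()) {

                channel.queueDeclare(QUEUE, true, false, false, null);

                // "TIS side": publish a data-change notification onto the queue.
                String change = "{\"table\":\"person\",\"id\":123,\"op\":\"update\"}";
                channel.basicPublish("", QUEUE, null, change.getBytes(StandardCharsets.UTF_8));

                // "ESR exporter side": consume the change; in reality this would feed
                // the generation of a data-change notification file for ESR.
                DeliverCallback onMessage = (tag, delivery) ->
                        System.out.println("Would export: "
                                + new String(delivery.getBody(), StandardCharsets.UTF_8));
                channel.basicConsume(QUEUE, true, onMessage, consumerTag -> { });

                Thread.sleep(1000); // allow delivery before the connection closes
            }
        }
    }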

...

Trigger

  • The vhost 'went down': the ESR RabbitMQ default virtual host ('/') became inaccessible, so data-change messages could not be published or consumed. Something corrupted the vhost, so on restart it could not come back up properly (see the sketch below for a simple way to check whether the vhost is reachable).
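
One way to confirm this condition is to attempt a connection scoped to the affected vhost; when the vhost is down, the broker refuses access in the same way as the "vhost '/' is down" error recorded in the Resolution section. This is a minimal sketch using the RabbitMQ Java client; the host and credentials are placeholders.

    import com.rabbitmq.client.Connection;
    import com.rabbitmq.client.ConnectionFactory;

    public class VhostHealthCheck {

        // Host and credentials are placeholders; the real broker is the
        // Amazon MQ instance 'TIS-Prod_RabbitMQ'.
        public static boolean vhostReachable(String host, String vhost, String user, String pass) {
            ConnectionFactory factory = new ConnectionFactory();
            factory.setHost(host);
            factory.setVirtualHost(vhost);
            factory.setUsername(user);
            factory.setPassword(pass);

            try (Connection ignored = factory.newConnection()) {
                return true; // the broker accepted a connection to this vhost
            } catch (Exception e) {
                // A down vhost surfaces here as a refused/closed connection.
                System.err.println("Cannot reach vhost '" + vhost + "': " + e.getMessage());
                return false;
            }
        }

        public static void main(String[] args) {
            System.out.println("vhost '/' reachable: "
                    + vhostReachable("localhost", "/", "guest", "guest"));
        }
    }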

...

Detection

...

Resolution

  • Restarting the Amazon MQ broker ('TIS-Prod_RabbitMQ') did not resolve the error. The RabbitMQ management panel displayed the message ‘Virtual host / experienced an error on node rabbit@localhost and may be inaccessible’.

  • It was not possible to successfully restart the RabbitMQ virtual host.

  • The initial errors, above, were resolved, but a new set of errors was then created: SocketException (4) and CannotCreateTransactionException (2), followed by an AmqpIOException (1), another SocketException (1), and finally an IllegalStateException (1) and an AmqpRejectAndDontRequeueException (1).

  • Between the broker restart and the vhost restart attempts, the ESR-Data_exporter component failed, with Maxwell CDC errors caused by failure to connect to the RabbitMQ instance (e.g. 2021-03-22 12:59:04 WARN An unexpected connection driver error occured (Exception message: Socket closed) …. Caused by: com.rabbitmq.client.ShutdownSignalException: connection error; protocol method: #method<connection.close>(reply-code=541, reply-text=INTERNAL_ERROR - access to vhost '/' refused for user 'tisprodrabbit': vhost '/' is down, class-id=10, method-id=40)).

  • It was decided to recreate the RabbitMQ virtual host by deleting the old vhost and recreating it. It was not possible to rename the corrupted vhost, and creating a new vhost with a different name would have meant that a number of components would need to be redeployed with revised configurations. As such, it was deemed least disruptive to simply delete the corrupt vhost (which would also delete any queued messages) in order to create a new one with the same name as the original (a sketch of this delete-and-recreate step via the management API follows this list).

  • The CDC configuration was changed to point to the new vhost.

  • There was a perceived risk that messages would start being pushed to the wrong place.

  • It was decided to reset Maxwell to 06:25 that morning and replay from there, ensuring no data changes were missed, at the smaller risk of creating duplicates for changes made between 06:25 and the point the vhost went down (07:36 - when Amazon MQ started erroring?).

  • The CDC was stopped and restarted (several times).

  • The problem was classed as resolved.
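
For reference, the delete-and-recreate step can be driven through the RabbitMQ management HTTP API (DELETE then PUT on /api/vhosts/{name}, plus re-granting user permissions). The report does not state exactly how the vhost was recreated, so this is only a sketch of one way to do it: the broker URL and admin credentials are placeholders, and the assumption that 'tisprodrabbit' (the user named in the error above) needs its permissions re-granted is exactly that, an assumption.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.Base64;

    public class RecreateVhostSketch {

        public static void main(String[] args) throws Exception {
            // Placeholders: the real values belong to the 'TIS-Prod_RabbitMQ' broker.
            String api = "https://broker.example.com/api";
            String auth = "Basic " + Base64.getEncoder().encodeToString("admin:secret".getBytes());
            String vhost = "%2F"; // the default vhost '/' must be URL-encoded in the management API

            HttpClient client = HttpClient.newHttpClient();

            // 1. Delete the corrupted vhost (this also discards any queued messages, as noted above).
            send(client, HttpRequest.newBuilder(URI.create(api + "/vhosts/" + vhost))
                    .header("Authorization", auth)
                    .DELETE()
                    .build());

            // 2. Recreate a vhost with the same name so no component configuration has to change.
            send(client, HttpRequest.newBuilder(URI.create(api + "/vhosts/" + vhost))
                    .header("Authorization", auth)
                    .header("Content-Type", "application/json")
                    .PUT(HttpRequest.BodyPublishers.ofString("{}"))
                    .build());

            // 3. Re-grant the application user its permissions on the recreated vhost.
            send(client, HttpRequest.newBuilder(URI.create(api + "/permissions/" + vhost + "/tisprodrabbit"))
                    .header("Authorization", auth)
                    .header("Content-Type", "application/json")
                    .PUT(HttpRequest.BodyPublishers.ofString(
                            "{\"configure\":\".*\",\"write\":\".*\",\"read\":\".*\"}"))
                    .build());
        }

        private static void send(HttpClient client, HttpRequest request) throws Exception {
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(request.method() + " " + request.uri() + " -> " + response.statusCode());
        }
    }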

...

Timeline

  • 07:26 - 406 Channel shutdown errors on #sentry-esr.

  • 09:11 - began investigating seriousness of these alerts.

  • 12:35 - identified vhost as the blocker to getting ESR up and running again.

  • 12:58 - reviewed Confluence docs.

  • 13:06 - fire-fire called.

  • 13:09 - 2hr fire-fire call with Liban Hirey (Unlicensed) driving.

  • 14:40 - standard schedule of ESR file processing resumed successfully.

Root Cause(s)

  • TBC.

...

Action Items

  • Open a ticket with AWS to see if anything happened on their side - Owner: Liban Hirey (Unlicensed)

  • Check files in S3 - Owner: Marcello Fabbri (Unlicensed)

  • Check applicants and notifications created today and match them up against the transactions - Owner: Reuben Roberts

  • More? - Owner: Who?

...

Lessons Learned

  • Sentry alerts were the only detection we had.

  • There is still not enough knowledge of, or confidence with, the ESR services in the wider team.

  • The whole team were, however, quick to muck in and try to logically work through the fire-fire, despite the absence of Joseph (Pepe) Kelly, Andy Dingley and John Simmons (Deactivated) (and others).

  • Team found it difficult to determine the implications of the failure, and therefore the priority of the response.

  • We have two tech leads that flagrantly flout the rules of booked annual leave! Many thanks, guys.

  • Do we feel that this was an edge case one-off failure, or is it an indication that we need to proactively encourage the team to become more familiar with the ESR services?