Date | |
---|---|
Authors | Liban Hirey (Unlicensed), Reuben Roberts, Marcello Fabbri (Unlicensed) |
Status | Resolved - conducting RCA |
Summary | Started receiving error messages in the #sentry-esr Slack channel |
Impact | ESR stopped working; no files were processed from that point |
Non-technical Description
Changes to TIS data that need to be relayed to ESR are published to a message queue, which is in turn consumed by the ESR exporter to generate data-change notification files for ESR to process. A fatal failure in the vhost of ESR's message-queueing component (a virtual host enables shared hosting of services - ESR, like TIS, is made up of many separate services) caused this sequence to fail. The problem was resolved by deleting and recreating the vhost and its message-queue configuration, then recreating the lost data-change messages by reprocessing the TIS data changes from just before the error occurred.
...
Trigger
The ESR RabbitMQ default virtual host ('/') 'went down': something corrupted it, so it could not restart properly and became inaccessible, meaning data-change messages could not be published or consumed.
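One quick way to distinguish a single-vhost outage like this from a broker-wide failure is the management API's per-vhost aliveness test, which publishes and consumes a test message on the given vhost. A minimal sketch, assuming management-API access to the broker (the hostname in the example is a placeholder; note that the default vhost name '/' must be percent-encoded in the URL):

```python
from urllib.parse import quote

def vhost_health_url(host: str, vhost: str = "/") -> str:
    """Build the RabbitMQ management API 'aliveness-test' URL for a vhost.

    The aliveness test declares a queue, then publishes and consumes a
    message on the given vhost - a quick check of whether a vhost such
    as '/' is actually serving traffic.
    """
    # '/' must be sent as %2F in a management API path segment
    return f"https://{host}/api/aliveness-test/{quote(vhost, safe='')}"

# Example (hypothetical broker hostname):
# vhost_health_url("b-1234.mq.eu-west-2.amazonaws.com")
# -> "https://b-1234.mq.eu-west-2.amazonaws.com/api/aliveness-test/%2F"
```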
...
Detection
#sentry-esr Slack channel received 3 “Channel shutdown: channel error; protocol method: #method<channel.close>(reply-code=406, reply-te...” error messages on Monday morning.
...
Resolution
Restarting the Amazon MQ broker ('TIS-Prod_RabbitMQ') did not resolve the error. The RabbitMQ management panel displayed the message ‘Virtual host / experienced an error on node rabbit@localhost and may be inaccessible’.
It was not possible to successfully restart the RabbitMQ virtual host.
Between those two events, the CDC went down.
The initial 406 errors were resolved, but a new set of errors then appeared: SocketException (4) and CannotCreateTransactionException (2), followed by AmqpIOException (1), another SocketException (1), and finally an IllegalStateException (1) and an AmqpRejectAndDontRequeueException (1).
At this point, the ESR-Data_exporter component failed, with Maxwell CDC errors caused by failure to connect to the RabbitMQ instance, e.g.:
2021-03-22 12:59:04 WARN An unexpected connection driver error occured (Exception message: Socket closed)
….Caused by: com.rabbitmq.client.ShutdownSignalException: connection error; protocol method: #method<connection.close>(reply-code=541, reply-text=INTERNAL_ERROR - access to vhost '/' refused for user 'tisprodrabbit': vhost '/' is down, class-id=10, method-id=40)
It was decided to recreate the RabbitMQ virtual host by deleting the old vhost and recreating it. It was not possible to rename the corrupted vhost (e.g. to ‘vhost_old’), and creating a new vhost with a different name would have meant redeploying a number of components with revised configurations. It was therefore deemed least disruptive to delete the corrupt vhost (which would also delete any queued messages) and create a new one with the same name as the original.
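Since Amazon MQ does not provide shell access to the broker, vhost operations like this go through the standard RabbitMQ management HTTP API. A sketch of the three calls involved - delete the corrupt vhost, recreate it under the same name, and restore the application user's permissions (host is a placeholder; authentication, error handling, and re-declaration of exchanges/queues are omitted; this only builds the requests, it does not send them):

```python
import json
from urllib.parse import quote
from urllib.request import Request

def recreate_vhost_requests(host: str, vhost: str, user: str) -> list[Request]:
    """Build the three RabbitMQ management API calls used to recreate a vhost:
    DELETE the corrupt vhost (discarding any queued messages), PUT a fresh
    vhost with the same name, and PUT the application user's permissions
    back onto it."""
    base = f"https://{host}/api"
    v = quote(vhost, safe="")  # the default vhost '/' becomes %2F
    perms = json.dumps({"configure": ".*", "write": ".*", "read": ".*"}).encode()
    return [
        Request(f"{base}/vhosts/{v}", method="DELETE"),
        Request(f"{base}/vhosts/{v}", method="PUT"),
        Request(f"{base}/permissions/{v}/{user}", data=perms, method="PUT",
                headers={"content-type": "application/json"}),
    ]

# Example, using the user named in the error log above:
# recreate_vhost_requests("<broker-host>", "/", "tisprodrabbit")
```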
Changed the CDC configuration to point to the new vhost.
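For reference, the vhost that the Maxwell CDC publishes to is part of its producer configuration in config.properties. A sketch of the relevant fragment, with illustrative placeholder values:

```properties
# Maxwell RabbitMQ producer settings (values are placeholders)
producer=rabbitmq
rabbitmq_host=<broker-host>
rabbitmq_user=tisprodrabbit
rabbitmq_pass=<secret>
rabbitmq_virtual_host=/
rabbitmq_exchange=<exchange-name>
```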
It was felt there was a risk of starting to push messages to the wrong place.
…
…
Decided to reset Maxwell to 06:25 that morning and replay, ensuring no changes went unprocessed, while accepting the smaller risk of creating duplicates of changes made between 06:25 and when the vhost went down (07:36 - when Amazon MQ started erroring?).
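Replaying from a known-good point works because Maxwell tracks its own MySQL binlog position and can be forced to restart from an earlier one, re-emitting every change recorded after it (hence the possibility of duplicates downstream). A sketch, assuming the binlog file and position corresponding to roughly 06:25 have been looked up; the values shown are placeholders:

```properties
# config.properties - init_position is flagged as dangerous in Maxwell's
# docs: it forces Maxwell to restart reading the binlog from an explicit
# FILE:POSITION[:HEARTBEAT], replaying all changes recorded after it.
init_position=mysql-bin.000123:4567:0
```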
…
…
Stopped and restarted the CDC (several times).
…
…
The problem was classed as resolved.
...
Timeline
07:26 - 406 Channel shutdown errors on #sentry-esr.
09:11 - began investigating the seriousness of these alerts.
12:35 - identified the vhost as the blocker to getting ESR up and running again.
12:58 - reviewed Confluence docs.
13:06 - fire-fire called.
13:09 - 2hr fire-fire call with Liban Hirey (Unlicensed) driving.
14:40 - standard schedule of ESR file processing resumed successfully.
Root Cause(s)
TBC.
...
Action Items
Action Items | Owner |
---|---|
Open a ticket with AWS to see if anything happened on their side | |
Check files in S3 | |
Check applicants and notifications created today and match them up against the transactions | |
More? | Who? |
...
Lessons Learned
Sentry alerts were the only detection we had.
Still not really enough knowledge / confidence in the wider team with the ESR services.
The whole team were, however, quick to muck in and try to logically work through the fire-fire, despite the absence of Joseph (Pepe) Kelly, Andy Dingley, and John Simmons (Deactivated) (and others).
Team found it difficult to determine the implications of the failure, and therefore the priority of the response.
We have two tech leads that flagrantly flout the rules of booked annual leave! Many thanks, guys.
Do we feel that this was an edge case one-off failure, or is it an indication that we need to proactively encourage the team to become more familiar with the ESR services?