Date	22 Mar 2021
Authors	Liban Hirey, Reuben Roberts, Marcello Fabbri
Status	Resolved - conducting RCA
Summary	Started receiving error messages in the #sentry-esr Slack channel
Impact	ESR stopped working; no files were processed from that point

Non-technical Description

A fatal failure in the ESR vhost caused the ESR processes to fail. (The vhost enables shared hosting of services; ESR, like TIS, is made up of many separate services.) Deleting and recreating the vhost, then reconfiguring and re-running everything, resolved the problem.

Trigger

The vhost ‘went down’: something corrupted it, so it could not restart properly.

tried restarting the broker.
restarted vhost.
between those two events, the CDC went down.
initial errors, above, were resolved, but a new set of errors was then raised: SocketException (4) and CannotCreateTransactionException (2), followed by AmqpIOException (1), another SocketException (1), and finally an IllegalStateException (1) and an AmqpRejectAndDontRequeueException (1).
decided to recreate the vhost by deleting the old one and creating a new one (the corrupted one couldn’t be renamed to “vhost_old” or similar, so it had to be deleted before a new one could be created with the original's name).
changed the CDC config to point to the new vhost
felt there was a risk we would start pushing messages to the wrong place
…
…
decided to reset Maxwell to 06:25 this morning and replay (ensuring we don't miss any changes, but introducing the smaller risk of duplicating changes made between 06:25 and when the vhost went down (07:36 - when AmazonMQ started erroring?))
…
…
stopped and restarted CDC (several times)
…
…
problem classed as resolved
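
The replay step above accepts a risk of duplicates in exchange for not missing changes. As a minimal sketch of how a consumer could guard against that, the Python below skips change events it has already seen. Everything in it is an assumption for illustration: the field names (`database`, `table`, `type`, `xid`, `data.id`) are only modelled on typical change-data-capture JSON, and `event_key` and `skip_already_processed` are hypothetical helpers, not part of the ESR services.

```python
from typing import Callable, Hashable, Iterable, Iterator, Set


def event_key(event: dict) -> Hashable:
    """Build an identity for a change event.

    The field names are assumptions for illustration; use whatever
    uniquely identifies a change in the real messages.
    """
    return (
        event["database"],
        event["table"],
        event["type"],
        event["xid"],
        event["data"]["id"],
    )


def skip_already_processed(
    events: Iterable[dict],
    seen: Set[Hashable],
    key: Callable[[dict], Hashable] = event_key,
) -> Iterator[dict]:
    """Yield only events whose key is not yet in `seen`, recording each one."""
    for event in events:
        k = key(event)
        if k in seen:
            # Already handled before the outage; the replay would duplicate it.
            continue
        seen.add(k)
        yield event
```

In this sketch `seen` is an in-memory set; for it to help across a restart such as the one in this incident, the keys would need to live in a durable store.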

22 Mar 2021: 07:26 - 406 Channel shutdown errors on #sentry-esr.
22 Mar 2021: 09:11 - began investigating seriousness of these alerts.
22 Mar 2021: 12:35 - identified vhost as the blocker to getting ESR up and running again.
22 Mar 2021: 12:58 - reviewed Confluence docs.
22 Mar 2021: 13:06 - fire-fire called.
22 Mar 2021: 13:09 - 2hr fire-fire call with Liban Hirey driving.
22 Mar 2021: 14:40 - standard schedule of ESR file processing resumed successfully.

Action Items	Owner
Open a ticket with AWS to see if anything happened on their side	Liban Hirey
Check files in S3	Marcello Fabbri
Check applicants and notifications created today and match them up against the transactions	Reuben Roberts
More?	Who?

Sentry alerts were the only detection we had.
There is still not enough knowledge of, or confidence with, the ESR services in the wider team.
The whole team were, however, quick to muck in and try to logically work through the fire-fire, despite the absence of Joseph (Pepe) Kelly, Andy Dingley and John Simmons (and others).
The team found it difficult to determine the implications of the failure, and therefore the priority of the response.
We have two tech leads who flagrantly flout the rules of booked annual leave! Many thanks, guys.
Do we feel that this was an edge case one-off failure, or is it an indication that we need to proactively encourage the team to become more familiar with the ESR services?