Date |
|
Authors | Liban Hirey, Reuben Roberts, Marcello Fabbri |
Status | Resolved - conducting RCA |
Summary | Started receiving error messages in the #sentry-esr Slack channel |
Impact | ESR stopped working; no files were processed from that point on |
Non-technical Description
A fatal failure in the ESR vhost (which enables shared hosting of services - ESR, like TIS, is made up of many separate services) caused ESR processes to fail. Deleting and recreating the vhost, then reconfiguring and re-running everything, resolved the problem.
Trigger
The vhost ‘went down’: something corrupted it, so it could not restart properly.
Detection
The #sentry-esr Slack channel received three “Channel shutdown: channel error; protocol method: #method<channel.close>(reply-code=406, reply-te...” error messages on Monday morning.
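For context, reply code 406 is PRECONDITION_FAILED in AMQP 0-9-1. The following is a minimal sketch, assuming alert text shaped like the truncated message above, of extracting the reply code so that alerts can be triaged by meaning rather than by raw number. The function name and the lookup table (a subset of the protocol's codes) are illustrative and not part of the ESR services.

```python
import re

# Subset of AMQP 0-9-1 reply codes; names follow the protocol specification.
AMQP_REPLY_CODES = {
    403: "ACCESS_REFUSED",
    404: "NOT_FOUND",
    405: "RESOURCE_LOCKED",
    406: "PRECONDITION_FAILED",
}

_REPLY_CODE = re.compile(r"reply-code=(\d{3})")


def describe_reply_code(message):
    """Return (code, name) for the first reply-code in message, or None."""
    match = _REPLY_CODE.search(message)
    if match is None:
        return None
    code = int(match.group(1))
    # Codes outside the subset above are reported as UNKNOWN, not guessed.
    return code, AMQP_REPLY_CODES.get(code, "UNKNOWN")


if __name__ == "__main__":
    alert = (
        "Channel shutdown: channel error; protocol method: "
        "#method<channel.close>(reply-code=406, reply-te..."
    )
    print(describe_reply_code(alert))
```

The truncated reply text is left as it appeared in Slack; only the numeric code is parsed.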
Resolution
tried restarting the broker.
restarted vhost.
between those two events, the CDC went down.
initial errors (above) were resolved, but a new set of errors then appeared: SocketException (4) and CannotCreateTransactionException (2), followed by AmqpIOException (1), another SocketException (1), and finally an IllegalStateException (1) and an AmqpRejectAndDontRequeueException (1).
decided to recreate the vhost by deleting the old one and creating a new one (the corrupted vhost couldn’t be renamed to “vhost_old” or similar, so it had to be deleted in order to create a new one with the same name as the original).
changed the CDC config to point to the new vhost
felt there was a risk that we would start pushing messages to the wrong place
…
…
decided to reset Maxwell to 06:25 this morning and replay (ensuring we don't miss processing changes, but introducing the smaller risk of duplicating changes made between 06:25 and when the vhost went down (07:36 - when AmazonMQ started erroring?)).
…
…
stopped and restarted CDC (several times)
…
…
problem classed as resolved
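The Maxwell replay trades the risk of missed changes for the risk of duplicates. One way to contain that risk is to make downstream handling idempotent. The following is a minimal sketch, assuming each replayed change event is a dict carrying Maxwell's usual `database`, `table`, `type`, `xid` and `data` fields, and assuming rows have an `id` primary-key column. The column name and both function names are hypothetical; the actual ESR consumers are not described in this report.

```python
def event_key(event):
    """Build an identity for a change event (assumed fields, see above)."""
    return (
        event["database"],
        event["table"],
        event["type"],
        event.get("xid"),
        event["data"].get("id"),  # assumed primary-key column
    )


def drop_replayed_duplicates(events, already_seen=None):
    """Yield events whose key has not been seen, keeping first-seen order."""
    seen = set(already_seen or ())
    for event in events:
        key = event_key(event)
        if key in seen:
            continue  # same change delivered again by the replay
        seen.add(key)
        yield event


if __name__ == "__main__":
    # Hypothetical events, for illustration only.
    first = {"database": "db", "table": "t", "type": "insert",
             "xid": 1, "data": {"id": 10}}
    replayed = dict(first)
    print(len(list(drop_replayed_duplicates([first, replayed]))))
```

A row changed more than once within a single transaction would collide under this key, so a real key would probably need more fields than the ones assumed here.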
Timeline
: 07:26 - 406 Channel shutdown errors on #sentry-esr.
: 09:11 - began investigating seriousness of these alerts.
: 12:35 - identified vhost as the blocker to getting ESR up and running again.
: 12:58 - reviewed Confluence docs.
: 13:06 - fire-fire called.
: 13:09 - 2hr fire-fire call with Liban Hirey driving.
: 14:40 - standard schedule of ESR file processing resumed successfully.
Root Cause(s)
TBC.
Action Items
Action Items | Owner |
---|---|
Open a ticket with AWS to see if anything happened on their side | |
Check files in S3 | |
Check applicants and notifications created today and match them up against the transactions | |
More? | Who? |
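The matching-up action above amounts to a reconciliation between two lists of identifiers. The following is a rough sketch, assuming the day's transactions and the applicants/notifications created from them can each be exported as a list of shared IDs. The function name and ID format are hypothetical, and how those lists are obtained is left open.

```python
from collections import Counter


def reconcile(transaction_ids, processed_ids):
    """Compare expected transaction IDs with the IDs actually processed."""
    expected = set(transaction_ids)
    counts = Counter(processed_ids)
    processed = set(counts)
    return {
        # Transactions with no matching applicant/notification.
        "missing": sorted(expected - processed),
        # Records that match no known transaction.
        "unexpected": sorted(processed - expected),
        # Records created more than once, e.g. by the replay.
        "duplicated": sorted(i for i, n in counts.items() if n > 1),
    }


if __name__ == "__main__":
    # Hypothetical IDs, for illustration only.
    print(reconcile(["t1", "t2", "t3"], ["t1", "t1", "t4"]))
```

The `duplicated` bucket is included because the replay from 06:25 could have processed some changes twice.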
Lessons Learned
Sentry alerts were the only detection we had.
Still not really enough knowledge / confidence in the wider team with the ESR services.
The whole team were, however, quick to muck in and try to logically work through the fire-fire, despite the absence of Joseph (Pepe) Kelly, Andy Dingley and John Simmons (and others).
Team found it difficult to determine the implications of the failure, and therefore the priority of the response.
We have two tech leads that flagrantly flout the rules of booked annual leave! Many thanks, guys.
Do we feel that this was an edge case one-off failure, or is it an indication that we need to proactively encourage the team to become more familiar with the ESR services?