Date |
|
Authors | Liban Hirey (Unlicensed) Reuben Roberts Marcello Fabbri (Unlicensed) |
Status | Resolved - conducting RCA |
Summary | Started receiving error messages in the #sentry-esr Slack channel |
Impact | ESR stopped working, no files would be processed from that point |
...
The ESR RabbitMQ default virtual host ('/') became inaccessible, so data-change messages could not be published or consumed. something Something happened that corrupted it, so on restart our services couldn’t reconnect properly.
...
#sentry-esr Slack channel received 3 “Channel shutdown: channel error; protocol method: #method<channel.close>(reply-code=406, reply-te...“ error messages on Monday morning.
...
Resolution
The VNC server used to access the RabbitMQ user interface was not responding therefore we had the created a new instance.
Restarting the Amazon MQ broker ('TIS-Prod_RabbitMQ') did not resolve the error. The RabbitMQ management panel displayed the message ‘Virtual host / experienced an error on node rabbit@localhost and may be inaccessible’.
It was not possible to successfully restart the RabbitMQ virtual host.
At this point, the ESR-Data_exporter component failed, with Maxwell CDC errors caused by failure to connect to the RabbitMQ instance (e.g.
2021-03-22 12:59:04 WARN An unexpected connection driver error occured (Exception message: Socket closed)
….Caused by: com.rabbitmq.client.ShutdownSignalException: connection error; protocol method: #method<connection.close>(reply-code=541, reply-text=INTERNAL_ERROR - access to vhost '/' refused for user 'tisprodrabbit': vhost '/' is down, class-id=10, method-id=40)
).It was decided to recreate the RabbitMQ virtual host, by deleting the old vhost and recreating it. It was not possible to rename the corrupted vhost, and creating a new vhost with a different name would have meant that a number of components would need to be redeployed with revised configurations, otherwise messages could be published to, or consumed from, the incorrect vhost queue. As such, it was deemed least disruptive to simply delete the corrupt vhost (which would delete any queued messages) in order to create a new one with the matching name of the original.
At this point, the message queueing recovered, and the TIS ESR components resumed normal operation. However, TIS data changes between the initial error (7:36AM) and this point (roughly 2:00PM) would not have been processed into files for uptake by ESR.
It was decided to ‘rewind’ Maxwell, the component which monitors database-level changes within TCS to publish CDC messages to RabbitMQ. This meant finding the statement sequence(?) timestampvalue from the MySQL binary data logs from at or just before the error occurred, and resetting the Maxwell ‘maxwell.positions’ configuration value to this point. This change must be made with the ESR-Data_exporter component temporarily stopped (?). Maxwell was hence replayed from 6:25AM, with an understanding that this would result in some duplication of messaging for changes occurring between 6:25AM and 7:36AM (when the original error was triggered), and intermittently between 2:00PM and 2:30PM during which time some message processing might occur.
No errors were noted for the Maxwell rewind and replay, and cursory examination of logs and MongoDB ESR Audit records seemed to indicate everything was operating correctly.
Follow-up action items were scoped to fully confirm that all TCS data changes for the day were correctly written to ESR export files.
Problem classed as resolved.
[ Liban Hirey (Unlicensed) I know there was a bit of work to redeploy(?) the container since SSH wasn’t working properly - I’m not sure if that was relevant to the fire-fire issue, or if it warrants another write-up?]
changed the CDC config to point to the new vhost
…
…
decided to reset Maxwell to this 06:25 this morning and replay (ensuring we don't miss processing changes, but introducing the smaller risk of creating duplicates created between 06.25 and when the vhost went down (07.36 - when AmazonMQ started erroring?)
…
…
stopped and restarted CDC (several times)
…
…
problem classed as resolved
...
Timeline
: 07:26 - 406 Channel shutdown errors on #sentry-esr.
: 09:11 - began investigating seriousness of these alerts.
: 12:35 - identified RabbitMQ virtual host as the blocker to getting ESR up and running again.
: 12:58 - reviewed Confluence docs.
: 13:06 - fire-fire called.
: 13:09 - 2hr fire-fire call with Liban Hirey (Unlicensed) driving.
: 14:40 - standard schedule of ESR file processing resumed successfully.
Root Cause(s)
...
The RabbitMQ vhost became corrupted and was unable to be restarted.
We have opened a support ticket with AWS and are currently awaiting a response from them to verify if anything happened on their side.
...
Action Items
Action Items | Owner |
---|---|
Open a ticket with AWS to see if anything happened on their side | |
Check files in S3 | |
Check applicants and notifications created today and match them up against the transactions | |
JFDI a cluster | |
Move to a service managed by AMQP/Rabbit specialists (https://www.cloudamqp.com/)? |
...
Sentry alerts were the only detection we had.
Still not really enough knowledge / confidence in the wider team with the ESR services.
The whole team were, however, quick to muck in and try to logically work through the fire-fire, despite the absence of Joseph (Pepe) Kelly Andy Dingley and John Simmons (Deactivated) (and others).
Team found it difficult to determine the implications of the failure, and therefore the priority of the response.
We have two tech leads that flagrantly flout the rules of booked annual leave! Many thanks, guys.
Do we feel that this was an edge case one-off failure, or is it an indication that we need to proactively encourage the team to become more familiar with the ESR services?