Date | |
Authors | Liban Hirey (Unlicensed), Reuben Roberts, Marcello Fabbri (Unlicensed) |
Status | Resolved - conducting RCA |
Summary | Error messages started arriving in the #sentry-esr Slack channel |
Impact | The ESR integration stopped working; no files were processed from that point onwards |
...
The ESR RabbitMQ default virtual host ('/') became inaccessible, so data-change messages could not be published or consumed. The vhost appears to have been corrupted, so when our services restarted they could not reconnect properly.
...
The VNC server used to access the RabbitMQ user interface was not responding, so a new instance was created.
Restarting the Amazon MQ broker ('TIS-Prod_RabbitMQ') did not resolve the error. The RabbitMQ management panel displayed the message ‘Virtual host / experienced an error on node rabbit@localhost and may be inaccessible’.
It was not possible to successfully restart the RabbitMQ virtual host.
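For future reference, the vhost's health can be checked programmatically through the RabbitMQ management HTTP API (the same API that backs the management panel) rather than only through the UI. A minimal sketch, assuming a management-enabled Amazon MQ endpoint and admin credentials (both placeholders here):

```python
# Minimal vhost health check against the RabbitMQ management HTTP API.
# The endpoint, credentials and user name below are placeholders, not the real ones.
import requests
from urllib.parse import quote

BROKER_API = "https://<broker-id>.mq.<region>.amazonaws.com/api"  # placeholder endpoint
AUTH = ("tisprodrabbit", "<password>")                            # placeholder credentials

# List all vhosts and whatever state information the broker reports for them.
for vhost in requests.get(f"{BROKER_API}/vhosts", auth=AUTH, timeout=10).json():
    print(vhost["name"], vhost.get("cluster_state", "n/a"))

# Aliveness test: declares a test queue on the vhost, then publishes and consumes a message.
encoded = quote("/", safe="")  # the default vhost '/' must be URL-encoded as %2F
resp = requests.get(f"{BROKER_API}/aliveness-test/{encoded}", auth=AUTH, timeout=10)
print(resp.status_code, resp.text)  # anything other than {"status": "ok"} points at a broken vhost
```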
At this point, the ESR-Data_exporter component failed, with Maxwell CDC errors caused by failure to connect to the RabbitMQ instance (e.g.
2021-03-22 12:59:04 WARN An unexpected connection driver error occured (Exception message: Socket closed)
….Caused by: com.rabbitmq.client.ShutdownSignalException: connection error; protocol method: #method<connection.close>(reply-code=541, reply-text=INTERNAL_ERROR - access to vhost '/' refused for user 'tisprodrabbit': vhost '/' is down, class-id=10, method-id=40)
). It was decided to recreate the RabbitMQ virtual host by deleting the old vhost and creating a new one. It was not possible to rename the corrupted vhost, and creating a new vhost with a different name would have required a number of components to be redeployed with revised configuration, otherwise messages could be published to, or consumed from, the wrong vhost's queues. It was therefore deemed least disruptive to delete the corrupt vhost (which would delete any queued messages) and create a new one with the same name as the original.
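For the record, the delete-and-recreate step is roughly equivalent to the following management HTTP API calls (a sketch only; the endpoint, credentials and permission pattern are assumptions):

```python
# Sketch of deleting and recreating the default vhost via the management HTTP API.
# Deleting a vhost removes its queues, exchanges, bindings and any queued messages.
# Endpoint, credentials and the permission pattern are placeholders/assumptions.
import requests

BROKER_API = "https://<broker-id>.mq.<region>.amazonaws.com/api"  # placeholder endpoint
AUTH = ("tisprodrabbit", "<password>")                            # placeholder credentials
VHOST = "%2F"  # the default vhost '/', URL-encoded

# Drop the corrupted vhost, then create a fresh one with the same name.
requests.delete(f"{BROKER_API}/vhosts/{VHOST}", auth=AUTH, timeout=30).raise_for_status()
requests.put(f"{BROKER_API}/vhosts/{VHOST}", json={}, auth=AUTH, timeout=30).raise_for_status()

# Re-grant the application user full permissions on the recreated vhost.
perms = {"configure": ".*", "write": ".*", "read": ".*"}
requests.put(f"{BROKER_API}/permissions/{VHOST}/tisprodrabbit",
             json=perms, auth=AUTH, timeout=30).raise_for_status()
```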
At this point, the message queueing recovered, and the TIS ESR components resumed normal operation. However, TIS data changes between the initial error (7:36AM) and this point (roughly 2:00PM) would not have been processed into files for uptake by ESR.
It was decided to 'rewind' Maxwell, the component which monitors database-level changes within TCS and publishes CDC messages to RabbitMQ. This meant finding the timestamp value in the MySQL binary logs at or just before the point the error occurred, and resetting the Maxwell 'maxwell.positions' value to that point. This change needs to be made with the ESR-Data_exporter component temporarily stopped (to be confirmed). Maxwell was then replayed from 6:25AM, with the understanding that this would result in some duplicated messages for changes occurring between 6:25AM and 7:36AM (when the original error was triggered), and intermittently between 2:00PM and 2:30PM, during which time some message processing might have occurred. See the sketch below for what the rewind amounts to.
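As an illustration of the rewind (to be verified against the Maxwell documentation), the essence is: stop the consumer, look up the binlog coordinates in effect at the chosen time, point Maxwell's stored position back at them, and restart. A minimal sketch, assuming Maxwell's default metadata schema/client_id, with placeholder connection details and coordinates:

```python
# Sketch of rewinding Maxwell by resetting its stored binlog position.
# Assumes the default Maxwell metadata schema ('maxwell') and client_id ('maxwell');
# host, credentials and the binlog coordinates are placeholders. Maxwell (and the
# downstream ESR-Data_exporter) should be stopped while this is applied.
import pymysql  # assumed MySQL client library

# Binlog coordinates at or just before the chosen replay point (~06:25AM), found by
# inspecting the binary logs (e.g. mysqlbinlog with --start/--stop-datetime) - placeholders.
binlog_file, binlog_position = "mysql-bin-changelog.123456", 4

conn = pymysql.connect(host="<tcs-db-host>", user="<admin>", password="<password>")
try:
    with conn.cursor() as cur:
        cur.execute(
            "UPDATE maxwell.positions "
            "SET binlog_file = %s, binlog_position = %s "
            "WHERE client_id = %s",
            (binlog_file, binlog_position, "maxwell"),
        )
    conn.commit()
finally:
    conn.close()
```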
No errors were noted during the Maxwell rewind and replay, and a cursory examination of the logs and MongoDB ESR Audit records indicated that everything was operating correctly.
Follow-up action items were scoped to fully confirm that all TCS data changes for the day were correctly written to ESR export files.
Problem classed as resolved.
...
Action Items | Owner |
---|---|
Open a ticket with AWS to see if anything happened on their side | |
Check files in S3 | |
Check applicants and notifications created today and match them up against the transactions | |
JFDI a cluster | Joseph (Pepe) Kelly DONE |
Move to a service managed by AMQP/Rabbit specialists (https://www.cloudamqp.com/)? | |
Improve monitoring of the managed RabbitMQ broker | |
...
Lessons Learned
Sentry alerts were the only detection we had.
There is still not enough knowledge/confidence across the wider team with the ESR services.
The whole team was, however, quick to muck in and work logically through the firefighting, despite the absence of Joseph (Pepe) Kelly, Andy Dingley and John Simmons (Deactivated) (and others).
The team found it difficult to determine the implications of the failure, and therefore the priority of the response.
We have two tech leads who flagrantly flout the rules of booked annual leave! Many thanks, guys.
Do we feel that this was an edge case one-off failure, or is it an indication that we need to proactively encourage the team to become more familiar with the ESR services?