2021-03-22 ESR message queue failure
Date | Mar 22, 2021 |
Authors | @Liban Hirey (Unlicensed) @Reuben Roberts @Marcello Fabbri (Unlicensed) |
Status | Done |
Summary | Started receiving error messages in the #sentry-esr Slack channel |
Impact | ESR stopped working, no files would be processed from that point |
Non-technical Description
Changes to TIS data that need to be relayed to ESR are published to a message queue to generate files for ESR to process. A fatal failure in the message-queueing component caused this to fail.
Deleting and recreating the message-queue configuration and then recreating the lost data-change messages by reprocessing the TIS data changes from just before the error occurred resolved the problem.
Trigger
The ESR RabbitMQ default virtual host ('/') became inaccessible, so data-change messages could not be published or consumed. Something corrupted the vhost, so on restart our services could not reconnect to it.
Detection
The #sentry-esr Slack channel received three "Channel shutdown: channel error; protocol method: #method<channel.close>(reply-code=406, reply-te..." error messages on Monday morning.
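For context on the truncated alerts: in AMQP 0-9-1, reply-code 406 is PRECONDITION_FAILED (a channel-level error), while the 541 INTERNAL_ERROR seen later in the incident is a connection-level error. A small illustrative helper (the code/name mapping is from the AMQP spec; the function itself is just a sketch):

```shell
# Map the AMQP 0-9-1 reply codes seen in this incident to their spec names.
classify_amqp() {
  case "$1" in
    406) echo "406 PRECONDITION_FAILED (channel-level error)";;
    541) echo "541 INTERNAL_ERROR (connection-level error)";;
    320) echo "320 CONNECTION_FORCED (connection-level error)";;
    *)   echo "$1 UNKNOWN";;
  esac
}

classify_amqp 406   # -> 406 PRECONDITION_FAILED (channel-level error)
classify_amqp 541   # -> 541 INTERNAL_ERROR (connection-level error)
```

A channel-level 406 alone closes only the channel; the later 541 indicated the whole vhost was down, which is why the alerts escalated from Sentry noise to a full ESR outage.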
Resolution
The VNC server used to access the RabbitMQ user interface was not responding, so a new instance was created.
Restarting the Amazon MQ broker ('TIS-Prod_RabbitMQ') did not resolve the error. The RabbitMQ management panel displayed the message ‘Virtual host / experienced an error on node rabbit@localhost and may be inaccessible’.
It was not possible to successfully restart the RabbitMQ virtual host.
At this point, the ESR-Data_exporter component failed with Maxwell CDC errors caused by failure to connect to the RabbitMQ instance, e.g.:
2021-03-22 12:59:04 WARN An unexpected connection driver error occured (Exception message: Socket closed)
….Caused by: com.rabbitmq.client.ShutdownSignalException: connection error; protocol method: #method<connection.close>(reply-code=541, reply-text=INTERNAL_ERROR - access to vhost '/' refused for user 'tisprodrabbit': vhost '/' is down, class-id=10, method-id=40)
It was decided to delete the corrupted RabbitMQ virtual host and recreate it. Renaming the corrupted vhost was not possible, and creating a new vhost under a different name would have required redeploying a number of components with revised configuration; otherwise messages could be published to, or consumed from, the incorrect vhost queue. It was therefore deemed least disruptive to delete the corrupt vhost (which would delete any queued messages) and create a new one with the same name as the original.
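Amazon MQ does not provide shell access for rabbitmqctl, but the delete/recreate can be done through the RabbitMQ management HTTP API. A minimal sketch of that recovery, assuming a hypothetical broker endpoint and credentials (note the default vhost name '/' is URL-encoded as %2f in the API paths):

```shell
# Hypothetical endpoint/credentials for the Amazon MQ broker's management API.
RABBIT='https://tis-prod-rabbitmq.example.amazonaws.com'
AUTH='tisprodrabbit:CHANGE_ME'

# Delete the corrupted default vhost. This discards any queued messages.
curl -u "$AUTH" -X DELETE "$RABBIT/api/vhosts/%2f"

# Recreate the vhost under the same name, so no client configuration changes.
curl -u "$AUTH" -X PUT "$RABBIT/api/vhosts/%2f"

# Re-grant the application user full permissions on the recreated vhost.
curl -u "$AUTH" -X PUT "$RABBIT/api/permissions/%2f/tisprodrabbit" \
  -H 'Content-Type: application/json' \
  -d '{"configure":".*","write":".*","read":".*"}'
```

Reusing the original vhost name is what allowed the ESR components to reconnect without any redeployment, at the cost of losing whatever was still queued.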
At this point, the message queueing recovered, and the TIS ESR components resumed normal operation. However, TIS data changes between the initial error (7:36AM) and this point (roughly 2:00PM) would not have been processed into files for uptake by ESR.
It was decided to ‘rewind’ Maxwell, the component which monitors database-level changes within TCS and publishes CDC messages to RabbitMQ. This meant finding the position in the MySQL binary logs at, or just before, the time the error occurred, and resetting Maxwell’s stored ‘maxwell.positions’ value to that point (a change that appears to require the ESR-Data_exporter component to be temporarily stopped). Maxwell was then replayed from 6:25AM, with the understanding that this would duplicate some messages for changes occurring between 6:25AM and 7:36AM (when the original error was triggered), and intermittently between 2:00PM and 2:30PM, during which time some message processing might have occurred.
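A sketch of what the rewind involves, under stated assumptions: the `positions` table and its `binlog_file`/`binlog_position`/`client_id` columns follow Maxwell’s standard schema, while the host name, binlog file name, offset, and client_id below are placeholders, not values from this incident:

```shell
# 1. Locate the binlog file/offset in effect just before the error (~06:25),
#    e.g. by scanning a binary log's event timestamps on the DB server:
mysqlbinlog --start-datetime='2021-03-22 06:20:00' \
            --stop-datetime='2021-03-22 06:25:00' \
            mysql-bin.000123 | grep -m1 'end_log_pos'

# 2. With ESR-Data_exporter (and Maxwell) stopped, point Maxwell's stored
#    position back to that file/offset in its schema database:
mysql -h tcs-db-host -u maxwell -p maxwell -e "
  UPDATE positions
     SET binlog_file = 'mysql-bin.000123', binlog_position = 4567890
   WHERE client_id = 'maxwell';"

# 3. Restart Maxwell: events from that position onward are re-published,
#    so downstream consumers must tolerate duplicate messages.
```

Replaying from before the failure trades duplicate messages for completeness, which is why the team accepted duplication in the 6:25AM–7:36AM window.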
No errors were noted for the Maxwell rewind and replay, and cursory examination of logs and MongoDB ESR Audit records seemed to indicate everything was operating correctly.
Follow-up action items were scoped to fully confirm that all TCS data changes for the day were correctly written to ESR export files.
Problem classed as resolved.
Timeline
Mar 22, 2021: 07:26 - 406 Channel shutdown errors on #sentry-esr.
Mar 22, 2021: 09:11 - began investigating the seriousness of these alerts.
Mar 22, 2021: 12:35 - identified RabbitMQ virtual host as the blocker to getting ESR up and running again.
Mar 22, 2021: 12:58 - reviewed Confluence docs.
Mar 22, 2021: 13:06 - fire-fire called.
Mar 22, 2021: 13:09 - 2hr fire-fire call with @Liban Hirey (Unlicensed) driving.
Mar 22, 2021: 14:40 - standard schedule of ESR file processing resumed successfully.
Root Cause(s)
The RabbitMQ vhost became corrupted and was unable to be restarted.
We have opened a support ticket with AWS and are currently awaiting a response from them to verify if anything happened on their side.
Action Items
Action Items | Owner |
---|---|
Open a ticket with AWS to see if anything happened on their side | @Liban Hirey (Unlicensed) |
Check files in S3 | @Marcello Fabbri (Unlicensed) |
Check applicants and notifications created today and match them up against the transactions | @Reuben Roberts |
JFDI a cluster | @Joseph (Pepe) Kelly (DONE) |
Move to a service managed by AMQP/Rabbit specialists (https://www.cloudamqp.com/)? | |
Improve monitoring of managed Rabbit broker | |
Lessons Learned
Sentry alerts were the only detection we had.
There is still not enough knowledge of, or confidence with, the ESR services across the wider team.
The whole team were, however, quick to muck in and try to logically work through the fire-fire, despite the absence of @Joseph (Pepe) Kelly @Andy Dingley and @John Simmons (Deactivated) (and others).
Team found it difficult to determine the implications of the failure, and therefore the priority of the response.
We have two tech leads that flagrantly flout the rules of booked annual leave! Many thanks, guys.
Do we feel that this was an edge case one-off failure, or is it an indication that we need to proactively encourage the team to become more familiar with the ESR services?
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213