2021-05-28 Rabbit authentication error
Date | May 28, 2021 |
Authors | @Marcello Fabbri (Unlicensed) @Liban Hirey (Unlicensed) |
Status | Done |
Summary | TCS service down due to a RabbitMQ config error |
Impact | TCS down |
Non-technical Description
The TCS service fell over due to an authentication error on RabbitMQ, which was caused by an incorrect configuration value.
Trigger
A typing error when saving the password of the Reval RabbitMQ user in our parameter store.
Detection
TCS Health Check failure notification sent to #monitoring-prod on Slack.
Resolution
Updated the value of the Reval RabbitMQ user’s password in the parameter store and redeployed TCS.
Timeline
May 28, 2021: 14:16 BST - First AuthenticationFailureException thrown.
May 28, 2021: 14:18 BST - Notification of TCS Health Check failure on Slack (#monitoring-prod).
May 28, 2021: 14:18 BST - Users start flagging the problem on Teams.
May 28, 2021: 14:24 BST - Issue identified as a Rabbit authentication error.
May 28, 2021: 14:30 BST - Typo in password rectified and TCS redeployed.
May 28, 2021: 14:30 BST - TCS stable again.
Root Cause(s)
Incorrect password set for the Reval RabbitMQ user in the parameter store.
Action Items
Action Items | Owner |
---|---|
n/a | |
Lessons Learned
Double-check config values when entering them in the parameter store or anywhere else.
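One way to make that double check mechanical is to have the operator type the secret twice and compare the two entries before anything is saved. The snippet below is only a minimal sketch of that idea: `validate_secret_entry` is a hypothetical helper, not part of TCS or of any existing tooling, and it does not talk to the parameter store or to RabbitMQ.

```python
import hmac
from typing import List


def validate_secret_entry(first_entry: str, second_entry: str) -> List[str]:
    """Return the problems found with a secret that was typed twice.

    Hypothetical helper for illustration only. An empty list means the
    two entries agree and the value looks sane enough to save.
    """
    problems = []
    # Compare both entries; compare_digest avoids leaking where they differ.
    if not hmac.compare_digest(first_entry.encode(), second_entry.encode()):
        problems.append("entries do not match")
    # Stray whitespace is a common copy-and-paste mistake.
    if first_entry != first_entry.strip():
        problems.append("leading or trailing whitespace")
    if not first_entry:
        problems.append("empty value")
    return problems
```

A caller would save the value only when the returned list is empty, for example after reading both entries with `getpass.getpass()`. A check like this catches a typo made in one of the two entries, but not a wrong value entered consistently, so verifying the credential against the broker in a non-production environment would still be worthwhile.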
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213