2021-05-28 Rabbit authentication error

Date

May 28, 2021

Authors

@Marcello Fabbri @Liban Hirey

Status

Done

Summary

The TCS service went down due to a RabbitMQ configuration error.

Impact

TCS was down for roughly 14 minutes (14:16 to 14:30 BST).

Non-technical Description

  • The TCS service fell over due to an authentication error on RabbitMQ, which was caused by an incorrect configuration value.


Trigger

  • A typing error made when saving the password of the Reval RabbitMQ user in our parameter store.


Detection

  • Notification sent to #monitoring-prod.


Resolution

  • Corrected the Reval RabbitMQ user’s password in the parameter store and redeployed TCS.


Timeline

  • May 28, 2021: 14:16 BST - First AuthenticationFailureException thrown.

  • May 28, 2021: 14:18 BST - Notification of TCS Health Check failure on Slack (#monitoring-prod).

  • May 28, 2021: 14:18 BST - Users start flagging the problem on Teams.

  • May 28, 2021: 14:24 BST - Issue identified as a Rabbit authentication error.

  • May 28, 2021: 14:30 BST - Typo in password rectified and TCS redeployed.

  • May 28, 2021: 14:30 BST - TCS stable again.

Root Cause(s)

  • An incorrect password was set for the Reval RabbitMQ user in the parameter store.


Action Items

  • n/a


Lessons Learned

  • Double-check the config values being entered in the parameter store or anywhere else.
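
One way to make that double check routine is to automate it. The following is a minimal sketch, not the tooling used in this incident: it assumes the new value is typed twice and then tried against the broker before being stored. The names `save_verified_secret`, `can_authenticate`, `save`, and `ConfigMismatchError` are all hypothetical; the two callables stand in for whatever RabbitMQ client and parameter store API are actually in use.

```python
import hmac
from typing import Callable


class ConfigMismatchError(ValueError):
    """Raised when the two typed entries of a secret do not match."""


def save_verified_secret(
    name: str,
    first_entry: str,
    second_entry: str,
    can_authenticate: Callable[[str], bool],  # hypothetical: tries a broker login
    save: Callable[[str, str], None],         # hypothetical: writes to the store
) -> None:
    """Store a secret only if it was typed identically twice and it works."""
    # Guard 1: double entry catches most typing errors.
    if not hmac.compare_digest(first_entry.encode(), second_entry.encode()):
        raise ConfigMismatchError(f"entries for {name!r} do not match")

    # Guard 2: try the credential against the broker before storing it.
    if not can_authenticate(first_entry):
        raise PermissionError(f"broker rejected the new value for {name!r}")

    # Only a value that passed both guards reaches the parameter store.
    save(name, first_entry)
```

In practice, `can_authenticate` might open and close a short-lived connection as the Reval user, and `save` might wrap the parameter store's write call. Keeping both as injected callables lets the guard logic be tested without a live broker.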