Date

Authors

Marcello Fabbri (Unlicensed) Liban Hirey (Unlicensed)

Status

Done

Summary

TCS service down due to a RabbitMQ configuration error

Impact

TCS down

Non-technical Description

  • The TCS service went down because of a RabbitMQ authentication error, which was caused by an incorrect configuration value.


Trigger

  • A typing error made when saving the password of the Reval RabbitMQ user in our parameter store.


Detection

  • Notification sent to #monitoring-prod.


Resolution

  • Updated the value of the Reval RabbitMQ user’s password in the parameter store.


Timeline

  • 14:16 BST - First AuthenticationFailureException thrown.

  • 14:18 BST - Notification of TCS Health Check failure on Slack (#monitoring-prod).

  • 14:18 BST - Users start flagging the problem on Teams.

  • 14:24 BST - Issue identified as a Rabbit authentication error.

  • 14:30 BST - Typo in password rectified and TCS redeployed.

  • 14:30 BST - TCS stable again.

Root Cause(s)

  • Incorrect password set for the Reval RabbitMQ user in the parameter store.
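
A mistyped password like this could be caught before a deploy by checking the stored value against the broker. The sketch below is illustrative only and not something that was in place at the time. It assumes the RabbitMQ management plugin is enabled and that the user carries the management tag (otherwise the API rejects even a correct password). The function names and the example URL are hypothetical.

```python
import base64
import urllib.error
import urllib.request


def build_whoami_request(base_url, username, password):
    """Build a GET request for the management API's /api/whoami endpoint."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    request = urllib.request.Request(base_url.rstrip("/") + "/api/whoami")
    request.add_header("Authorization", "Basic " + token)
    return request


def credentials_valid(base_url, username, password,
                      opener=urllib.request.urlopen):
    """Return True if the broker accepts the credentials, False on HTTP 401.

    Assumption: the user has the management tag. `opener` is injectable
    so the check can be exercised without a live broker.
    """
    request = build_whoami_request(base_url, username, password)
    try:
        with opener(request, timeout=5) as response:
            return response.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 401:
            return False
        raise  # any other HTTP error is unexpected, so surface it


if __name__ == "__main__":
    # Hypothetical host and user; the password would be read from the
    # parameter store rather than typed here.
    ok = credentials_valid("http://rabbitmq.example.internal:15672",
                           "reval", "value-from-parameter-store")
    print("credentials accepted" if ok else "credentials rejected")
```

Run as a step after updating the parameter and before redeploying, a rejected credential would fail fast instead of surfacing as an AuthenticationFailureException in the running service.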


Action Items

Action Items

Owner

n/a


Lessons Learned
