Resolved

Date

Authors

Marcello Fabbri (Unlicensed) Liban Hirey (Unlicensed)

Status

Done

Summary

TCS service down due to a RabbitMQ configuration error

Impact

TCS down

Non-technical Description

  • The TCS service went down because of a RabbitMQ authentication error, which was caused by an incorrect configuration value.

...

Trigger

  • A typing error made when saving the password of the Reval RabbitMQ user in our parameter store.

...

Detection

  • Notification sent to #monitoring-prod.

...

Resolution

  • Corrected the Reval RabbitMQ user’s password in the parameter store.

...

Timeline

  • 14:16 BST - First AuthenticationFailureException thrown.

  • 14:18 BST - Notification of TCS Health Check failure on Slack (#monitoring-prod).

  • 14:18 BST - Users start flagging the problem on Teams.

  • 14:24 BST - Issue identified as a RabbitMQ authentication error.

  • 14:30 BST - Typo in password rectified and TCS redeployed.

  • 14:30 BST - TCS stable again.

Root Cause(s)

  • Incorrect password set for the Reval RabbitMQ user in the parameter store.

...

Action Items

Action Items

Owner

n/a

...

Lessons Learned

  • Double-check every config value before saving it, whether in the parameter store or anywhere else.
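
One way to make that double check less dependent on careful typing is to test a credential before it is stored. The sketch below is illustrative only and is not what the team ran: `save_if_login_succeeds`, `try_login` and `save` are hypothetical names, and the two callables stand in for "attempt a RabbitMQ login with this password" and "write the value to the parameter store", neither of which is implemented here.

```python
from typing import Callable


def save_if_login_succeeds(
    password: str,
    try_login: Callable[[str], bool],
    save: Callable[[str], None],
) -> bool:
    """Store a password only after a test login with it succeeds.

    try_login and save are hypothetical hooks supplied by the caller.
    Returns True if the value was saved, False if it was rejected.
    """
    # Reject empty values and stray leading/trailing whitespace,
    # a common side effect of copy-paste.
    if not password or password != password.strip():
        return False

    # Reject any value the broker would not accept, so a typo is
    # caught before it reaches the stored configuration.
    if not try_login(password):
        return False

    save(password)
    return True
```

With stub callables, a mistyped value is rejected and nothing is written, while the correct value is saved. In real use `try_login` would open a connection as the Reval user and `save` would update the stored parameter; both are left as assumptions here.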