2021-04-06 Profile service failure took down TIS

Date

Apr 6, 2021

Authors

@Andy Dingley @Marcello Fabbri

Status

Done

Summary

The TIS Profile service failed to start after a breaking change was deployed, which made TIS unusable.

Impact

TIS could not be used at all for roughly 14 minutes (14:03 to 14:17), until the breaking change was reverted.

Non-technical Description

Our “Profile” service, which is used to check user permissions, went down because a breaking change was deployed. As user permissions could not be checked, the TIS application blocked all user actions. To a user, this would have looked like a login failure.


Trigger

  • A change was deployed to tis-profile which caused the service to fail to start


Detection

  • Slack notification


Resolution

  • Reverted the breaking change


Timeline

  • Apr 6, 2021: 14:03 - Breaking change deployed to production.

  • Apr 6, 2021: 14:05 - Notification about stage sent to Slack channel #monitoring-prod

  • Apr 6, 2021: 14:10 - Notification sent to Slack channel #monitoring-prod

  • Apr 6, 2021: 14:11 - Issue picked up by dev team.

  • Apr 6, 2021: 14:17 - Fix deployed to production.

Root Cause(s)

  • Profile service failed to start

    • A change to the Sentry configuration broke service startup

      • The new Sentry configuration requires Spring Boot 2.1.0 or newer (Profile uses 1.5.2)

    • The build continued despite failures

    • The alert about stage going down (14:05) was obscured by other alerts.
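
The version mismatch described above is the kind of problem a small pre-deployment check could catch. The following is a minimal sketch only, not something the Profile pipeline currently does: the function names are hypothetical, and the two version strings are copied from the root cause above rather than read from any real build file.

```python
import re


def parse_version(text):
    """Turn '1.5.2' or '2.1.0.RELEASE' into a tuple of ints such as (1, 5, 2)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognised version: {text!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(actual, minimum):
    """Return True when `actual` is the same as or newer than `minimum`."""
    return parse_version(actual) >= parse_version(minimum)


# Values taken from this report; a real check would read them from the build.
PROFILE_SPRING_BOOT = "1.5.2"
SENTRY_CONFIG_MINIMUM = "2.1.0"

print(meets_minimum(PROFILE_SPRING_BOOT, SENTRY_CONFIG_MINIMUM))  # prints False
```

A pipeline step could fail the build whenever such a check returns False, so that an incompatible dependency change stops before deployment instead of at service startup.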


Action Items

Action Items

Owner

Find out why the configuration made the service fail

@Marcello Fabbri

Find a working solution to migrate to sentry-spring-boot-starter 4.3.0 without failures

@Marcello Fabbri

Improve the profile pipeline, e.g.:

  • Health checks

  • Integration Tests

  • GHA?

  • ECS?

 

Upgrade spring-boot in profile (not considered worthwhile at the moment, since we will be moving to Cognito)

 


Lessons Learned

  • Test changes properly locally and on stage before pushing to production.
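
One way to back this lesson with automation, in line with the "Health checks" action item, is to gate the production deployment on the service answering a health check on stage. The sketch below is an illustration under assumptions, not the team's actual pipeline: the function names and the health URL mentioned in the comments are hypothetical, and it assumes the service exposes an HTTP endpoint that returns 200 when healthy.

```python
import time
import urllib.error
import urllib.request


def http_probe(url, timeout=5.0):
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx answer: not healthy yet.
        return False


def wait_until_healthy(probe, attempts=10, delay_seconds=3.0, sleep=time.sleep):
    """Call `probe` up to `attempts` times; return True as soon as it succeeds."""
    for attempt in range(1, attempts + 1):
        if probe():
            return True
        if attempt < attempts:
            sleep(delay_seconds)
    return False


def gate_deployment(probe, **kwargs):
    """Return a process exit code: 0 if the service became healthy, 1 otherwise."""
    return 0 if wait_until_healthy(probe, **kwargs) else 1


# Hypothetical usage in a pipeline step (the URL is a placeholder):
#   import sys
#   sys.exit(gate_deployment(lambda: http_probe("https://stage.example.invalid/health")))
```

Passing the probe in as a function keeps the retry logic testable without a network, and the non-zero exit code gives the build a clear signal to stop rather than continue despite failures.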