2021-04-06 Profile service failure took down TIS
Date | Apr 6, 2021 |
Authors | @Andy Dingley @Marcello Fabbri (Unlicensed) |
Status | Done |
Summary | The TIS profile service went down |
Impact | TIS could not be used at all |
Non-technical Description
Our “Profile” service, which is used to check user permissions, went down after a breaking change was deployed. Because user permissions could not be checked, the TIS application blocked all user actions. To a user this would have looked like a login failure.
Trigger
A change was deployed to tis-profile, which caused the service to fail to start.
Detection
Slack notification
Resolution
Reverted the breaking change
Timeline
Apr 6, 2021: 14:03 - Breaking change deployed to production.
Apr 6, 2021: 14:05 - Notification sent to Slack channel for stage, #monitoring-prod.
Apr 6, 2021: 14:10 - Notification sent to Slack channel #monitoring-prod.
Apr 6, 2021: 14:11 - Issue picked up by dev team.
Apr 6, 2021: 14:17 - Fix deployed to production.
Root Cause(s)
Profile service failed to start
A change to the Sentry configuration broke the service
The implemented Sentry configuration requires Spring Boot 2.1.0 or newer (Profile uses 1.5.2)
The build continued despite failures
The alert about stage going down (14:05) was obscured by other alerts.
…
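The version mismatch noted above (a configuration that needs Spring Boot 2.1.0 or newer, running on 1.5.2) is the kind of thing a small guard could catch before deployment. The following is a minimal sketch only, not what tis-profile does: the class name `VersionGuard` and its methods are hypothetical, and it assumes versions are dotted numeric strings, optionally with a suffix such as `.RELEASE`.

```java
import java.util.Arrays;

// Hypothetical helper: compares a framework version against a required minimum.
final class VersionGuard {

    // Parses the first three dot-separated components into numbers.
    // Non-numeric suffixes (e.g. "RELEASE", "-SNAPSHOT") are ignored.
    static int[] parse(String version) {
        String[] parts = version.split("\\.");
        int[] numbers = new int[3];
        for (int i = 0; i < 3 && i < parts.length; i++) {
            String digits = parts[i].replaceAll("[^0-9].*$", "");
            numbers[i] = digits.isEmpty() ? 0 : Integer.parseInt(digits);
        }
        return numbers;
    }

    // True when 'actual' is the same as or newer than 'required'.
    static boolean isAtLeast(String actual, String required) {
        // Arrays.compare orders int arrays lexicographically (Java 9+).
        return Arrays.compare(parse(actual), parse(required)) >= 0;
    }
}
```

With the versions from this incident, `VersionGuard.isAtLeast("1.5.2", "2.1.0")` is false, so a build step or startup check using it could stop early with a clear message instead of letting the service fail to start.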
Action Items
Action Items | Owner |
---|---|
Find out why the configuration made the service fail | @Marcello Fabbri (Unlicensed) |
Find a working solution to migrate to sentry-spring-boot-starter 4.3.0 without failures | @Marcello Fabbri (Unlicensed) |
Improve the profile pipeline, e.g.: | |
Upgrade spring-boot in profile. Don’t think this is worth it atm; we’ll be moving to cognito. | |
Lessons Learned
Test changes properly locally and on stage before pushing to production.
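One way to make "test on stage before production" hard to skip is to gate promotion on a health probe of the stage deployment. The sketch below is an illustration under assumptions, not the TIS pipeline: `StageGate`, `waitUntilHealthy` and `httpProbe` are hypothetical names, the health URL would have to be supplied by whoever wires it in, and it assumes Java 11+ for `java.net.http`.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.function.IntSupplier;

// Hypothetical post-deploy gate: only promote when the service answers HTTP 200.
final class StageGate {

    // Calls the probe up to 'attempts' times, pausing 'delay' between tries.
    // Returns true on the first 200, false if every attempt fails.
    static boolean waitUntilHealthy(IntSupplier probe, int attempts, Duration delay) {
        for (int i = 0; i < attempts; i++) {
            if (probe.getAsInt() == 200) {
                return true;
            }
            if (i < attempts - 1) {
                try {
                    Thread.sleep(delay.toMillis());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }

    // Probe returning the HTTP status of a GET, or -1 if the service is unreachable.
    static IntSupplier httpProbe(String url) {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(5))
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(5))
                .GET()
                .build();
        return () -> {
            try {
                return client.send(request, HttpResponse.BodyHandlers.discarding()).statusCode();
            } catch (IOException e) {
                return -1;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return -1;
            }
        };
    }
}
```

A pipeline step could call `waitUntilHealthy(httpProbe(stageHealthUrl), 10, Duration.ofSeconds(15))` and exit non-zero when it returns false, so a service that fails to start on stage blocks the production deploy rather than relying on someone spotting a Slack alert.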
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213