2022-03-04 TIS services disrupted
Date | Mar 4, 2022 |
Authors | @Andy Dingley |
Status | Done |
Summary | TIS services could not access reference data |
Impact | Some TIS functionality was unavailable during the downtime |
Non-technical Description
Several of the TIS functions rely on access to reference data, such as Site, Title, Gender.
The service which provides the reference data was migrated to a new location, but not all of the existing TIS functions were informed of the new location.
Trigger
The TIS-REFERENCE service was stopped on stage, prod and NIMDTA environments (with the expectation that the ECS instance was already being used).
Detection
Sentry error received in #sentry-esr that the ESR-NOTIFICATIONGENERATOR service could not access trust codes.
User notification that bulk upload was displaying error messages.
Resolution
TIS-REFERENCE restarted on prod blue to restore functionality.
Affected services re-deployed (ESR-NOTIFICATIONGENERATOR, TIS-CONCERNS, TIS-PROFILE and TIS-SYNC).
TIS-REFERENCE stopped again on prod blue.
Timeline
Mar 4, 2022 16:00 - TIS-REFERENCE stopped on all environments
Mar 4, 2022 16:03 - ESR notification generator exception picked up by Sentry and sent as a Slack notification.
Mar 4, 2022 16:16 - TIS-REFERENCE restarted on prod blue.
Mar 4, 2022 16:19 - Redeployments underway for ESR-NOTIFICATIONGENERATOR, TIS-CONCERNS, TIS-PROFILE and TIS-SYNC
Mar 4, 2022 18:15 - Last of the deployments completed
Mar 7, 2022 12:30 - TIS-REFERENCE stopped on prod blue
Mar 8, 2022 14:41 - Redeployed TIS-GENERIC-UPLOAD following a report that bulk upload was not working
Root Cause(s)
The load balancer config was updated to point to the new ECS instance of TIS-REFERENCE.
The load balancer changes were not deployed to ESR-NOTIFICATIONGENERATOR, TIS-CONCERNS, TIS-GENERIC-UPLOAD, TIS-PROFILE and TIS-SYNC.
Stopping the TIS-REFERENCE service on blue/green then meant that those services could not find TIS-REFERENCE at all, rather than using the ECS instance.
Action Items
Action Items | Owner |
---|---|
Include more descriptive sub-task(s) for deploying load balancer changes on future ECS migration tickets | @Andy Dingley |
Lessons Learned
Ensure services aren’t missed when deploying changes
Decommissioning/migrating is a bit more challenging: could we validate no active use as we decommission other services?
The alert notification helped us pick up on the issue much earlier than we would have otherwise
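One way to "validate no active use" before stopping an old instance is to compare what each dependent service is currently deployed with against the new location. The sketch below is illustrative only: the function names, the URLs, the placeholder service EXAMPLE-ALREADY-MIGRATED and the way the deployed targets are gathered are all assumptions, not part of the TIS tooling. The sample data simply mirrors the services named in the root cause.

```python
from typing import Dict, List

# Hypothetical endpoints; the real blue/green and ECS addresses are not in this report.
OLD_TARGET = "http://reference.blue-green.example"
NEW_TARGET = "http://reference.ecs.example"


def services_still_on_old_target(deployed_targets: Dict[str, str], new_target: str) -> List[str]:
    """Return, in sorted order, the services whose deployed config is not new_target."""
    return sorted(name for name, target in deployed_targets.items() if target != new_target)


def safe_to_stop_old_instance(deployed_targets: Dict[str, str], new_target: str) -> bool:
    """The old instance can be stopped only when no service still depends on it."""
    return not services_still_on_old_target(deployed_targets, new_target)


# Illustrative snapshot of what each service was last deployed with (assumed data).
deployed = {
    "ESR-NOTIFICATIONGENERATOR": OLD_TARGET,
    "TIS-CONCERNS": OLD_TARGET,
    "TIS-GENERIC-UPLOAD": OLD_TARGET,
    "TIS-PROFILE": OLD_TARGET,
    "TIS-SYNC": OLD_TARGET,
    "EXAMPLE-ALREADY-MIGRATED": NEW_TARGET,  # placeholder name, not a real TIS service
}

if __name__ == "__main__":
    pending = services_still_on_old_target(deployed, NEW_TARGET)
    if pending:
        print("Do not stop the old instance yet; redeploy first:", ", ".join(pending))
    else:
        print("No service references the old instance; safe to stop.")
```

A check like this could run as a step on the decommissioning ticket, so that a non-empty list blocks the stop instead of relying on memory of which services were redeployed.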
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213