2022-03-04 TIS services disrupted
Date | Mar 4, 2022 |
Authors | @Andy Dingley |
Status | Done |
Summary | TIS services could not access reference data |
Impact | Some TIS functionality was unavailable during the downtime |
Non-technical Description
Several TIS functions rely on access to reference data, such as Site, Title and Gender.
The service which provides the reference data was migrated to a new location, but not all of the existing TIS functions were informed of the new location.
Trigger
The TIS-REFERENCE service was stopped on stage, prod and NIMDTA environments (with the expectation that the ECS instance was already being used).
Detection
Sentry error received in #sentry-esr that the ESR-NOTIFICATIONGENERATOR service could not access trust codes.
User notification that bulk upload was displaying error messages.
Resolution
TIS-REFERENCE restarted on prod blue to restore functionality.
Affected services re-deployed (ESR-NOTIFICATIONGENERATOR, TIS-CONCERNS, TIS-PROFILE and TIS-SYNC).
TIS-REFERENCE stopped again on prod blue.
Timeline
Mar 4, 2022 16:00 - TIS-REFERENCE stopped on all environments.
Mar 4, 2022 16:03 - ESR notification generator exception picked up by Sentry and sent as a Slack notification.
Mar 4, 2022 16:16 - TIS-REFERENCE restarted on prod blue.
Mar 4, 2022 16:19 - Redeployments underway for ESR-NOTIFICATIONGENERATOR, TIS-CONCERNS, TIS-PROFILE and TIS-SYNC.
Mar 4, 2022 18:15 - Last of the deployments completed.
Mar 7, 2022 12:30 - TIS-REFERENCE stopped on prod blue.
Mar 8, 2022 14:41 - Redeployed TIS-GENERIC-UPLOAD following a report that bulk upload was not working.
Root Cause(s)
The load balancer config was updated to point to the new ECS instance of TIS-REFERENCE.
The load balancer changes were not deployed to ESR-NOTIFICATIONGENERATOR, TIS-CONCERNS, TIS-GENERIC-UPLOAD, TIS-PROFILE and TIS-SYNC.
Stopping the TIS-REFERENCE service on blue/green then meant that these services could no longer find it, because they were still pointing at the stopped instance rather than the ECS instance.
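The gap described above can be expressed as a simple set difference. The following is a minimal sketch, not part of the TIS tooling: the function names and the idea of keeping a list of dependents are assumptions made for illustration, while the service names come from this report.

```python
# Hypothetical pre-decommission check. The service names are taken from this
# incident; the data structures and function names are illustrative only.

# Services listed in this report as calling TIS-REFERENCE.
DEPENDENTS = {
    "ESR-NOTIFICATIONGENERATOR",
    "TIS-CONCERNS",
    "TIS-GENERIC-UPLOAD",
    "TIS-PROFILE",
    "TIS-SYNC",
}


def services_missing_redeploy(dependents, redeployed):
    """Return, sorted, the dependents that have not picked up the new config."""
    return sorted(set(dependents) - set(redeployed))


def safe_to_stop(dependents, redeployed):
    """The old instance is only safe to stop once no dependent is outstanding."""
    return not services_missing_redeploy(dependents, redeployed)


# State after the Mar 4 redeployments, according to the timeline.
redeployed_mar_4 = {
    "ESR-NOTIFICATIONGENERATOR",
    "TIS-CONCERNS",
    "TIS-PROFILE",
    "TIS-SYNC",
}

print(services_missing_redeploy(DEPENDENTS, redeployed_mar_4))
print(safe_to_stop(DEPENDENTS, redeployed_mar_4))
```

With the Mar 4 state as input, the only outstanding service is TIS-GENERIC-UPLOAD, which matches the bulk upload failure reported after TIS-REFERENCE was stopped again on Mar 7.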
Action Items
Action Items | Owner |
---|---|
Include more descriptive sub-task(s) for deploying load balancer changes on future ECS migration tickets | @Andy Dingley |
Lessons Learned
Ensure services aren’t missed when deploying changes
Decommissioning/migrating is a bit more challenging: could we validate no active use as we decommission other services?
The alert notification helped us pick up on the issue much earlier than we would have otherwise
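One way to approach the "validate no active use" question is to confirm that the old instance has received no requests for an agreed quiet period before stopping it. The sketch below assumes request timestamps are available from some source such as access logs or load balancer metrics. That source, the function name and the 24 hour default are all hypothetical.

```python
from datetime import datetime, timedelta


def has_recent_traffic(request_times, now, quiet_period=timedelta(hours=24)):
    """Return True if any request reached the old instance within the quiet period.

    request_times: iterable of datetime values, one per observed request
                   (hypothetical input, e.g. parsed from access logs).
    now:           the moment at which the decommission is being considered.
    """
    cutoff = now - quiet_period
    return any(t >= cutoff for t in request_times)


# Example: a request seen 30 minutes before the planned stop blocks the decommission.
planned_stop = datetime(2022, 3, 4, 16, 0)
observed = [datetime(2022, 3, 4, 15, 30)]

if has_recent_traffic(observed, planned_stop):
    print("Old instance still in use: do not stop it yet")
else:
    print("No recent traffic: safe to proceed")
```

A check like this complements the existing Sentry alerting: the alert detects a problem after the service is stopped, whereas the traffic check would flag remaining callers beforehand.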
Related pages
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213