Date | |
Authors | |
Status | Done |
Summary | TIS services could not access reference data |
Impact | Some TIS functionality was unavailable during the downtime |
Non-technical Description
Several of the TIS functions rely on access to reference data, such as Site, Title, Gender.
The service which provides the reference data was migrated to a new location, but not all of the existing TIS functions were informed of the new location.
...
Trigger
The TIS-REFERENCE service was stopped on the stage, prod and NIMDTA environments (with the expectation that the ECS instance was already being used).
Detection
Sentry error received in #sentry-esr that the ESR-NOTIFICATIONGENERATOR service could not access trust codes.
User notification that bulk upload was displaying error messages.
...
Resolution
TIS-REFERENCE restarted on prod blue to restore functionality.
Affected services re-deployed (ESR-NOTIFICATIONGENERATOR, TIS-CONCERNS, TIS-PROFILE and TIS-SYNC).
TIS-REFERENCE stopped again on prod blue.
...
Timeline
16:00 - TIS-REFERENCE stopped on all environments.
16:03 - ESR notification generator exception picked up by Sentry and sent as a Slack notification.
16:16 - TIS-REFERENCE restarted on prod blue.
16:19 - Redeployments underway for ESR-NOTIFICATIONGENERATOR, TIS-CONCERNS, TIS-PROFILE and TIS-SYNC.
18:15 - Last of the deployments completed.
12:30 - TIS-REFERENCE stopped on prod blue.
14:41 - Redeployed TIS-GENERIC-UPLOAD following a report that bulk upload was not working.
...
Root Cause(s)
The load balancer config was updated to point to the new ECS instance of TIS-REFERENCE.
The load balancer changes were not deployed to ESR-NOTIFICATIONGENERATOR, TIS-CONCERNS, TIS-GENERIC-UPLOAD, TIS-PROFILE and TIS-SYNC.
Stopping the TIS-REFERENCE service on blue/green then meant that these services could not find TIS-REFERENCE at all, instead of using the ECS instance.
...
Action Items
Action Items | Owner |
---|---|
Include more descriptive sub-task(s) for deploying load balancer changes on future ECS migration tickets | |
...
Lessons Learned
Ensure services aren’t missed when deploying changes
Decommissioning/migrating is a bit more challenging: could we validate no active use as we decommission other services?
The alert notification helped us pick up on the issue much earlier than we would have otherwise
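
The question raised above about validating "no active use" before decommissioning could be approached with a simple pre-stop check. The following is only a minimal sketch, assuming the reference-data address each dependent service is configured with can be gathered into a mapping (how that is gathered is not covered in this report). The function name, the example addresses and the mapping are hypothetical and not part of TIS.

```python
def services_still_on_old_target(configured_targets, old_target):
    """Return the sorted names of services still configured to use the old target."""
    return sorted(
        name
        for name, target in configured_targets.items()
        if target == old_target
    )


# Hypothetical addresses, for illustration only.
OLD_TARGET = "http://reference.old.example"
NEW_TARGET = "http://reference.new.example"

# Hypothetical snapshot of what each dependent service is configured to call.
configured = {
    "TIS-PROFILE": NEW_TARGET,
    "TIS-SYNC": OLD_TARGET,
    "TIS-CONCERNS": OLD_TARGET,
}

blockers = services_still_on_old_target(configured, OLD_TARGET)
if blockers:
    print("Do not stop the old instance yet; still in use by:", ", ".join(blockers))
```

Run before stopping the old instance, a check like this would list any service that has not yet picked up the new configuration, so the decommissioning step can wait until the list is empty.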