Date

Authors

Andy Dingley

Status

Done

Summary

TIS services could not access reference data

Impact

Some TIS functionality was unavailable during the downtime

Non-technical Description

Several of the TIS functions rely on access to reference data, such as Site, Title, Gender.
The service which provides the reference data was migrated to a new location, but not all of the existing TIS functions were informed of the new location.

...

Trigger

  • The TIS-REFERENCE service was stopped on stage, prod and NIMDTA environments (with the expectation that the ECS instance was already being used).

Detection

  • Sentry error received in #sentry-esr that the ESR-NOTIFICATIONGENERATOR service could not access trust codes.

  • User notification that bulk upload was displaying error messages.

...

Resolution

  • TIS-REFERENCE restarted on prod blue to restore functionality.

  • Affected services re-deployed (ESR-NOTIFICATIONGENERATOR, TIS-CONCERNS, TIS-PROFILE and TIS-SYNC).

  • TIS-REFERENCE stopped again on prod blue.

...

Timeline

  • 16:00 - TIS-REFERENCE stopped on all environments

  • 16:03 - ESR-NOTIFICATIONGENERATOR exception picked up by Sentry and sent as a Slack notification.

  • 16:16 - TIS-REFERENCE restarted on prod blue.

  • 16:19 - Redeployments underway for ESR-NOTIFICATIONGENERATOR, TIS-CONCERNS, TIS-PROFILE and TIS-SYNC

  • 18:15 - Last of the deployments completed

  • 12:30 - TIS-REFERENCE stopped on prod blue

  • 14:41 - Redeployed TIS-GENERIC-UPLOAD following a report that bulk upload was not working

...

Root Cause(s)

  • The load balancer config was updated to point to the new ECS instance of TIS-REFERENCE

  • The load balancer changes were not deployed to ESR-NOTIFICATIONGENERATOR, TIS-CONCERNS, TIS-GENERIC-UPLOAD, TIS-PROFILE and TIS-SYNC

  • Stopping the TIS-REFERENCE service on blue/green therefore meant that those services could not find TIS-REFERENCE at all, because they were still pointing at the stopped instance rather than the ECS instance.

...

Action Items

  • Include more descriptive sub-task(s) for deploying load balancer changes on future ECS migration tickets (Owner: Andy Dingley)

...

Lessons Learned

  • Ensure services aren’t missed when deploying changes

  • Decommissioning/migrating is more challenging than a normal deployment: could we validate that a service has no active users before we decommission it?

  • The alert notification helped us pick up on the issue much earlier than we would have otherwise
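
One way the "validate no active use" idea might be approached is a pre-decommission check that compares each dependent service's configured TIS-REFERENCE address with the address about to be stopped. The following is a minimal sketch only: the host names, the port, and the idea that each service's configured address can be collected into a simple mapping are all assumptions for illustration, not a description of how TIS is actually configured.

```python
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the lower-cased host part of a URL, or "" if it has none."""
    return (urlsplit(url).hostname or "").lower()


def services_still_using(old_url: str, configured: dict[str, str]) -> list[str]:
    """Return, sorted by name, the services whose configured reference URL
    still targets the host of old_url (the instance about to be stopped)."""
    old_host = host_of(old_url)
    if not old_host:
        raise ValueError(f"no host found in {old_url!r}")
    return sorted(
        name for name, url in configured.items() if host_of(url) == old_host
    )


if __name__ == "__main__":
    # Hypothetical addresses; in practice these would be read from each
    # service's deployed configuration.
    configured = {
        "ESR-NOTIFICATIONGENERATOR": "http://reference.blue.example:8088/api",
        "TIS-SYNC": "http://reference.ecs.example/api",
    }
    blockers = services_still_using("http://reference.blue.example:8088", configured)
    if blockers:
        print("Do not stop TIS-REFERENCE yet; still in use by:", ", ".join(blockers))
```

A check like this only compares hosts, so it would not notice a service that reaches the old instance through a different name or an indirect route; it is meant as a prompt before stopping a service, not a guarantee.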