Date

Authors

catherine.odukale, Joseph (Pepe) Kelly

Status

Documenting

Jira ticket: TIS21-5454

Summary

An error message (“oops something went wrong”) was displayed on the Reval ‘Under Notice’ page. A service deployment affected the system. The service was given more processing resources and then deployed successfully.

Impact

For approximately 90 minutes, some requests involving recommendations (including doctors’ details) failed.

Non-technical Description

Users on Revalidation were receiving error messages, e.g. when searching for ‘Under Notice’ doctors, and no records were showing. This happened because a request to replace the tasks for the recommendation feature, called a “deployment”, was bringing in tasks that were identified as healthy before they were ready to take users’ requests.

A little over an hour after the deployment began, the service eventually reached a point where the tasks taking requests had fully started up before receiving them. The TIS team has been focusing on improvements to the application that were defined as higher priority, so has yet to address the way that tasks are identified as healthy or unhealthy. As part of another incident, more resource was made available to the tasks, which means they are now ready to take requests much more quickly.

Trigger

Re-deployment

Detection

User Notification on Slack

...

Resolution

Service reached stability.

...

Timeline

All times in BST GMT unless indicated

  • 09:33 A deployment was triggered. 2 of 3 tasks reported “unhealthy” in load balancer monitoring. The count fluctuated until 10:30, when it settled at 2 unhealthy tasks.

  • 10:02 Several log entries recorded issues retrieving users’ profile information.

  • 10:30 A user reported getting constant error messages in TIS Reval (“oops something went wrong”).

  • 10:32 Another user reported that the system was very slow and that their under notice list produced the same error message.

  • 11:44 The first responder reported that the issue was being investigated and was likely now resolved: the recommendation service had reached a steady state.

  • 15:26 A user reported that everything was working fine.

Root Cause

  • The “oops” message was being displayed

  • The recommendation service was returning errors

  • Unhealthy tasks were being used

  • Tasks were repeatedly started and failing as part of a deployment
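
The chain above comes down to tasks being reported as healthy before they could serve requests. As an illustration only, here is a minimal Python sketch of a health check that is gated on start-up having finished. The names `ReadinessGate`, `mark_ready`, `health_status` and `routable_tasks` are hypothetical and are not taken from the TIS codebase or from any load balancer API; the sketch assumes the load balancer treats HTTP 200 as healthy and anything else as unhealthy.

```python
import threading


class ReadinessGate:
    """Tracks whether a task has finished starting up (hypothetical helper)."""

    def __init__(self):
        self._ready = threading.Event()

    def mark_ready(self):
        # Call once start-up work (connections, caches, etc.) has completed.
        self._ready.set()

    def health_status(self):
        # Status code a health endpoint could return to the load balancer:
        # 503 until start-up has finished, 200 afterwards.
        return 200 if self._ready.is_set() else 503


def routable_tasks(tasks):
    """Return the names of tasks that should receive user requests."""
    return [name for name, gate in tasks.items() if gate.health_status() == 200]


# Three tasks, as in the timeline; only one has finished starting up.
tasks = {
    "task-a": ReadinessGate(),
    "task-b": ReadinessGate(),
    "task-c": ReadinessGate(),
}
tasks["task-a"].mark_ready()
print(routable_tasks(tasks))  # ['task-a']
```

With a check like this, a task that is still starting is never offered user requests, so a slow start delays the deployment rather than surfacing errors to users.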

Action Items

There are a number of outstanding actions from a similar occurrence which are yet to reach the top of the backlog.

Action Items

Owner

Increased service resources (CPU): Tasks now start more quickly

Joseph (Pepe) Kelly

DONE

Additionally, dependent on service provider support responses, we will recreate the service

Won’t do.

Cards are still outstanding for improving the observability of the services

Joseph (Pepe) Kelly

Jira tickets: TIS21-5282, TIS21-4864, TIS21-5310

...

Lessons Learned