Date

Authors

Joseph (Pepe) Kelly

Status

Resolved

Summary

Unable to see connected doctors on the “current connections” page

Impact

  1. A small number of doctors were not being updated from the GMC when they were connected outside of TIS Revalidation between 6th October and 2nd November

  2. Errors were sometimes returned for search lists and for a doctor's details page, for up to 3 hours on Thursday 2nd November

Non-technical Description

A Revalidation Admin drew our attention to some data that did not appear to be correct. In early October, we corrected how we handle whether a doctor is “under notice” for revalidation. As we only receive partial information from the GMC, we removed stale and potentially incorrect data from a number of records, and we now proactively remove unreliable data whenever we receive new information. One of our services relied on always knowing whether or not a doctor was “under notice”. That was no longer possible, so the service failed to keep those doctors’ records updated.

This meant that a small number of doctors *** who were connected to an NHS E designated body outside of TIS Revalidation after 5th October did not appear as connected until Friday 3rd November, the first sync after releasing a fix. While investigating the problem on 2nd November, we came across an intermittent issue loading pages. This was partly resolved at first, and fully resolved within 3 hours of being noticed by the team and within 2 hours of being reported by users.

Having addressed the bug, we have identified other places in the code that can be improved in the same way. For the issue of partial failures leading to error messages, we will put in place monitoring that lets us detect when there is an issue for a single user, so that we can respond before it becomes a more widespread problem.

...

Trigger

  • Updating under notice

  • Service Degradation?

Detection

  • User query on Teams

Resolution

  • Bugfix released 2nd Nov pm

  • Redeploying the ?recommendation? service, whose tasks were taking a long time to respond

...

Timeline

All times in BST unless indicated

  • : Change released to correct data and processing for whether doctors are “under notice”.

  • ~00:10 : Logs show approx. 2 “Null Pointer Exception”s, which mean we don’t show 2 doctors as connected.

  • : 18:03 User queries connection status - Trainee is showing connected on GMC Connect but not TIS Revalidation.

  • : 10:47 First responder asked when the trainee was last connected via GMC Connect.

  • : 10:57 User reported that the trainee is not even in the connections list on TIS but is on GMC Connect - 18/10/2023.

  • : 11:04 User reported that another doctor was connected the day before but remains on the discrepancy list as if they weren’t connected. It initially appeared to be a different issue to the doctor listed above.

  • : 11:40 Responder reported that the doctors’ records in Revalidation were last updated on 18/10 and yesterday (1/11/23).

  • : Attempt to manually get doctors for debugging failed: the request was blocked, possibly because of earlier bad requests.

  • : Manual verification of GMC responses showed 1 doctor was not in the list of connected doctors but the other was.

  • : Further debugging identified the cause.

  • : While investigating, errors in the Reval app pointed to another issue: gateway timeouts some of the time.

  • : Investigation

  • 11:35 : replaced recommendation tasks in production

  • 11:53 : replaced connection tasks in production

  • 12:06 : replaced integration & core tasks in production

  • :

Root Cause(s)

Doctors weren’t appearing as Connected because they were marked as existsInGmc=false with no connected Designated Body, even though they were shown as connected in GMC Connect.
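The timeline mentions “Null Pointer Exception”s, and the description notes that one service assumed the “under notice” value was always known. The sketch below illustrates how that kind of assumption can fail in Java; it is not the service's actual code. The `DoctorUpdate` record, its fields, and both `describe…` methods are hypothetical names introduced only for illustration.

```java
public class UnderNoticeSketch {

    // Hypothetical update message: underNotice is null when the GMC data
    // does not tell us whether the doctor is under notice.
    record DoctorUpdate(String gmcNumber, Boolean underNotice, String designatedBody) {}

    // Fragile: auto-unboxing a null Boolean throws NullPointerException,
    // so any processing after this line never runs for that doctor.
    static String describeFragile(DoctorUpdate update) {
        boolean underNotice = update.underNotice(); // NPE when the value is null
        return update.gmcNumber() + " underNotice=" + underNotice;
    }

    // Null-safe: treat "unknown" as its own state and carry on.
    static String describeSafe(DoctorUpdate update) {
        Boolean value = update.underNotice();
        String underNotice = (value == null) ? "unknown" : Boolean.toString(value);
        return update.gmcNumber() + " underNotice=" + underNotice;
    }

    public static void main(String[] args) {
        // Placeholder values, not real identifiers.
        DoctorUpdate partial = new DoctorUpdate("0000000", null, "1-ABC");
        try {
            describeFragile(partial);
        } catch (NullPointerException e) {
            System.out.println("fragile path failed: NullPointerException");
        }
        System.out.println(describeSafe(partial));
    }
}
```

Modelling “unknown” explicitly means that a missing value no longer aborts the rest of the update.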

...

The connections service sometimes didn’t respond in the allowable time, possibly because it couldn’t be reached. There were no indications of an HTTP 504 in our services? We note the correlation with “Unhealthy Routing Flow Count” for 2 of our 3 availability zones. This would explain why some actions were successful.
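Because only some availability zones appeared to be affected, a repeated request could succeed where the first one timed out. The following is a minimal sketch of a bounded retry on timeout, under that assumption. The `callWithRetry` helper and the simulated service are hypothetical and do not describe how the Reval services currently behave.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeoutException;

public class RetrySketch {

    // Calls the action up to maxAttempts times, retrying only on timeouts.
    // Any other exception is propagated immediately.
    static <T> T callWithRetry(Callable<T> action, int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        TimeoutException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (TimeoutException e) {
                last = e; // remember the failure and try again
            }
        }
        throw last; // every attempt timed out
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated service: the first two calls time out, the third succeeds.
        String result = callWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new TimeoutException("simulated gateway timeout");
            }
            return "connections loaded";
        }, 3);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

A retry like this only masks an intermittent routing fault, so it would complement, not replace, alerting on the underlying metric.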

...

Action Items

Action Items

Owner

Alert on “Unhealthy Routing Flow Count”

catherine.odukale (Unlicensed)

Story

Could the error be more friendly… e.g. Timeouts “retry & contact if it keeps happening”

Conversation facilitated by catherine.odukale (Unlicensed)

Refine/ Possible Story

Extend/Improve reach of X-Ray service to better detect the location of failures

catherine.odukale (Unlicensed)

Story

Review Sentry and mark issues appropriately so we are alerted

Joseph (Pepe) Kelly Now…

Done with more cautious alerting & marking old issues resolved

Use MapStruct throughout Reval services

catherine.odukale (Unlicensed)

Story

...

Lessons Learned