Date

19

neDate

Authors

Joseph (Pepe) Kelly

Status

Done

Summary

Jira: TIS21-4692

Impact

Unable to see the lists of doctors connected and under notice for revalidation

Non-technical Description

An overnight task failed, which meant users were only seeing one trainee in the revalidation search lists. The task kept restarting and retrying until we were able to give it additional resources to work with. It was then able to run and populate the search lists for recommendations and connections.

...

Trigger

  • As part of the overnight sync job, our service gmc-client-service experienced an out-of-memory error and crashed when calling the GMC’s SOAP endpoint GetDoctorsForDB

  • The message to trigger the sync job remained queued, and presumably kept re-triggering the error every time ECS spun up a new task
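The re-triggering loop described above is the usual shape of a message that can never be processed successfully. As a rough illustration only, the sketch below shows one way to cap redeliveries so that such a message is set aside instead of being retried forever. `RedeliveryPolicy`, `Decision`, and the `maxAttempts` value are hypothetical names, not part of gmc-client-service, and the sketch assumes the delivery count is tracked outside the crashing process (for example by the broker), since an in-memory counter would be lost each time the task is replaced.

```java
/** Hypothetical policy: decide whether a redelivered sync-start message should be processed again. */
final class RedeliveryPolicy {

    enum Decision { PROCESS, DEAD_LETTER }

    /**
     * @param previousDeliveries how many times the message was delivered before this one
     *                           (assumed to be supplied by the broker, not held in memory)
     * @param maxAttempts        how many deliveries are allowed in total (illustrative value)
     */
    static Decision decide(int previousDeliveries, int maxAttempts) {
        if (previousDeliveries < 0 || maxAttempts < 1) {
            throw new IllegalArgumentException("counts must be non-negative and maxAttempts >= 1");
        }
        // Allow processing only while the total number of deliveries stays within the limit.
        return previousDeliveries < maxAttempts ? Decision.PROCESS : Decision.DEAD_LETTER;
    }

    public static void main(String[] args) {
        int maxAttempts = 3; // assumption for the example
        for (int previous = 0; previous <= 4; previous++) {
            System.out.println("previous deliveries " + previous + ": " + decide(previous, maxAttempts));
        }
    }
}
```

With a limit like this, a sync-start message that crashes the task on every attempt would eventually be parked for inspection, at the cost of the sync not running until someone intervenes.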

...

Detection

  • Slack monitoring channel

    • Also reported & impact confirmed by users

...

Resolution

  • Increased memory available to the task.

...

Timeline

All times in BST unless indicated

  • 01:07 - 08:47 : The monitoring channel showed the task repeatedly stopping and being replaced.

  • 08:53 : A user reported (on Teams) that the revalidation module was showing only one person under notice.

  • 09:30 : Stopped the 2-hourly checks of submitted recommendations; shortly afterwards, stopped the service temporarily to stop unhelpful logging.

  • 09:41 : Moved sync start messages to new queues for debugging.

  • 09:43 : Found logging suggesting the incident started at 00:05, around the time the GMC sync job starts.

  • ~09:45 : Stopped the gmc-client task on prod.

  • ~10:00 : Restarted the gmc-client task on prod and observed the same debug logs (which later appeared not to be relevant); the task stopped again.

  • 10:15 : Changed the log level for gmc-client to debug and pushed to preprod.

  • 11:30 : Added JAVA_TOOL_OPTIONS to the task definition, then updated memory from 512M to 2G. In deploying this change, the production issue also became an issue for our preprod environment.

  • ~11:30 : Triggered the GMC sync again on preprod. It failed with a memory error when making the SOAP request to GMC.

  • ~12:15 : Triggered the GMC sync again on preprod after increasing the memory allocation; this time it worked.

  • ~12:20 : Identified a separate issue on preprod with missing queues; reran the Jenkins build to restore them.

  • ~12:20 : Triggered the GMC sync again on prod after increasing the memory allocation; this time it worked.

  • ~12:40 : The GMC sync appeared healthy on prod and doctors were appearing in connections.

Root Cause(s)

  • gmc-client-service crashes when attempting to run the overnight sync job, due to a lack of memory.
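To put the 512M to 2G change in rough numbers, the sketch below estimates the default maximum heap a JVM would pick for each task size. It assumes a container-aware HotSpot JVM using the default `-XX:MaxRAMPercentage` of 25 and no explicit `-Xmx`; the options actually set through JAVA_TOOL_OPTIONS are not recorded here, so the real figures may differ. `HeapEstimate` and `defaultMaxHeapMiB` are illustrative names.

```java
/** Rough estimate of the default max heap for a given container memory limit. */
final class HeapEstimate {

    /**
     * @param containerMiB     memory limit of the task, in MiB
     * @param maxRamPercentage value of -XX:MaxRAMPercentage (HotSpot default is 25.0)
     * @return approximate max heap in MiB, assuming no explicit -Xmx is set
     */
    static long defaultMaxHeapMiB(long containerMiB, double maxRamPercentage) {
        return (long) (containerMiB * maxRamPercentage / 100.0);
    }

    public static void main(String[] args) {
        System.out.println("512 MiB task:  ~" + defaultMaxHeapMiB(512, 25.0) + " MiB heap");
        System.out.println("2048 MiB task: ~" + defaultMaxHeapMiB(2048, 25.0) + " MiB heap");

        // What the JVM running this example actually reports as its heap ceiling.
        long actualMiB = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("This JVM's max heap: " + actualMiB + " MiB");
    }
}
```

Under those assumptions the smaller task would leave only about 128 MiB of heap for the whole GetDoctorsForDB response. Logging `Runtime.getRuntime().maxMemory()` at startup is a cheap way to confirm what the service really gets.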

5 Whys (or other analysis of Root Cause)

  1. Why were no doctors showing in the revalidation recommendations and connections summary lists for most of the day? - Because the GMC overnight sync had failed

  2. Why had the GMC overnight sync job failed? - Because the gmc-client-service kept crashing and restarting the sync process

  3. Why did the gmc-client-service keep crashing? - Because it was experiencing an out of memory error every time it received a response from GMC
    It kept on crashing on startup because ?!?

  4. Why was the gmc-client-service experiencing an out-of-memory error every time it received a response from GMC? - Currently unknown

  • Why does it take so long for the GMC sync job to repopulate the doctors lists? - Because there’s a bottleneck in the CDC process (lambda)

...

Action Items

Action Items

Owner

Comments

Reproduce error on preprod by spinning up task definition with less memory?

Cai Willis

There is a few minutes’ lag between calling the sync endpoint and the sync message showing up in rabbit, so be patient and don’t trigger it multiple times

Dynamic modification of task definition: memory & CPU?

💲 💲 💲 💲 💲 💲

Small tasks/tidy up:

  • Reset cron schedules

  • Make the new (log level) parameters environment-specific?

Joseph (Pepe) Kelly

Schedules reset

Currently no need for parameters being different between environments

Can we improve the speed of the overnight sync job (particularly the CDC process from MongoDB via the MasterDoctorIndex to recommendations)?

Joseph (Pepe) Kelly

Jira: TIS21-3271

...

Lessons Learned