...

An issue occurred with an overnight task which meant users were only seeing one trainee in the revalidation app.

...

Trigger

  • As part of the overnight sync job, our service gmc-client-service called the GMC’s SOAP endpoint GetDoctorsForDB, hit an out-of-memory error, and crashed

  • The message to trigger the sync job remained queued, and presumably kept re-triggering the error every time ECS spun up a new task
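The re-trigger loop described above can be illustrated with a small simulation. This is only a sketch: it assumes the queue keeps a message until a consumer acknowledges it, which this report presumes but does not confirm. The class and method names (RedeliveryDemo, runWithRestarts), the in-memory queue, and the 1024 MiB "required" figure are all hypothetical; only the 512M and 2G allocations come from the timeline.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class RedeliveryDemo {
    /**
     * Simulates a consumer that crashes before acknowledging a message.
     * Returns the number of task restarts that happened before the message
     * was consumed (capped at maxRestarts if it never was).
     */
    static int runWithRestarts(int maxRestarts, int heapMiB, int requiredMiB) {
        // In-memory stand-in for the real queue (assumption, not the real broker).
        Deque<String> queue = new ArrayDeque<>();
        queue.add("start-gmc-sync");

        int restarts = 0;
        while (!queue.isEmpty() && restarts < maxRestarts) {
            // The message is delivered but not yet acknowledged.
            if (heapMiB < requiredMiB) {
                // The task runs out of memory and dies; the message stays queued,
                // so the replacement task picks it up and fails the same way.
                restarts++;
                continue;
            }
            // Acknowledge (remove) the message only after the work succeeded.
            queue.poll();
        }
        return restarts;
    }

    public static void main(String[] args) {
        // requiredMiB = 1024 is an illustrative value, not a measured one.
        System.out.println("512M heap, restarts: " + runWithRestarts(5, 512, 1024));
        System.out.println("2G heap, restarts:   " + runWithRestarts(5, 2048, 1024));
    }
}
```

With the smaller allocation the simulated task restarts until the cap is reached; with the larger one the message is consumed on the first attempt, which matches the behaviour seen once memory was increased.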

...

Detection

...

Resolution

...

Timeline

All times in BST unless indicated

  • 00:05 : gmc-client-service crashed while attempting to run the overnight sync job, due to a lack of memory

  • 01:07 - 08:47 : The monitoring channel showed the task was stopping and being replaced.

  • 08:53 : A user reported (on Teams) that the revalidation module was showing only one person under notice

  • 09:30 : Stopped the 2-hourly checks of submitted recommendations; shortly after, stopped the service temporarily to stop unhelpful logging

  • 09:41 : Moved sync start messages to new queues for debugging

  • 09:43 : Found logs suggesting the incident started at 00:05, around the time the GMC sync job starts

  • ~09:45 : Stopped gmc-client task on prod

  • ~10:00 : Restarted gmc-client task on prod and observed the same debug logs (which later appeared not to be relevant); the task stopped again

  • 10:15 : Changed log level for gmc-client (set to debug) and pushed to preprod

  • 11:30 : Added JAVA_TOOL_OPTIONS to the task definition, then increased memory from 512M to 2G. Deploying this change caused the production issue to appear in our preprod environment as well

  • ~11:30 : Triggered GMC sync again on preprod. It failed with a memory error when making the SOAP request to GMC

  • ~12:15 : Triggered GMC sync again on preprod after increasing the memory allocation; this time it worked

  • ~12:20 : Identified a separate issue with preprod regarding missing queues; reran the Jenkins build to restore them

  • ~12:20 : Triggered GMC sync again on prod after increasing the memory allocation; this time it worked

  • ~12:40 : GMC sync appeared healthy on prod and doctors were appearing in connections
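The memory change made at 11:30 can be sanity-checked from inside the JVM. The following is a minimal sketch, assuming the service's heap limit is influenced by JAVA_TOOL_OPTIONS; the exact options that were set are not recorded in this report, and the class name HeapCheck is hypothetical.

```java
public class HeapCheck {
    /** Maximum heap the JVM will attempt to use, in MiB. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // The JVM reads JAVA_TOOL_OPTIONS at startup; it is also visible
        // to the process as an ordinary environment variable.
        String toolOptions = System.getenv("JAVA_TOOL_OPTIONS");
        System.out.println("JAVA_TOOL_OPTIONS = "
                + (toolOptions == null ? "(not set)" : toolOptions));
        System.out.println("Max heap (MiB)    = " + maxHeapMiB());
    }
}
```

Running this in the container before and after a task-definition change should show whether the larger allocation actually reaches the JVM heap. An option such as `-XX:MaxRAMPercentage=75.0` is one way to tie the heap to the container limit; it is given here purely as an example value, not as the setting used in this incident.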

Root Cause(s)

  • Sudden inability to handle the response from GMC’s GetDoctorsForDB SOAP endpoint, apparently due to a lack of memory

...

Action Items

Action Items

Owner

Contact GMC to verify that nothing has changed with their endpoints (unlikely, but worth checking)

Small tasks/tidy up:

  • Reset cron schedules

  • Make the new (log level) parameters environment-specific

...