...

  • 01:07 - 08:47 : The monitoring channel showed the task was stopping and being replaced.

  • 08:53 : A user reported (on Teams) that the revalidation module was showing one person under notice

  • 09:30 : Stopped the 2-hourly checks of submitted recommendations; shortly afterwards, stopped the service temporarily to prevent unhelpful logging

  • 09:41 : Moved sync start messages to new queues for debugging

  • 09:43 : Found logging suggesting the incident started at 00:05, around the time the GMC sync job starts

  • ~09:45 : Stopped gmc-client task on prod

  • ~10:00 : Restarted gmc-client task on prod and observed the same debug logs (which later appeared not to be relevant); the task stopped again.

  • 10:15 : Changed the log level for gmc-client (set to debug) and pushed to preprod

  • 11:30 : Added JAVA_TOOL_OPTIONS in task definition, then updated memory from 512M to 2G. As part of deploying this change, the production issue became an issue for our preprod environment

  • 12 ~ 11:10 30 Triggered GMC sync again on preprod. It failed due to a memory error when making the SOAP request to GMC

  • ~12:15 : Triggered GMC sync again on preprod after increasing memory allocation; this time it worked

  • ~12:20 : Identified a separate issue with preprod regarding missing queues; reran the Jenkins build to restore them

  • ~12:20 : Triggered GMC sync again on prod after increasing memory allocation; this time it worked

  • ~12:40 : GMC sync appeared healthy on prod and doctors were appearing in connections

Root Cause(s)

...

Action Items

Action Items

Owner

Small tasks/tidy up:

  • Reset cron schedules

  • Make the new (log level) parameters environment-specific

...

Lessons Learned