2023-06-22 GMC Client failed and not able to start
Date | Jun 22, 2023 |
Authors | @Joseph (Pepe) Kelly |
Status | Done |
Summary | |
Impact | Unable to see the lists of doctors connected and under notice for revalidation |
Non-technical Description
An overnight task failed, which meant users saw only one trainee in the revalidation search lists. The task kept restarting and retrying until we were able to give it additional resources to work with. It was then able to run and fill out the search lists for recommendations and connections.
Trigger
As part of the overnight sync job, on calling the GMC’s SOAP endpoint GetDoctorsForDB, our service gmc-client-service experienced an out of memory error and crashed
The message to trigger the sync job remained queued, and presumably kept re-triggering the error every time ECS spun up a new task
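The suspected crash loop can be illustrated with a small simulation. This is only a sketch of the mechanism described above, not the service's actual code: the queue, the function names and the 1024 MB requirement are all hypothetical, and the exception simply stands in for the JVM running out of memory.

```python
from collections import deque


class OutOfMemory(Exception):
    """Stands in for the JVM's out of memory error in this simulation."""


def run_task(queue, memory_mb, required_mb):
    """Simulate one task: read the sync message, process it, ack only on success."""
    message = queue[0]  # peek: the message stays queued until it is acked
    if memory_mb < required_mb:
        raise OutOfMemory(f"{message}: needs {required_mb} MB, has {memory_mb} MB")
    queue.popleft()  # ack: only reached when processing succeeds
    return message


def restart_until_done(queue, memory_mb, required_mb, max_restarts):
    """Count how many task launches crash before the queue drains or we give up."""
    crashes = 0
    while queue and crashes < max_restarts:
        try:
            run_task(queue, memory_mb, required_mb)
        except OutOfMemory:
            crashes += 1  # the task is replaced, but the message is still queued
    return crashes


queue = deque(["gmc-sync-start"])
print(restart_until_done(queue, 512, 1024, 5), len(queue))   # every launch crashes
print(restart_until_done(queue, 2048, 1024, 5), len(queue))  # enough memory: drains
```

With too little memory every replacement task picks up the same message and fails the same way; once the task has enough memory the message is consumed on the first attempt.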
Detection
Slack monitoring channel
Also reported & impact confirmed by users
Resolution
Increased memory available to the task.
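As a rough sketch of what this change looks like, the snippet below updates a container definition in the shape ECS uses (a memory value in MiB and an environment list of name/value pairs). The starting definition is a hypothetical stand-in, and the JAVA_TOOL_OPTIONS value shown is an assumption for illustration only; this report does not record the value that was actually set.

```python
import copy


def with_memory(container_def, memory_mib, java_tool_options):
    """Return a copy of a container definition with new memory and JVM options."""
    updated = copy.deepcopy(container_def)
    updated["memory"] = memory_mib  # hard memory limit for the container, in MiB
    # Replace any existing JAVA_TOOL_OPTIONS entry rather than duplicating it.
    env = [e for e in updated.get("environment", []) if e["name"] != "JAVA_TOOL_OPTIONS"]
    env.append({"name": "JAVA_TOOL_OPTIONS", "value": java_tool_options})
    updated["environment"] = env
    return updated


# Hypothetical starting point: 512 MiB and no JVM options set.
before = {"name": "gmc-client-service", "memory": 512, "environment": []}
# Assumed flag value, for illustration; 2048 MiB corresponds to the 2G in the timeline.
after = with_memory(before, 2048, "-XX:MaxRAMPercentage=75.0")
print(after)
```

Working on a copy keeps the original definition available for comparison or rollback.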
Timeline
All times in BST unless indicated
Jun 22, 2023 00:05 : gmc-client-service crashes attempting to run the overnight sync job due to a lack of memory
Jun 22, 2023 01:07 - 08:47 : The monitoring channel showed the task was stopping and being replaced.
Jun 22, 2023 08:53 : User reported (on Teams) revalidation module is showing one person under notice
Jun 22, 2023 09:30 : Stopped the 2-hourly checks of submitted recommendation, shortly after stopped the service temporarily to stop unhelpful logging
Jun 22, 2023 09:41 : Moved sync start messages to new queues for debugging
Jun 22, 2023 09:43 : Found logging to suggest incident started at 00:05 - around the time of the gmc sync job starting
Jun 22, 2023 ~ 9:45 : Stopped gmc-client task on prod
Jun 22, 2023 ~10:00 : Restarted gmc-client task on prod, observed the same debug logs (later appeared to be not relevant), task stopped again.
Jun 22, 2023 10:15 : Changed Log level for gmc-client (set to debug) and pushed to preprod
Jun 22, 2023 11:30 : Added JAVA_TOOL_OPTIONS in task definition, then updated memory from 512M to 2G. As part of deploying this change, the production issue became an issue for our preprod environment
Jun 22, 2023 ~ 11:30 Triggered GMC sync again on preprod. Failed due to memory error when making SOAP request to GMC
Jun 22, 2023 ~12:15 Triggered GMC sync again on preprod after increasing memory allocation, this time it worked
Jun 22, 2023 ~12:20 Identified separate issue with preprod regarding missing queues, reran jenkins build to restore them
Jun 22, 2023 ~12:20 Triggered GMC sync again on prod after increasing memory allocation, this time it worked
Jun 22, 2023 ~12:40 GMC sync appeared healthy on prod and doctors were appearing in connections
5 Whys (or other analysis of Root Cause)
Why were no doctors showing in the revalidation recommendations and connections summary lists for most of the day? - Because the GMC overnight sync had failed
Why had the GMC overnight sync job failed? - Because the gmc-client-service kept crashing and restarting the sync process
Why did the gmc-client-service keep crashing? - Because it was experiencing an out of memory error every time it received a response from GMC
It kept on crashing on startup because ?!?
Why was the gmc-client-service experiencing an out of memory error every time it received a response from GMC? - currently unknown
Why does it take so long for the GMC sync job to repopulate the doctors lists? - Because there’s a bottleneck in the CDC process (lambda)
Action Items
Action Items | Owner | Comments |
---|---|---|
Reproduce error on preprod by spinning up task definition with less memory? | @Cai Willis | There’s a few minutes’ lag between calling the sync endpoint and the sync message showing up in rabbit; be patient and don’t trigger it multiple times |
Dynamic modification of task definition: memory & CPU? | | |
Small tasks/tidy up: | @Joseph (Pepe) Kelly | Schedules reset; Currently no need for parameters being different between environments |
Can we improve the speed of the overnight sync job (particularly the CDC process from MongoDB via. the | @Joseph (Pepe) Kelly |
Lessons Learned
Related pages
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213