Non-technical Description
An issue occurred with an overnight task which meant users were only seeing one trainee in the revalidation app.
Trigger
As part of the overnight sync job, on calling the GMC’s SOAP endpoint GetDoctorsForDB, our service gmc-client-service experienced an out-of-memory error and crashed. The message to trigger the sync job remained queued, and presumably kept re-triggering the error every time ECS spun up a new task.
Detection
Resolution
Timeline
All times in BST unless otherwise indicated
00:05 : gmc-client-service crashes attempting to run the overnight sync job due to a lack of memory
01:07 - 08:47 : The monitoring channel showed the task was stopping and being replaced.
08:53 : A user reported (on Teams) that the revalidation module was showing only one person under notice
09:30 : Stopped the 2-hourly checks of submitted recommendations; shortly after, stopped the service temporarily to stop unhelpful logging
09:41 : Moved sync start messages to new queues for debugging
09:43 : Found logging suggesting the incident started at 00:05, around the time the GMC sync job starts
~09:45 : Stopped gmc-client task on prod
~10:00 : Restarted gmc-client task on prod, observed the same debug logs (which later appeared not to be relevant), task stopped again
10:15 : Changed log level for gmc-client (set to debug) and pushed to preprod
11:30 : Added JAVA_TOOL_OPTIONS in task definition, then updated memory from 512M to 2G. As part of deploying this change, the production issue became an issue for our preprod environment
~11:30 : Triggered GMC sync again on preprod. Failed due to memory error when making SOAP request to GMC
~12:15 : Triggered GMC sync again on preprod after increasing memory allocation, this time it worked
~12:20 : Identified separate issue with preprod regarding missing queues, reran Jenkins build to restore them
~12:20 : Triggered GMC sync again on prod after increasing memory allocation, this time it worked
~12:40 : GMC sync appeared healthy on prod and doctors were appearing in connections
Root Cause(s)
Sudden inability to handle the response from GMC’s GetDoctorsForDB SOAP endpoint, apparently due to a lack of memory
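Because the root cause points at memory, it can help to confirm from inside the JVM that a change like the 512M to 2G increase actually reached the heap. The following is a minimal sketch, not code from gmc-client-service: the class name HeapReport and its methods are hypothetical, and it relies only on the standard Runtime.getRuntime().maxMemory() call and the JAVA_TOOL_OPTIONS environment variable mentioned in the timeline.

```java
// Hypothetical helper, not part of gmc-client-service.
final class HeapReport {

    /** Converts a byte count to whole mebibytes. */
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    /** Maximum heap the running JVM will attempt to use, in MiB. */
    static long maxHeapMiB() {
        return toMiB(Runtime.getRuntime().maxMemory());
    }

    public static void main(String[] args) {
        // The JVM reads JAVA_TOOL_OPTIONS at startup; log it so the
        // output shows which options (if any) were present.
        String toolOptions = System.getenv("JAVA_TOOL_OPTIONS");
        System.out.println("JAVA_TOOL_OPTIONS="
                + (toolOptions == null ? "(not set)" : toolOptions));

        // Heap ceiling as seen by this JVM, which may be well below
        // the memory allocated to the container.
        System.out.println("Max heap (MiB): " + maxHeapMiB());
    }
}
```

Run at startup, or as a one-off in the same container image, this would log the options the JVM picked up and the resulting heap ceiling. The values printed depend on the JVM version and on the flags in JAVA_TOOL_OPTIONS, which this log does not record.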
Action Items
Action Items | Owner | |
---|---|---|
Contact GMC to verify nothing had changed with their endpoints (unlikely? but worth checking) | | |
Small tasks/tidy up: | | |