Date | 22
---|---
Authors |
Status | Documenting / For development
Summary |
Impact |
Non-technical Description
...
...
Trigger
As part of the overnight sync job, on calling the GMC’s SOAP endpoint GetDoctorsForDB, our service gmc-client-service experienced an out-of-memory error and crashed.
The message to trigger the sync job remained queued, and presumably kept re-triggering the error every time ECS spun up a new task.
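The suspected crash loop can be illustrated with a small simulation. This is only a sketch under stated assumptions: the class and method names, the message name, and the 1024 MB "required" figure are all hypothetical, and the real service's queue handling may differ. It shows why a message that is only removed after successful processing keeps hitting each replacement task while memory is too low.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class CrashLoopSketch {

    /** Hypothetical stand-in for the sync handler: fails when the memory budget is too small. */
    public static void handleSyncMessage(String message, int availableMb, int requiredMb) {
        if (availableMb < requiredMb) {
            throw new IllegalStateException("simulated out-of-memory while handling " + message);
        }
    }

    /**
     * Starts up to maxTasks replacement tasks against the same queue.
     * Returns how many tasks crashed before the message was consumed (or tasks ran out).
     */
    public static int runTasks(Deque<String> queue, int availableMb, int requiredMb, int maxTasks) {
        int crashes = 0;
        for (int task = 0; task < maxTasks && !queue.isEmpty(); task++) {
            String message = queue.peek();   // delivered, but not yet acknowledged
            try {
                handleSyncMessage(message, availableMb, requiredMb);
                queue.poll();                // acknowledged only after success
            } catch (IllegalStateException e) {
                crashes++;                   // task dies; the message stays queued
            }
        }
        return crashes;
    }

    public static void main(String[] args) {
        Deque<String> queue = new ArrayDeque<>();
        queue.add("start-gmc-sync");         // hypothetical message name

        // 512 and 2048 mirror the task memory values in the timeline; 1024 is arbitrary.
        System.out.println("crashes at 512 MB: " + runTasks(queue, 512, 1024, 5));
        System.out.println("message still queued: " + !queue.isEmpty());
        System.out.println("crashes at 2048 MB: " + runTasks(queue, 2048, 1024, 5));
        System.out.println("message still queued: " + !queue.isEmpty());
    }
}
```

With these illustrative numbers, every one of the five simulated tasks crashes at 512 MB and the message stays queued; at 2048 MB the first task consumes it.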
...
Detection
...
Also reported & impact confirmed by users
...
Resolution
Increased memory available to the task.
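When changing the memory available to a JVM task, it may help to confirm what heap limit the JVM actually ends up with, since the container's memory and the JVM's maximum heap are not necessarily the same number. The following is a minimal sketch, not the service's code: the class name HeapReport is hypothetical, and it only uses the standard Runtime and System.getenv calls.

```java
public class HeapReport {

    /** Converts a byte count to whole mebibytes. */
    public static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    /** Builds a one-line summary suitable for a startup log entry. */
    public static String describe(long maxHeapBytes, String toolOptions) {
        String opts = (toolOptions == null || toolOptions.trim().isEmpty())
                ? "(not set)"
                : toolOptions;
        return "max heap = " + toMiB(maxHeapBytes) + " MiB, JAVA_TOOL_OPTIONS = " + opts;
    }

    public static void main(String[] args) {
        // maxMemory() reports the maximum heap the JVM will attempt to use.
        long maxHeap = Runtime.getRuntime().maxMemory();
        System.out.println(describe(maxHeap, System.getenv("JAVA_TOOL_OPTIONS")));
    }
}
```

Logging a line like this at startup would make it easier to tell, on a future occurrence, whether a change to the task definition actually changed the heap the service runs with.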
...
Timeline
All times in BST unless indicated
22 00:05 : gmc-client-service crashes attempting to run the overnight sync job due to a lack of memory
01:07 - 08:47 : The monitoring channel showed the task was stopping and being replaced.
08:53 : A user reported (on Teams) that the revalidation module was showing one person under notice
09:30 : Stopped the 2-hourly checks of submitted recommendations; shortly afterwards, stopped the service temporarily to stop unhelpful logging
09:41 : Moved sync start messages to new queues for debugging
09:43 : Found logging suggesting the incident started at 00:05, around the time the GMC sync job starts
~09:45 : Stopped gmc-client task on prod
~10:00 : Restarted gmc-client task on prod and observed the same debug logs (which later appeared not to be relevant); the task stopped again
10:15 : Changed log level for gmc-client (set to debug) and pushed to preprod
11:30 : Added JAVA_TOOL_OPTIONS in task definition, then updated memory from 512M to 2G. As part of deploying this change, the production issue became an issue for our preprod environment
~11:30 : Triggered GMC sync again on preprod. Failed due to memory error when making SOAP request to GMC
~12:15 : Triggered GMC sync again on preprod after increasing memory allocation, this time it worked
~12:20 : Identified separate issue with preprod regarding missing queues, reran Jenkins build to restore them
~12:20 : Triggered GMC sync again on prod after increasing memory allocation, this time it worked
~12:40 : GMC sync appeared healthy on prod and doctors were appearing in connections
5 Whys (or other analysis of Root Cause)
...
Why were no doctors showing in the revalidation recommendations and connections summary lists for most of the day? - Because the GMC overnight sync had failed
...
Why had the GMC overnight sync job failed? - Because the gmc-client-service kept crashing and restarting the sync process
...
Why did the gmc-client-service keep crashing? - Because it was experiencing an out-of-memory error every time it received a response from GMC
Why it kept crashing on startup is not yet understood.
...
Why was the gmc-client-service experiencing an out-of-memory error every time it received a response from GMC? - Currently unknown
Why does it take so long for the GMC sync job to repopulate the doctors lists? - Because there’s a bottleneck in the CDC process (Lambda)
...
Action Items
Action Items | Owner | Comments
---|---|---
Reproduce error on preprod by spinning up task definition with less memory? | | There’s a few minutes’ lag between calling the sync endpoint and the sync message showing up in RabbitMQ; be patient and don’t trigger it multiple times
Dynamic modification of task definition: memory & CPU? | | 💲 💲 💲 💲 💲 💲
Small tasks/tidy up: | |
Can we improve the speed of the overnight sync job (particularly the CDC process from MongoDB via the | | Jira: TIS21-3271
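The warning about not triggering the sync several times while waiting for the message to appear could, in principle, be enforced with a simple cool-down guard. This is a hypothetical sketch only: the class SyncTriggerGuard, its method, and the 10-minute window are assumptions for illustration and are not part of the existing services.

```java
import java.time.Duration;
import java.time.Instant;

public class SyncTriggerGuard {

    private final Duration coolDown;
    private Instant lastTrigger; // null until the first accepted trigger

    public SyncTriggerGuard(Duration coolDown) {
        this.coolDown = coolDown;
    }

    /**
     * Returns true if a trigger at 'now' should go ahead,
     * or false if a previous trigger is still inside the cool-down window.
     */
    public synchronized boolean tryTrigger(Instant now) {
        if (lastTrigger != null && now.isBefore(lastTrigger.plus(coolDown))) {
            return false;
        }
        lastTrigger = now;
        return true;
    }

    public static void main(String[] args) {
        // The 10-minute window is an arbitrary choice to cover "a few minutes" of lag.
        SyncTriggerGuard guard = new SyncTriggerGuard(Duration.ofMinutes(10));
        Instant first = Instant.now();

        System.out.println("first trigger accepted: " + guard.tryTrigger(first));
        System.out.println("repeat after 2 minutes accepted: "
                + guard.tryTrigger(first.plus(Duration.ofMinutes(2))));
        System.out.println("repeat after 10 minutes accepted: "
                + guard.tryTrigger(first.plus(Duration.ofMinutes(10))));
    }
}
```

Passing the current time in as a parameter keeps the guard easy to test without waiting for real time to pass.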
...