Date

27 Jun 2023

Authors

Status

Documenting / For development

Summary

TIS21-46924707

Impact

Non-technical Description

...

Trigger

As part of the overnight sync job, our service gmc-client-service ran out of memory and crashed when calling the GMC’s SOAP endpoint GetDoctorsForDB.
The message that triggers the sync job remained queued, and presumably re-triggered the error each time ECS spun up a new task.
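
The presumed crash loop described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the service's actual code: the function names, the `required_mb` threshold (an arbitrary figure, not a measurement) and the `max_deliveries` cap are all hypothetical. It shows how a message that is only removed from the queue on success is picked up again by every replacement task, and how a delivery cap could park the message instead.

```python
from collections import deque


class OutOfMemory(Exception):
    """Stands in for the out of memory error seen in gmc-client-service."""


def handle_sync(message, memory_mb, required_mb):
    """Hypothetical handler: fails if the task has less memory than it needs."""
    if memory_mb < required_mb:
        raise OutOfMemory(f"{message}: needs {required_mb}MB, has {memory_mb}MB")
    return "synced"


def run_tasks(queue, memory_mb, required_mb, max_restarts, max_deliveries=None):
    """Simulate tasks being replaced after each crash.

    Returns (crashes, parked), where parked holds messages set aside
    after exceeding the assumed delivery cap.
    """
    crashes = 0
    parked = []
    deliveries = {}
    for _ in range(max_restarts):
        if not queue:
            break
        message = queue[0]  # peek: the message is only removed on success
        deliveries[message] = deliveries.get(message, 0) + 1
        if max_deliveries is not None and deliveries[message] > max_deliveries:
            parked.append(queue.popleft())  # set aside for debugging
            continue
        try:
            handle_sync(message, memory_mb, required_mb)
            queue.popleft()  # acknowledged: remove from the queue
        except OutOfMemory:
            crashes += 1  # task dies, message stays queued for the next task
    return crashes, parked
```

With no delivery cap, every replacement task crashes on the same message. With a cap, the message is eventually parked, which resembles the manual step in the timeline of moving the sync start messages to new queues for debugging.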

...

Detection

...

Also reported & impact confirmed by users
...

Resolution

Increased the memory available to the gmc-client task from 512M to 2G.
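
As a rough illustration of how the task memory and JAVA_TOOL_OPTIONS interact, the sketch below models a container definition fragment and estimates the JVM heap ceiling. The timeline records that JAVA_TOOL_OPTIONS was added and memory raised from 512M to 2G, but it does not record the option's value, so `-XX:MaxRAMPercentage=75.0` here is an assumption, and the helper names are hypothetical. The estimate is approximate; the JVM's real figure depends on the memory limit it detects at runtime.

```python
import re


def container_definition(memory_mib, max_ram_percentage):
    """Build an illustrative container definition fragment.

    The JAVA_TOOL_OPTIONS value is assumed; the incident report does not
    state what was actually configured.
    """
    return {
        "name": "gmc-client",
        "memory": memory_mib,  # hard memory limit for the container, in MiB
        "environment": [
            {
                "name": "JAVA_TOOL_OPTIONS",
                "value": f"-XX:MaxRAMPercentage={max_ram_percentage}",
            },
        ],
    }


def approx_max_heap_mib(definition):
    """Estimate the heap ceiling as a percentage of the container memory."""
    options = next(
        entry["value"]
        for entry in definition["environment"]
        if entry["name"] == "JAVA_TOOL_OPTIONS"
    )
    match = re.search(r"-XX:MaxRAMPercentage=([0-9.]+)", options)
    if match is None:
        raise ValueError("MaxRAMPercentage not set in JAVA_TOOL_OPTIONS")
    return definition["memory"] * float(match.group(1)) / 100.0


before = approx_max_heap_mib(container_definition(512, 75.0))
after = approx_max_heap_mib(container_definition(2048, 75.0))
print(f"approx. max heap before: {before} MiB, after: {after} MiB")
```

Under these assumed settings, the heap ceiling grows in proportion to the task memory, which is why raising the task's allocation also raises what the JVM can use without changing the option itself.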

...

Timeline

All times in BST unless indicated

22 Jun 2023 00:05 : gmc-client-service crashed while attempting to run the overnight sync job, due to a lack of memory
22 Jun 2023 01:07 - 08:47 : The monitoring channel showed the task stopping and being replaced
22 Jun 2023 08:53 : User reported (on Teams) that the revalidation module was showing one person under notice
22 Jun 2023 09:30 : Stopped the 2-hourly checks of submitted recommendations; shortly afterwards, stopped the service temporarily to stop unhelpful logging
22 Jun 2023 09:41 : Moved sync start messages to new queues for debugging
22 Jun 2023 09:43 : Found logging suggesting the incident started at 00:05, around the time the GMC sync job starts
22 Jun 2023 ~09:45 : Stopped gmc-client task on prod
22 Jun 2023 ~10:00 : Restarted gmc-client task on prod and observed the same debug logs (which later appeared not to be relevant); the task stopped again
22 Jun 2023 10:15 : Changed the log level for gmc-client to debug and pushed to preprod
22 Jun 2023 11:30 : Added JAVA_TOOL_OPTIONS to the task definition, then updated memory from 512M to 2G. As part of deploying this change, the production issue also became an issue for our preprod environment
22 Jun 2023 ~11:30 : Triggered GMC sync again on preprod. It failed due to a memory error when making the SOAP request to GMC
22 Jun 2023 ~12:15 : Triggered GMC sync again on preprod after increasing the memory allocation; this time it worked
22 Jun 2023 ~12:20 : Identified a separate issue with preprod regarding missing queues; reran the Jenkins build to restore them
22 Jun 2023 ~12:20 : Triggered GMC sync again on prod after increasing the memory allocation; this time it worked
22 Jun 2023 ~12:40 : GMC sync appeared healthy on prod and doctors were appearing in connections

5 Whys (or other analysis of Root Cause)

...

Why were no doctors showing in the revalidation recommendations and connections summary lists for most of the day? - Because the GMC overnight sync had failed

...

Why had the GMC overnight sync job failed? - Because the gmc-client-service kept crashing and restarting the sync process

...

Why did the gmc-client-service keep crashing? - Because it was experiencing an out of memory error every time it received a response from GMC
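It kept crashing on startup for a reason that is not yet confirmed; presumably the still-queued sync message re-triggered the error each time a new task started (see Trigger).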

...

Why was the gmc-client-service experiencing an out of memory error every time it received a response from GMC? - Currently unknown

Why does it take so long for the GMC sync job to repopulate the doctors lists? - Because there’s a bottleneck in the CDC process (lambda)
27 Jun 2023

...

Action Items

TIS21-3271

Action Items	Owner	Comments
Reproduce error on preprod by spinning up task definition with less memory?	Cai Willis	There is a lag of a few minutes between calling the sync endpoint and the sync message showing up in RabbitMQ; be patient and don’t trigger it multiple times
Dynamic modification of task definition: memory & CPU?		💲 💲 💲 💲 💲 💲
Small tasks/tidy up: reset cron schedules; make new (log level) parameters environment-specific?		
Can we improve the speed of the overnight sync job (particularly the CDC process from MongoDB via the `MasterDoctorIndex` to recommendations)?	Joseph (Pepe) Kelly	

...
