Non-technical Description
An issue occurred with an overnight task which meant users were only seeing one trainee in the revalidation app.
Trigger
Detection
Resolution
Timeline
All times in BST unless indicated
:
:
01:07 - 08:47 : The monitoring channel showed the task was stopping and being replaced.
08:53 : User reported (on Teams) revalidation module is showing one person under notice
09:30 : Stopped the 2-hourly checks of submitted recommendation, shortly after stopped the service temporarily to stop unhelpful logging
09:41 : Moved sync start messages to new queues for debugging
09:43 : Found logging to suggest incident started at 00:05 - around the time of the gmc sync job starting
~ 9:45 : Stopped gmc-client task on prod
~10:00 : Restarted gmc-client task on prod, observed the same debug logs (later appeared to be not relevant), task stopped again.
10:15 : Changed Log level for gmc-client (set to debug) and pushed to preprod
11:30 : Added
JAVA_TOOL_OPTIONS
in task definition, then updated memory from 512M to 2G. As part of deploying this change, the production issue became an issue for our preprod environment~ 11:30 Triggered GMC sync again on preprod. Failed due to memory error when making SOAP request to GMC
~12:15 Triggered GMC sync again on preprod after increasing memory allocation, this time it worked
~12:20 Identified separate issue with preprod regarding missing queues, reran jenkins build to restore them
~12:20 Triggered GMC sync again on prod after increasing memory allocation, this time it worked
~12:40 GMC sync appeared healthy on prod and doctors were appearing in connections
Root Cause(s)
Action Items
Action Items | Owner | |
---|---|---|
Small tasks/tidy up:
|
0 Comments