Date |
Authors |
Status | Done
Summary | The GMC sync job doesn’t update the indexed trainee data in ES, so users might see some stale data.
Impact | The recommendations search page was not being updated for a number of hours through the day
Non-technical Description
The list on the revalidation recommendation doctors page is maintained as a searchable index, built by aggregating updates from TIS with the GMC data we store ourselves. The mechanism for keeping this aggregated data up to date had become blocked, so no updates were coming through to this list. The details pages for each doctor were unaffected and recommendations could still be made without issue.
...
Trigger
GMC doctors that should have been under notice were not showing up in the “Under Notice” list.
...
Detection
Messages from 2 users (assumed to have a similar / the same root cause?)
...
Resolution
Got the index up to date
Performed a “force redeployment” update of the recommendation service
Gave users links to details pages for doctors which could not be found in the list
...
Timeline
BST unless otherwise stated
- 01:05 Slack alert on failed job; checked database and ElasticSearch
Aug 30, 2022 - 09:07 Sync was started manually and it still failed
Aug 31, 2022 - 01:05 Slack alert on failed job
Aug 31, 2022 - 10:21 Found Sentry alerts for the Profile service
Sep 01, 2022 - 01:05 Slack alert on failed job
Sep 01, 2022 - 11:39 PR was merged to remove doctors with a null GMC number from the list sent to the Profile service, but it didn’t fix the issue
Sep 01, 2022 - 12:01 Found that the TIS-GMC-SYNC image running on Prod was built 3 years ago
Sep 01, 2022 - 12:15 PR was merged to bump the build version of the TIS-GMC-SYNC service on TIS-DEVOPS, but it didn’t fix the issue
Sep 01, 2022 - 13:32 Image version issue was fixed on Prod, but Slack still alerted on the failed job
Sep 01, 2022 - 14:29 PR was merged to add debug logging for duplicate GMC numbers
Sep 01, 2022 - 15:01 TIS-GMC-SYNC job failed again, but the duplicate GMC number was logged
Sep 01, 2022 - 15:30ish The duplicate GMC number was identified from curl output
Sep 01, 2022 - 15:37 The PR adding debug logging was reverted
Sep 05, 2022 - 11:30ish Got a reply from GMC and looked into the logs from midnight. We were still getting duplicates. Rob sent a further email to GMC.
Sep 13, 2022 - 10:15ish We checked the logs and no duplicates were found. https://hee-nhs-tis.slack.com/archives/C03GBMYGZD4/p1663060704518199?thread_ts=1663060383.191659&cid=C03GBMYGZD4
Sep 20, 2022 - Sentry monitoring was added for duplicates from GMC.
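The duplicate logging and Sentry alert added above are not reproduced in this report. A minimal sketch of that kind of check, assuming the io.sentry Java SDK is initialised elsewhere and using illustrative type and field names, might look like:

```java
import io.sentry.Sentry;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class DuplicateGmcNumberMonitor {

  // Illustrative shape of a doctor record received from GMC.
  record GmcDoctor(String gmcReferenceNumber) {}

  /** Report any GMC number that appears more than once in a sync batch. */
  public static void reportDuplicates(List<GmcDoctor> doctors) {
    Map<String, Long> counts = doctors.stream()
        .collect(Collectors.groupingBy(GmcDoctor::gmcReferenceNumber, Collectors.counting()));

    counts.forEach((gmcNumber, count) -> {
      if (count > 1) {
        // Surface the duplicate in Sentry so it is visible without trawling CloudWatch logs.
        Sentry.captureMessage(
            "Duplicate doctor received from GMC: " + gmcNumber + " appeared " + count + " times");
      }
    });
  }
}
```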
Root Cause(s)
We received duplicate doctors from GMC with the same GMC id for HEE South West (1-AIIDMQ). The GMC id is 7134553. The doctor was sent to us twice, which had never happened before and is not expected. Based on the assumption that GMC would not send us duplicates, the Profile service uses the GMC number as the key of a Map without filtering out duplicate GMC numbers, which causes the error (a minimal sketch of this failure mode follows the timeline entries below).

02:26 Earliest identifiable point of “something going wrong” - still unknown
02:26 to 08:07 - Queue of ‘doctor view’ updates to the recommendation service built steadily to ~83K messages
08:21 - First report in user channel
12:07 - Picked up for investigation
12:07 to 14:00ish - Checked database & ElasticSearch index
13:00 - Checked the list returned by GMC for North West
14:00ish - Found that messages in reval.queue.masterdoctorview.updated.recommendation were not being consumed
14:14ish - Having found that there were “recommendation status checks” being processed from overlapping runs, we reduced the frequency of checking for updates on submitted recommendations
16:00ish - Forced a new start of the recommendation service
16:00ish - Identified that RabbitMQ was reporting 0 consumers
17:00ish to 18:00ish - Identified error in queue declaration. Raised, merged and pushed a PR
18:00ish - Noticed that messages were briefly consumed on startup but the number of consumers quickly dropped to 0
18:40ish - Final redeploy of recommendation service cleared out backlog and appeared to restore consumers stably
18:40ish - Identified that there were still some discrepancies in the data between the masterdoctorindex and recommendation index; decided to wait until after the overnight doctor sync to do a quick reindex
~07:30 - Checked for doctors that were mentioned and found both were appearing
09:08 - Informed users of reindex (brief downtime expected)
11:09 - Reindex complete, service restored
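As referenced in the root cause above, a Map keyed on GMC number cannot absorb a duplicate doctor. The following is a minimal, self-contained illustration of that failure mode and one defensive option; it is not the actual Profile service code.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class DuplicateGmcKeyExample {

  // Illustrative doctor record, not the real Profile service model.
  record Doctor(String gmcNumber, String name) {}

  public static void main(String[] args) {
    // GMC sent the same doctor (GMC number 7134553) twice for HEE South West.
    List<Doctor> doctors = List.of(
        new Doctor("7134553", "Example Doctor"),
        new Doctor("7134553", "Example Doctor"));

    // The failing assumption: without a merge function, Collectors.toMap throws
    // IllegalStateException ("Duplicate key ...") when two elements map to the same key.
    // Map<String, Doctor> byGmcNumber = doctors.stream()
    //     .collect(Collectors.toMap(Doctor::gmcNumber, doctor -> doctor));

    // One defensive option: keep the first occurrence and ignore the duplicate.
    Map<String, Doctor> byGmcNumber = doctors.stream()
        .collect(Collectors.toMap(Doctor::gmcNumber, doctor -> doctor, (first, second) -> first));

    System.out.println(byGmcNumber.size()); // 1
  }
}
```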
...
Root Cause(s)
Doctors reported as not showing in the search list
ElasticSearch Index for Recommendation Service isn’t being updated
Large backlog of messages stuck on a queue for updating the index
Message consumers disappeared but, after the final aws ecs update-service --force-new-deployment, dropped to one before going back up to 3. This may have been related to load & thread starvation (see the configuration sketch after this list).
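The service’s actual messaging configuration is not included in this report, but as a hedged sketch of the moving parts: in Spring AMQP, re-declaring an existing queue with different arguments fails with PRECONDITION_FAILED and closes the channel (one plausible form of the “error in queue declaration” noted in the timeline), and explicit, bounded consumer concurrency limits how hard a large backlog competes for threads. Bean names, concurrency numbers and builder options below are illustrative assumptions, not the real configuration.

```java
import org.springframework.amqp.core.Queue;
import org.springframework.amqp.core.QueueBuilder;
import org.springframework.amqp.rabbit.config.SimpleRabbitListenerContainerFactory;
import org.springframework.amqp.rabbit.connection.ConnectionFactory;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class RecommendationQueueConfig {

  // The declaration must match what already exists on the broker: re-declaring a queue
  // with different durability or x-* arguments fails with PRECONDITION_FAILED, the channel
  // closes, and the listener can end up with zero consumers.
  @Bean
  public Queue masterDoctorViewUpdatedQueue() {
    return QueueBuilder.durable("reval.queue.masterdoctorview.updated.recommendation").build();
  }

  // Bounded, explicit concurrency and prefetch so a large backlog (~83K messages in this
  // incident) is drained steadily without starving threads needed by other work.
  @Bean
  public SimpleRabbitListenerContainerFactory rabbitListenerContainerFactory(
      ConnectionFactory connectionFactory) {
    SimpleRabbitListenerContainerFactory factory = new SimpleRabbitListenerContainerFactory();
    factory.setConnectionFactory(connectionFactory);
    factory.setConcurrentConsumers(2);
    factory.setMaxConcurrentConsumers(3);
    factory.setPrefetchCount(50);
    return factory;
  }
}
```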
...
Action Items
Action Items | Comments | Owner
---|---|---
Send an email or message to check with GMC | |
Do we want to improve the CI/CD process for TIS-GMC-SYNC? | The effort may not be worth it to fix this ATM |
Investigate why the exceptions were not recorded in Profile Cloudwatch logs | We need a ticket to look into this. We came to the issue from Sentry |
Monitoring for queue depth, consumption or some other combined metric to say whether messages are being processed “acceptably”, e.g. number of consumers, message rate while there are messages, message depth for a period of time / latency | | Cai Willis, Joseph (Pepe) Kelly
Test / Replicate in stage… by loading c.100K messages onto the queue that was affected to see if this was likely a cause of the defect (see the sketch after this table) | |
Check if this issue affects new revalidation | Do we need to double check what the new Reval Sync does with the duplicates received from GMC? |
Isolate duplicates in TIS-GMC-SYNC service | Done |
Create Sentry alert for TIS-GMC-SYNC to capture logging when GMC sends duplicates | Done |
Compare indexes/indices (composite doctor view & recommendation view) to check for the potential scale of the problem | |
Can we get extra application insights / metrics about thread usage & other resources? | |
Consider a “drop & rebuild” of the index? | Will consider as part of |
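For the stage replication action above, a minimal sketch of loading a c.100K message backlog onto the affected queue could look like the following. It assumes a Spring-managed RabbitTemplate configured for the stage broker; the class name and message payload are illustrative.

```java
import org.springframework.amqp.rabbit.core.RabbitTemplate;
import org.springframework.stereotype.Component;

@Component
public class QueueBacklogLoadTest {

  private static final String QUEUE = "reval.queue.masterdoctorview.updated.recommendation";

  private final RabbitTemplate rabbitTemplate;

  public QueueBacklogLoadTest(RabbitTemplate rabbitTemplate) {
    this.rabbitTemplate = rabbitTemplate;
  }

  /** Publish roughly the c.100K messages called for in the action item. */
  public void publishBacklog() {
    for (int i = 0; i < 100_000; i++) {
      // Publishing via the default exchange routes directly to the queue by name.
      rabbitTemplate.convertAndSend(QUEUE, "{\"gmcReferenceNumber\":\"load-test-" + i + "\"}");
    }
  }
}
```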
...
Lessons Learned
Add more debug logging if the current logging is not enough to identify the cause.
Check Sentry if logs are not found as expected in CloudWatch.
We need more monitoring on our RabbitMQ queues and activity (see the monitoring sketch below).
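As a possible starting point for that monitoring, here is a sketch of a simple check against the RabbitMQ management HTTP API. The host, credentials, queue choice and string-matching “parsing” are illustrative placeholders; a real check would use a JSON parser and feed Sentry/CloudWatch alerting rather than stdout.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class QueueHealthCheck {

  // Management API endpoint for the affected queue on the default vhost ("%2F" = "/").
  // Host and port are placeholders for wherever the management plugin is exposed.
  private static final String QUEUE_URL =
      "http://localhost:15672/api/queues/%2F/reval.queue.masterdoctorview.updated.recommendation";

  public static void main(String[] args) throws Exception {
    HttpRequest request = HttpRequest.newBuilder(URI.create(QUEUE_URL))
        .header("Authorization",
            "Basic " + Base64.getEncoder().encodeToString("guest:guest".getBytes()))
        .GET()
        .build();

    HttpResponse<String> response =
        HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    String body = response.body();

    // Crude string checks on the returned JSON; the failure mode in this incident was
    // a growing backlog with zero consumers, which should page someone immediately.
    boolean noConsumers = body.contains("\"consumers\":0");
    boolean emptyQueue = body.contains("\"messages\":0");
    if (noConsumers && !emptyQueue) {
      System.err.println("ALERT: messages are queued but there are no consumers");
    } else {
      System.out.println("Queue looks healthy");
    }
  }
}
```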