Date	Nov 23, 2022
Authors	@Cai Willis @Yafang Deng @Jayanta Saha @Joseph (Pepe) Kelly
Status	Done
Summary	https://hee-tis.atlassian.net/browse/TIS21-3846
Impact	The recommendations search page was not being updated for a number of hours through the day

Non-technical Description

The list on the revalidation recommendation doctors page is held in a searchable index, which is built by aggregating updates to TIS data and the GMC data we store ourselves. The mechanism that keeps this aggregated data up to date had become blocked, so no updates were reaching the list. The details pages for each doctor were unaffected and recommendations could be made without issue.

Trigger

GMC doctors that should have been under notice were not showing up in the “Under Notice” list.

Detection

Messages from 2 users

Resolution

Ran a “force redeployment” update of the recommendation service
Gave users direct links to the details pages of doctors who could not be found in the list

Timeline

BST unless otherwise stated

Nov 23, 2022 02:26 - Earliest identifiable point of “something going wrong” (cause still unknown)
Nov 23, 2022 02:26 to 08:07 - The queue of ‘doctor view’ updates to the recommendation service built steadily to ~83K messages
Nov 23, 2022 08:21 - First report in user channel
Nov 23, 2022 12:07 - Picked up for investigation
Nov 23, 2022 12:07 to 14:00ish - Checked database & ElasticSearch index
Nov 23, 2022 13:00 - Checked the return list of GMC for north west
Nov 23, 2022 14:00ish - Found that messages in reval.queue.masterdoctorview.updated.recommendation were not being consumed
Nov 23, 2022 14:14ish - Having found that there were “recommendation status checks” being processed from overlapping runs, we reduced the frequency of checking for updates on submitted recommendations
Nov 23, 2022 16:00ish - Forced a new start of the recommendation service
Nov 23, 2022 16:00ish - Identified that RabbitMQ was reporting 0 consumers
Nov 23, 2022 17:00ish to 18:00ish - Identified error in queue declaration. Raised, merged and pushed a PR
Nov 23, 2022 18:00ish - Noticed that messages were briefly consumed on startup but the number of consumers quickly dropped to 0
Nov 23, 2022 18:40ish - Final redeploy of recommendation service cleared out backlog and appeared to restore consumers stably
Nov 23, 2022 18:40ish - Identified that there were still some discrepancies in the data between masterdoctorindex and the recommendation index; decided to wait until after the overnight doctor sync to do a quick reindex
Nov 24, 2022 ~07:30 - Checked for doctors that were mentioned and found both were appearing
Nov 24, 2022 09:08 - Informed users of reindex (brief downtime expected)
Nov 24, 2022 11:09 - Reindex complete, service restored

Root Cause(s)

Doctors reported as not showing in the search list
ElasticSearch Index for Recommendation Service isn’t being updated
Large backlog of messages stuck on a queue for updating the index
Message consumers had disappeared from the queue. After the final aws ecs update-service --force-new-deployment, the consumer count dropped to one before going back up to 3
This may have been related to load & thread starvation.

Action Items

Action Items	Comments	Owner
Monitoring for queue depth, consumption or some other combined metric to say whether messages are being processed ‘acceptably’, e.g. Number of consumers, message rate while there are messages, message depth for a period of time / latency		@Cai Willis , @Joseph (Pepe) Kelly https://hee-tis.atlassian.net/browse/TIS21-3916
Could comparing ElasticSearch & DocumentDB be used to validate that all doctors are saved as expected?	Probably not enough value in doing this?
Test / Replicate in stage… by loading c.100K messages onto the queue that was affected to see if this was likely a cause of the defect		@Yafang Deng https://hee-tis.atlassian.net/browse/TIS21-3880
Compare the composite doctor view and recommendation view indices to check the potential scale of the problem		@Cai Willis
Can we get extra application insights / metrics about thread usage & other resources		@Joseph (Pepe) Kelly
Consider a “drop & rebuild” of the index?	Will consider as part of https://hee-tis.atlassian.net/browse/TIS21-3416	
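
The index comparison proposed in the action items could be approached roughly as below. This is a minimal sketch, not the team's actual tooling: it assumes the doctor identifiers from masterdoctorindex and the recommendation index have already been exported as plain lists, and the function name and the GMC numbers in the usage lines are hypothetical.

```python
from typing import Dict, Iterable, List


def compare_indices(
    master_ids: Iterable[str], recommendation_ids: Iterable[str]
) -> Dict[str, List[str]]:
    """Report identifiers present in one index but not the other."""
    master = set(master_ids)
    recommendation = set(recommendation_ids)
    return {
        # Doctors the recommendation list should show but does not.
        "missing_from_recommendation": sorted(master - recommendation),
        # Doctors in the recommendation list with no source record.
        "unexpected_in_recommendation": sorted(recommendation - master),
    }


# Hypothetical identifiers, for illustration only.
report = compare_indices(
    ["1000001", "1000002", "1000003"],
    ["1000002", "1000003", "1000004"],
)
print(report["missing_from_recommendation"])   # ['1000001']
print(report["unexpected_in_recommendation"])  # ['1000004']
```

The size of the two lists gives a quick estimate of the scale of the problem before deciding whether a full reindex is needed.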

Lessons Learned

We need more monitoring of our RabbitMQ queues and their activity
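
As an illustration of the kind of check this monitoring might run, the sketch below evaluates one queue's statistics against the symptoms seen in this incident: no consumers, a deep backlog, and messages waiting without being acknowledged. It assumes the statistics arrive as a dictionary shaped like the per-queue JSON from the RabbitMQ management API (fields messages, consumers and message_stats.ack_details.rate). The function name and the max_depth threshold are hypothetical and would need tuning.

```python
from typing import Any, Dict, List


def queue_problems(stats: Dict[str, Any], max_depth: int = 1000) -> List[str]:
    """Return human-readable problems found for a single queue."""
    problems: List[str] = []
    depth = stats.get("messages", 0)
    consumers = stats.get("consumers", 0)
    # Acknowledgement rate in messages per second; absent if nothing was acked.
    ack_rate = (
        stats.get("message_stats", {}).get("ack_details", {}).get("rate", 0.0)
    )

    if consumers == 0:
        problems.append("no consumers")
    if depth > max_depth:
        problems.append(f"depth {depth} exceeds {max_depth}")
    if depth > 0 and ack_rate == 0:
        problems.append("messages waiting but none acknowledged")
    return problems


# Roughly the state of the affected queue at 08:07 on Nov 23.
print(queue_problems({"messages": 83000, "consumers": 0}))
```

An empty list means the queue looks healthy under these assumptions; any non-empty result could be raised as an alert.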

2022-11-23 Recommendations List missing some doctors