
Date

Authors

Cai Willis Yafang Deng Jayanta Saha Joseph (Pepe) Kelly

Status

Production no longer impacted

Summary

TIS21-3846

Impact

The recommendations search page was not updated for a number of hours during the day.

Non-technical Description

The list on the revalidation recommendation doctors page is maintained in a searchable index by aggregating data from updates to TIS and GMC data we store ourselves. The mechanism for keeping this aggregated data up to date had become blocked, so no updates were reaching the list. The details pages for each doctor were unaffected, and recommendations could still be made without issue.


Trigger

  • GMC doctors that should have been under notice were not showing up in the “Under Notice” list.


Detection

  • Message from 2 users


Resolution

  • A “force redeployment” update of the recommendation service

  • Gave users links to details pages for doctors which could not be found


Timeline

BST unless otherwise stated

  • 02:26 - Earliest identifiable point of “something going wrong” (what went wrong is still unknown)

  • 02:26 to 08:07 - The queue of ‘doctor view’ updates to the recommendation service built steadily to ~83K messages

  • 08:21 - First report in user channel

  • 12:07 - Picked up for investigation

  • 12:07 to 14:00ish - Checked database & ElasticSearch index

  • 13:00 - Checked the list returned by GMC for North West

  • 14:00ish - Found that messages in reval.queue.masterdoctorview.updated.recommendation were not being consumed

  • 14:14ish - Having found that there were “recommendation status checks” being processed from overlapping runs, we reduced the frequency of checking for updates on submitted recommendations

  • 16:00ish - Forced a new start of the recommendation service

  • 16:00ish - Identified that RabbitMQ was reporting 0 consumers

  • 17:00ish to 18:00ish - Identified an error in the queue declaration. Raised, merged and pushed a PR

  • 18:00ish - Noticed that messages were briefly consumed on startup but the number of consumers quickly dropped to 0

  • 18:40ish - Final redeploy of recommendation service cleared out backlog and appeared to restore consumers stably

  • 18:40ish - Identified that there were still some discrepancies in the data between masterdoctorindex and the recommendation index. Decided to wait until after the overnight doctor sync to do a quick reindex

  • ~07:30 (following day) - Checked for the doctors that had been mentioned and found both were appearing

  • 09:08 - Informed users of reindex (brief downtime expected)

  • 11:09 - Reindex complete, service restored


Root Cause(s)

  • Doctors reported as not showing in the search list

  • ElasticSearch index for the recommendation service was not being updated

  • Large backlog of messages stuck on a queue for updating the index

  • Message consumers disappeared. After the final aws ecs update-service --force-new-deployment, the consumer count dropped to one before going back up to 3

  • This may have been related to load & thread starvation.
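The forced redeployment named above can be scripted so that it is repeatable during an incident. The following is a minimal sketch, assuming the AWS CLI is installed and already authenticated; the helper names and the cluster and service values in the usage note are hypothetical placeholders, not the real resource names.

```python
import subprocess
from typing import List


def force_redeploy_command(cluster: str, service: str) -> List[str]:
    """Build the AWS CLI argument list for a forced redeployment of one ECS service."""
    return [
        "aws", "ecs", "update-service",
        "--cluster", cluster,
        "--service", service,
        "--force-new-deployment",
    ]


def force_redeploy(cluster: str, service: str) -> None:
    """Run the command; raises CalledProcessError if the CLI exits non-zero."""
    subprocess.run(force_redeploy_command(cluster, service), check=True)
```

For example, force_redeploy("example-cluster", "example-recommendation-service") would replace the running tasks with fresh ones. Keeping the command construction separate from its execution lets the arguments be checked without touching AWS.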


Action Items

Action Items

Comments

Owner

Monitoring for queue depth, consumption or some other combined metric to say whether messages are being processed ‘acceptably’, e.g. Number of consumers, message rate while there are messages, message depth for a period of time / latency

Cai Willis , Joseph (Pepe) Kelly

TIS21-3916

Could comparing ElasticSearch & DocumentDB be used to validate that doctors are all saved as expected?

Probably not enough value in doing this?

Test / Replicate in stage… by loading c.100K messages onto the queue that was affected to see if this was likely a cause of the defect

Yafang Deng

TIS21-3880

Compare indexes/indices composite doctor view & recommendation view to check for the potential scale of the problem

Cai Willis

Can we get extra application insights / metrics about thread usage & other resources?

Joseph (Pepe) Kelly

Consider a “drop & rebuild” of the index?

Will consider as part of TIS21-3416
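The comparison of the composite doctor view and recommendation indexes listed among the action items could be sketched as a set difference over doctor identifiers. This is only an illustration: the function name, the result keys, and the sample identifiers are hypothetical, and it assumes the identifiers from each index have already been exported to plain lists, which is not shown here.

```python
from typing import Dict, Iterable, List, Set


def compare_indices(
    master_ids: Iterable[str],
    recommendation_ids: Iterable[str],
) -> Dict[str, List[str]]:
    """Report identifiers that appear in only one of the two indexes."""
    master: Set[str] = set(master_ids)
    recommendation: Set[str] = set(recommendation_ids)
    return {
        # Doctors the recommendation index never received.
        "missing_from_recommendation": sorted(master - recommendation),
        # Doctors present downstream with no counterpart in the source index.
        "unexpected_in_recommendation": sorted(recommendation - master),
    }


if __name__ == "__main__":
    # Made-up identifiers, purely for illustration.
    report = compare_indices(["1000001", "1000002", "1000003"], ["1000002", "1000004"])
    for key, ids in report.items():
        print(key, len(ids), ids)
```

The sizes of the two lists give a rough measure of the scale of the discrepancy before deciding whether a full reindex is needed.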


Lessons Learned

  • We need more monitoring on our RabbitMQ queues and activity
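As one possible shape for that monitoring, the sketch below reads a queue's state from the RabbitMQ management HTTP API and flags the two symptoms seen in this incident: no consumers attached and a deep backlog. It assumes the management plugin is enabled; the function names, the default threshold of 1000 messages, and any URL or credentials passed in are hypothetical and would need tuning for the real environment.

```python
import base64
import json
import urllib.parse
import urllib.request
from typing import Any, Dict, List


def fetch_queue_stats(base_url: str, vhost: str, queue: str,
                      user: str, password: str) -> Dict[str, Any]:
    """Fetch one queue's details from the management API (/api/queues/{vhost}/{name})."""
    url = "{}/api/queues/{}/{}".format(
        base_url.rstrip("/"),
        urllib.parse.quote(vhost, safe=""),  # the default vhost "/" becomes "%2F"
        urllib.parse.quote(queue, safe=""),
    )
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(url, headers={"Authorization": "Basic " + token})
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def assess_queue(stats: Dict[str, Any], max_depth: int = 1000) -> List[str]:
    """Return human-readable problems; an empty list means the queue looks healthy."""
    problems: List[str] = []
    depth = stats.get("messages", 0)
    consumers = stats.get("consumers", 0)
    if consumers == 0:
        problems.append("no consumers attached")
    if depth > max_depth:
        problems.append(f"queue depth {depth} exceeds {max_depth}")
    return problems
```

Run on a schedule against reval.queue.masterdoctorview.updated.recommendation, a non-empty result from assess_queue could feed an alert. A fuller check would also look at message rates over a period of time, as suggested in the action items.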
