Date

Authors

Cai Willis Yafang Deng Jayanta Saha Joseph (Pepe) Kelly

Status

Production no longer impacted

Summary

TIS21-3846

Impact

The recommendations search page was not updated for most of the day (roughly 02:26 to 18:40 BST). Some discrepancies remained until a reindex the following morning.

Non-technical Description

The list on the revalidation recommendation doctors page is maintained in a searchable index, which aggregates data from updates to TIS and from the GMC data we store ourselves. The mechanism for keeping this aggregated data up to date had become blocked, so no updates were reaching the list. The details pages for each doctor were unaffected and recommendations could still be made without issue.


Trigger

  • GMC doctors that should have been under notice were not showing up


Detection

  • Messages from 2 users


Resolution

  • A “force redeployment” update of the recommendation service

  • Gave users direct links to the details pages of doctors who could not be found in the list
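
The “force redeployment” step can also be scripted. The following is a minimal sketch, assuming the service runs on ECS and that a boto3 ECS client is passed in; the cluster and service names in the usage note are hypothetical and not taken from this incident.

```python
def force_redeploy(ecs_client, cluster, service):
    """Ask ECS to start fresh tasks for a service without changing its definition.

    `ecs_client` is assumed to be a boto3 ECS client, e.g. boto3.client("ecs").
    Returns (status, desiredCount, runningCount) for each deployment reported.
    """
    response = ecs_client.update_service(
        cluster=cluster,
        service=service,
        forceNewDeployment=True,  # same effect as --force-new-deployment in the CLI
    )
    return [
        (d.get("status"), d.get("desiredCount"), d.get("runningCount"))
        for d in response["service"]["deployments"]
    ]
```

Usage might look like `force_redeploy(boto3.client("ecs"), "example-cluster", "example-recommendation-service")`. A redeployment only restarts the tasks, so it is worth confirming afterwards that the queue's consumer count has recovered and stays stable.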


Timeline

BST unless otherwise stated

  • 02:26 - Earliest identifiable point of “something going wrong” (cause still unknown)

  • 02:26 to 08:07 - The queue of ‘doctor view’ updates for the recommendation service built steadily to ~83K messages

  • 08:21 - First report in user channel

  • 12:07 - Picked up for investigation

  • 12:07 to 14:00ish - Checked the database & Elasticsearch index

  • 13:00 - Checked the list returned by GMC for North West

  • 14:00ish - Found that messages in reval.queue.masterdoctorview.updated.recommendation were not being consumed

  • 16:00ish - Forced a new start of the recommendation service

  • 16:00ish - Identified that RabbitMQ was reporting 0 consumers

  • 17:00ish to 18:00ish - Identified error in queue declaration. Raised, merged and pushed a PR

  • 18:00ish - Noticed that messages were briefly consumed on startup but the number of consumers quickly dropped to 0

  • 18:40ish - Final redeploy of recommendation service cleared out backlog and appeared to restore consumers stably

  • 18:40ish - Identified that there were still some discrepancies in the data between masterdoctorindex and the recommendation index; decided to wait until after the overnight doctor sync to do a quick reindex

  • 09:08 (following day) - Informed users of reindex (brief downtime expected)

  • 11:09 - Reindex complete, service restored


Root Cause(s)

  • Doctors reported as not showing in the search list

  • Elasticsearch index for the recommendation service was not being updated

  • Large backlog of messages stuck on a queue for updating the index

  • Message consumers disappeared from the queue. After the final aws ecs update-service --force-new-deployment, the consumer count dropped to one before going back up to 3

  • ?


Action Items

Action Items | Comments | Owner

Add monitoring of queue depth, consumption rate, or some other combined metric to indicate whether messages are being processed ‘acceptably’. | |
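
One possible shape for such a check is sketched below. It assumes the RabbitMQ management plugin's HTTP API is reachable and reads the `messages` and `consumers` fields it returns for a queue. The thresholds, base URL and credentials are illustrative assumptions, not agreed values.

```python
import base64
import json
import urllib.parse
import urllib.request


def fetch_queue_stats(base_url, vhost, queue, user, password):
    """Fetch one queue's stats from the RabbitMQ management HTTP API."""
    url = "{}/api/queues/{}/{}".format(
        base_url.rstrip("/"),
        urllib.parse.quote(vhost, safe=""),  # the default vhost "/" becomes %2F
        urllib.parse.quote(queue, safe=""),
    )
    token = base64.b64encode("{}:{}".format(user, password).encode()).decode()
    request = urllib.request.Request(url, headers={"Authorization": "Basic " + token})
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def assess_queue(stats, max_depth=1000, min_consumers=1):
    """Return a list of problems; an empty list means the queue looks healthy.

    max_depth and min_consumers are hypothetical thresholds for illustration.
    """
    problems = []
    depth = stats.get("messages", 0)
    consumers = stats.get("consumers", 0)
    if consumers < min_consumers:
        problems.append(
            "consumers={} (expected at least {})".format(consumers, min_consumers)
        )
    if depth > max_depth:
        problems.append("depth={} (threshold {})".format(depth, max_depth))
    return problems
```

For example, `assess_queue({"messages": 83000, "consumers": 0})` reports both a missing-consumer problem and a depth problem. Checking the consumer count as well as depth would flag the "0 consumers" state seen in this incident even before a large backlog builds up.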


Lessons Learned
