| | |
|---|---|
| Date | |
| Authors | |
| Status | Production no longer impacted (Done) |
| Summary | |
| Impact | The recommendations search page was not being updated for several hours during the day. |
...
GMC doctors who should have been under notice were not showing up in the “Under Notice” list.
...
Detection

Messages from two users
...
All times are BST unless otherwise stated.
02:26 - Earliest identifiable point of “something going wrong”; the trigger is still unknown
02:26 to 08:07 - Backlog on the queue of recommendation ‘doctor view’ updates built steadily to ~83K messages
08:21 - First report in user channel
12:07 - Picked up for investigation
12:07 to 14:00ish - Checked database & ElasticSearch index
13:00 - Checked the list returned from the GMC for the North West
14:00ish - Found that messages on `reval.queue.masterdoctorview.updated.recommendation` were not being consumed
14:14ish - Having found that “recommendation status checks” from overlapping runs were being processed, we reduced the frequency of checking for updates on submitted recommendations
16:00ish - Forced a restart of the recommendation service
16:00ish - Identified that RabbitMQ was reporting 0 consumers
17:00ish to 18:00ish - Identified an error in the queue declaration. Raised, merged and pushed a PR
18:00ish - Noticed that messages are briefly consumed on startup but number of consumers quickly drops to 0
18:40ish - Final redeploy of recommendation service cleared out backlog and appeared to restore consumers stably
18:40ish - Identified that there were still some discrepancies in the data between the masterdoctorindex and the recommendation index; decided to wait until after the overnight doctor sync to do a quick reindex
~07:30 (next day) - Checked for the doctors that had been mentioned and found both were appearing
09:08 - Informed users of reindex (brief downtime expected)
11:09 - Reindex complete, service restored
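The final reindex step can be driven through Elasticsearch’s `_reindex` API. A minimal sketch of building that request body, assuming `masterdoctorview` and `recommendation` as the source and destination index names (the real names, and any query filter limiting which documents are copied, may differ):

```python
def build_reindex_body(source_index: str, dest_index: str) -> dict:
    """Build the request body for a POST to Elasticsearch's /_reindex endpoint."""
    return {
        "source": {"index": source_index},
        "dest": {"index": dest_index},
    }

# Hypothetical index names for illustration only.
body = build_reindex_body("masterdoctorview", "recommendation")
print(body)
```

Sending this body with `wait_for_completion=false` returns a task ID that can be polled, which suits a reindex large enough to cause the brief downtime noted above.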
...
Doctors reported as not showing in the search list
The ElasticSearch index for the Recommendation Service was not being updated
Large backlog of messages stuck on a queue for updating the index
Message consumers disappeared. After the final

`aws ecs update-service --force-new-deployment`

the consumer count dropped to one before going back up to three (unconfirmed). This may have been related to load and thread starvation.
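One of the follow-ups below is to compare the composite doctor view and recommendation indices to gauge the scale of the problem. A sketch of that comparison, assuming each index can be reduced to a set of GMC reference numbers (fetching those sets out of Elasticsearch is elided):

```python
def compare_indices(master_ids: set, recommendation_ids: set) -> dict:
    """Return doctors present in one index but missing from the other."""
    return {
        "missing_from_recommendation": sorted(master_ids - recommendation_ids),
        "unexpected_in_recommendation": sorted(recommendation_ids - master_ids),
    }

# Toy GMC reference numbers for illustration.
diff = compare_indices(
    {"1234567", "2345678", "3456789"},
    {"1234567", "2345678"},
)
print(diff["missing_from_recommendation"])  # ['3456789']
```

A non-empty `missing_from_recommendation` list would correspond to doctors dropped from the “Under Notice” view, as seen during this incident.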
...
Action Items
| Action Item | Comments | Owner |
|---|---|---|
| Monitoring for queue depth, consumption or some other combined metric to say whether messages are being processed ‘acceptably’, e.g. number of consumers, message rate while there are messages, message depth for a period of time / latency | | Cai Willis, Joseph (Pepe) Kelly |
| Test / replicate in stage by loading c.100K messages onto the queue that was affected, to see if this was likely a cause of the defect | | |
| Compare the composite doctor view and recommendation indexes/indices to check the potential scale of the problem | | |
| Can we get extra application insights / metrics about thread usage & other resources? | | |
| Consider a “drop & rebuild” of the index? | Will consider as part of | |
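The first action item (a combined “are messages being processed acceptably” signal) could be expressed as a simple predicate over metrics RabbitMQ’s management API already exposes. A sketch, with illustrative thresholds that would need tuning against real traffic:

```python
def queue_healthy(consumers: int, depth: int, ack_rate: float,
                  max_idle_depth: int = 1000) -> bool:
    """Heuristic health check: a queue is unhealthy if a backlog exists with
    nobody consuming, or if a large backlog shows no acknowledgement progress."""
    if consumers == 0 and depth > 0:
        return False  # backlog with zero consumers: the failure mode in this incident
    if depth > max_idle_depth and ack_rate == 0:
        return False  # large backlog and no messages being acknowledged
    return True

# The state observed during the incident: ~83K messages queued, zero consumers.
print(queue_healthy(consumers=0, depth=83_000, ack_rate=0.0))  # False
```

Alerting on this predicate (rather than on queue depth alone) would have flagged the problem at roughly 02:26 instead of relying on user reports at 08:21.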
...
Lessons Learned
We need more monitoring of our RabbitMQ queues and consumer activity.