Date

Authors

Cai Willis, Yafang Deng, Jayanta Saha, Joseph (Pepe) Kelly

Status

Production no longer impacted

Summary

Jira ticket: TIS21-3846

Impact

The recommendations search page was not updated for several hours during the day.

...

All times are BST unless otherwise stated.

  • 02:26 - Earliest identifiable point of “something going wrong” (still unknown)

  • 02:26 to 08:07 - Queue of ‘doctor view’ update messages for the recommendation service built steadily to ~83K

  • 08:21 - First report in user channel

  • 12:07 - Picked up for investigation

  • 12:07 to 14:00ish - Checked database & ElasticSearch index

  • 13:00 - Checked the list returned by GMC for North West

  • 14:00ish - Found that messages in reval.queue.masterdoctorview.updated.recommendation were not being consumed

  • 16:00ish - Forced a restart of the recommendation service

  • 16:00ish - Identified that RabbitMQ was reporting 0 consumers

  • 17:00ish to 18:00ish - Identified an error in the queue declaration. Raised, merged and pushed a PR

  • 18:00ish - Noticed that messages were briefly consumed on startup but the number of consumers quickly dropped to 0

  • 18:40ish - Final redeploy of the recommendation service cleared out the backlog and appeared to restore consumers stably

  • 18:40ish - Identified that there were still some discrepancies in the data between masterdoctorindex and the recommendation index; decided to wait until after the overnight doctor sync to do a quick reindex

  • 09:08 - Informed users of reindex (brief downtime expected)

  • 11:09 - Reindex complete, service restored
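
The backlog and consumer count that were checked by hand during the investigation can also be read from the RabbitMQ management HTTP API. The following is a minimal sketch, not the procedure used during the incident: it assumes the management plugin is enabled, and the base URL, vhost and credentials are placeholders. The helper names `queue_api_url` and `fetch_queue_summary` are hypothetical.

```python
import base64
import json
import urllib.parse
import urllib.request


def queue_api_url(base_url: str, vhost: str, queue: str) -> str:
    """Build the management API URL for a single queue.

    The vhost and queue name are percent-encoded, so the default
    vhost "/" becomes "%2F".
    """
    return "{}/api/queues/{}/{}".format(
        base_url.rstrip("/"),
        urllib.parse.quote(vhost, safe=""),
        urllib.parse.quote(queue, safe=""),
    )


def fetch_queue_summary(base_url, vhost, queue, user, password, timeout=10):
    """Return (messages, consumers) for a queue. Needs network access."""
    request = urllib.request.Request(queue_api_url(base_url, vhost, queue))
    # HTTP basic auth; the credentials are placeholders supplied by the caller.
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # "messages" is the total queue depth, "consumers" the attached consumers.
    return body.get("messages", 0), body.get("consumers", 0)
```

A result with a large `messages` value and `consumers` equal to 0 would match the state found at around 16:00.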

...

  • Doctors reported as not showing in the search list

  • ElasticSearch Index for Recommendation Service isn’t being updated

  • Large backlog of messages stuck on a queue for updating the index

  • Message consumers disappeared; after the final aws ecs update-service --force-new-deployment, the consumer count dropped to one before going back up to 3

  • ?

...

Action Items

Action Items

Comments

Owner

Add monitoring for queue depth, consumption rate, or some other combined metric that indicates whether messages are being processed ‘acceptably’.
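
One possible shape for such a combined metric is sketched below. It is an illustration under stated assumptions, not an agreed design: the `QueueSample` fields are meant to correspond to the `messages`, `consumers` and `message_stats.ack_details.rate` values reported by the RabbitMQ management API, and the `max_depth` threshold is a hypothetical value that would need tuning.

```python
from dataclasses import dataclass


@dataclass
class QueueSample:
    """One observation of a queue (field names are illustrative)."""
    messages: int     # total messages on the queue
    consumers: int    # consumers currently attached
    ack_rate: float   # messages acknowledged per second


def is_processing_acceptably(previous, current, max_depth=1000):
    """Compare two consecutive samples and return (ok, reason).

    max_depth is a hypothetical threshold, not a value from the incident.
    """
    if current.messages == 0:
        return True, "queue empty"
    if current.consumers == 0:
        return False, "backlog with no consumers"
    if current.ack_rate == 0 and current.messages >= previous.messages:
        return False, "consumers attached but nothing acknowledged"
    if current.messages > max_depth and current.messages > previous.messages:
        return False, "backlog above threshold and still growing"
    return True, "draining or within threshold"
```

Comparing consecutive samples rather than alerting on depth alone means a large backlog that is draining, as it was after the final redeploy, would not raise an alert, while a growing backlog with 0 consumers would.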

...