...

  • 21:23 Recommendations service reports failure to connect to RabbitMQ

  • 09:41 Issue identified on revalidation by developer

  • ~10:00 Restarted TCS, recreated queues

  • 10:42 Revalidation doctors list restored, users notified

5 Whys (or other analysis of Root Cause)

Why was the doctors list not updated? Because the overnight sync failed to run.
Why did the overnight sync fail to run? Because the message to start the sync job was “stuck”.
Why was the message to start the sync job stuck? Because there was not enough available memory in the RabbitMQ cluster and no new messages were being consumed.
Why was there not enough memory in the RabbitMQ cluster? Because the neo-audit queue does not get processed quickly enough (1,000,000+ messages in the broker at all times). Other reasons?
Why does the neo-audit queue not get processed quickly enough? Because it is running on old infrastructure and only has a couple of very slow instances consuming it.
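
One way to catch this kind of backlog before it starves the broker is a periodic queue-depth check. The sketch below is illustrative only and is not something that exists in our tooling: it assumes queue statistics shaped like the entries returned by the RabbitMQ management API's queue listing (`name`, `messages`, `consumers`), and the thresholds, the function name and the `sync-start` queue name are hypothetical.

```python
def find_backlogged_queues(queues, max_messages=100_000, min_consumers=1):
    """Return (queue name, reason) pairs for queues that look unhealthy.

    `queues` is a list of dicts with "name", "messages" and "consumers" keys.
    Both thresholds are assumed values and would need tuning per queue.
    """
    flagged = []
    for queue in queues:
        name = queue.get("name", "<unnamed>")
        depth = queue.get("messages", 0)
        consumers = queue.get("consumers", 0)
        if depth > max_messages:
            # A persistently deep queue holds broker memory (the neo-audit case).
            flagged.append((name, f"{depth} messages exceeds limit of {max_messages}"))
        elif depth > 0 and consumers < min_consumers:
            # Messages are waiting but nothing is consuming them (a "stuck" message).
            flagged.append((name, f"{depth} messages waiting with {consumers} consumers"))
    return flagged


# Hypothetical snapshot loosely modelled on this incident.
sample = [
    {"name": "neo-audit", "messages": 1_000_000, "consumers": 2},
    {"name": "sync-start", "messages": 1, "consumers": 0},
    {"name": "healthy", "messages": 10, "consumers": 3},
]

for name, reason in find_backlogged_queues(sample):
    print(f"ALERT {name}: {reason}")
```

Run on a schedule against live statistics, a check like this would have flagged both the neo-audit backlog and the unconsumed sync message well before the 09:41 discovery.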

...

Action Items

Action Items          Owner       Comments
Complete this ticket  Cai Willis

...

Lessons Learned

  • Tech improvement tickets are important