...

All times are in BST unless otherwise indicated.

  • : 23:50 - RabbitMQ instability begins.

  • : 02:10 - Exceptions in services

  • : 16:15 - RabbitMQ instability ends. Services that would have added message load (MySQL CDC, ESR Reconciliation) were paused while messages in the broker were consumed or cleared out.

  • : 10:30 - Missing CoJ in TIS reported.

  • : 16:24 - Audit of CoJ messages completed and missing CoJ resent to TIS.

...

Action Items

Owner

CoJ audit to identify data discrepancies

Reuben Roberts

DONE: TIS21-4923

Manual patch to restore data integrity

Reuben Roberts

DONE: TIS21-4924

Improve TIS Self-Service messaging code to detect failures

Reuben Roberts

IN PROGRESS: TIS21-4925

Add monitoring for idle queues (where messages are available, but not being consumed or without a listener)

Joseph (Pepe) Kelly

TODO

Add monitoring for Rabbit broker health

Joseph (Pepe) Kelly

TODO
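
The idle-queue monitoring item above could start from a check like the following. This is a minimal sketch, not the team's implementation. It assumes queue statistics have already been fetched and parsed into dictionaries shaped like the queue objects returned by the RabbitMQ management API (fields `name`, `messages_ready`, `consumers` and `message_stats.deliver_get_details.rate`). The function name `find_idle_queues` and the queue names in the snapshot are hypothetical.

```python
from typing import Any


def find_idle_queues(queues: list[dict[str, Any]]) -> list[tuple[str, str]]:
    """Return (queue name, reason) for queues holding messages that nothing is consuming."""
    idle = []
    for q in queues:
        ready = q.get("messages_ready", 0)
        if ready <= 0:
            continue  # nothing waiting, so the queue is not idle in this sense
        if q.get("consumers", 0) == 0:
            idle.append((q["name"], "no listener"))
            continue
        # A listener is attached: treat a reported delivery rate of zero as
        # "not being consumed". If no rate is reported, the queue is not flagged.
        rate = q.get("message_stats", {}).get("deliver_get_details", {}).get("rate")
        if rate == 0:
            idle.append((q["name"], "listener attached but not consuming"))
    return idle


# Hypothetical snapshot; the queue names are illustrative only.
snapshot = [
    {"name": "example.coj", "messages_ready": 12, "consumers": 0},
    {"name": "example.sync", "messages_ready": 5, "consumers": 1,
     "message_stats": {"deliver_get_details": {"rate": 0.0}}},
    {"name": "example.healthy", "messages_ready": 3, "consumers": 2,
     "message_stats": {"deliver_get_details": {"rate": 4.2}}},
    {"name": "example.empty", "messages_ready": 0, "consumers": 0},
]
idle_queues = find_idle_queues(snapshot)
```

A single snapshot with a zero rate can be a false positive for a low-traffic queue, so an alert would probably want the condition to hold across several consecutive checks.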

...

Lessons Learned

  • We need to check for and handle errors from ‘infrastructure’ more carefully.

  • Without alerting on failures or automated data consistency checks, we were not aware of the problem until users notified us, which is poor.
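
An automated consistency check of the kind mentioned above can be as simple as comparing the identifiers a sender believes it published with those the receiver actually stored. The sketch below illustrates the idea only. The function name `audit_delivery` and the identifiers are hypothetical, and it assumes both sides can export a list of record IDs for the period being audited.

```python
from typing import Iterable


def audit_delivery(sent_ids: Iterable[str], received_ids: Iterable[str]) -> dict[str, list[str]]:
    """Compare IDs recorded by the sender with IDs present at the receiver."""
    sent = set(sent_ids)
    received = set(received_ids)
    return {
        # Sent but never stored downstream: candidates for resending.
        "missing": sorted(sent - received),
        # Stored downstream with no matching send record: worth investigating.
        "unexpected": sorted(received - sent),
    }


# Hypothetical identifiers for illustration.
report = audit_delivery(
    sent_ids=["coj-1", "coj-2", "coj-3"],
    received_ids=["coj-1", "coj-3", "coj-9"],
)
```

Run on a schedule, a non-empty `missing` list could raise an alert before users notice the gap.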