...
All times in BST unless otherwise indicated
: 23:50 - RabbitMQ instability begins.
: 02:10 - Exceptions in services
: 16:15 - RabbitMQ instability ends. Some services (MySQL CDC, ESR Reconciliation), which would have added message load, were paused while messages in the broker were consumed and/or cleared out.
: 10:30 - Missing CoJ in TIS reported.
: 16:24 - Audit of CoJ messages completed and missing CoJ resent to TIS.
...
Action Items | Owner |
---|---|
CoJ audit to identify data discrepancies | DONE |
Manual patch to restore data integrity | DONE |
Improve TIS Self-Service messaging code to detect failures | IN PROGRESS |
Add monitoring for idle queues (where messages are available, but not being consumed or without a listener) | TODO |
Add monitoring for Rabbit broker health | TODO |
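
As an illustration of the idle-queue monitoring item, the following is a minimal sketch, not the team's implementation. It assumes the RabbitMQ management plugin is enabled and that its `/api/queues` endpoint reports `messages_ready`, `consumers` and `message_stats.deliver_get_details.rate` for each queue. The function names, the reason strings, and the base URL and credentials passed to `fetch_queues` are hypothetical.

```python
import base64
import json
import urllib.request


def find_idle_queues(queues):
    """Return (queue name, reason) pairs for queues that hold ready
    messages which nothing appears to be consuming."""
    idle = []
    for queue in queues:
        if queue.get("messages_ready", 0) == 0:
            continue  # nothing waiting, so the queue is not stuck
        if queue.get("consumers", 0) == 0:
            idle.append((queue["name"], "no consumers"))
            continue
        # message_stats can be absent for queues with no recent activity.
        stats = queue.get("message_stats", {})
        rate = stats.get("deliver_get_details", {}).get("rate", 0.0)
        if rate == 0:
            idle.append((queue["name"], "consumers attached but not delivering"))
    return idle


def fetch_queues(base_url, user, password):
    """Fetch queue details from the management API (assumed to be reachable
    at base_url, e.g. the plugin's default port 15672)."""
    request = urllib.request.Request(f"{base_url}/api/queues")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

A scheduled job could call `find_idle_queues(fetch_queues(...))` and raise an alert when the result is non-empty. A single snapshot with a zero delivery rate may be transient, so alerting only after several consecutive idle readings is likely to be less noisy.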
...
Lessons Learned
We need to check and handle errors from ‘infrastructure’ more carefully.
The lack of alerting on failures and/or automated data consistency checks meant we were not aware of the problem until notified by users, which is poor.
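
An automated consistency check of the kind mentioned above could, in its simplest form, compare the identifiers of records sent by the source system with those present in the receiving system. The sketch below is only illustrative: it assumes each CoJ has a unique identifier available on both sides, and the function and parameter names are hypothetical.

```python
def find_missing_records(sent_ids, received_ids):
    """Return the identifiers that were sent but never arrived,
    sorted so that reports are stable between runs."""
    return sorted(set(sent_ids) - set(received_ids))


def check_consistency(sent_ids, received_ids):
    """Return a short status line suitable for a scheduled job to log
    or to forward to an alerting channel."""
    missing = find_missing_records(sent_ids, received_ids)
    if missing:
        return f"ALERT: {len(missing)} record(s) missing: {', '.join(missing)}"
    return "OK: no missing records"
```

Run on a schedule, a check like this would surface missing records without relying on users to report them, and the list of missing identifiers could drive a resend.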