Date |
|
Authors | |
Status | Done |
Summary | Some CoJs signed in TIS Self-Service between the evening of 23rd July and the morning of 24th July were not correctly loaded into TIS, and hence not visible to Local Offices. |
Impact | Local Offices saw some trainees as not having signed their CoJs when they had in fact signed them. |
...
All times in BST unless otherwise indicated
: 23:50 - RabbitMQ instability begins within the weekly maintenance window (10PM - Midnight UTC / 11PM - 1AM BST).
: 02:10 - Exceptions in services
: 16:15 - RabbitMQ instability ends. Some services (MySQL CDC, ESR Reconciliation) which would have added message load were paused while messages in the broker were consumed and/or cleared out.
: 10:30 - Missing CoJ in TIS reported.
: 16:24 - Audit of CoJ messages completed and missing CoJ resent to TIS.
...
Action Items | Owner |
---|---|
CoJ audit to identify data discrepancies | DONE |
Manual patch to restore data integrity | DONE |
Improve TIS Self-Service messaging code to detect failures | IN PROGRESS |
Add monitoring for idle queues (where messages are available, but not being consumed or without a listener) | TODO |
Add monitoring for Rabbit broker health | |
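
The idle-queue monitoring item above could be approached roughly as follows. This is a minimal sketch, not the team's implementation: it assumes queue statistics have already been fetched as a list of dictionaries shaped like the queue objects returned by the RabbitMQ management API (with `name`, `messages_ready` and `consumers` fields). The function name `find_idle_queues` and the queue names in the usage example are hypothetical. It only covers the "without a listener" case; spotting queues whose listeners have stalled would need a comparison across successive samples.

```python
from typing import Any, Iterable, List, Mapping


def find_idle_queues(
    queues: Iterable[Mapping[str, Any]], min_ready: int = 1
) -> List[str]:
    """Return the names of queues that hold ready messages but have no consumers.

    Each item is assumed to look like a queue object from the RabbitMQ
    management API, e.g. {"name": ..., "messages_ready": ..., "consumers": ...}.
    """
    idle = []
    for queue in queues:
        # Missing or null counters are treated as zero.
        ready = queue.get("messages_ready") or 0
        consumers = queue.get("consumers") or 0
        if ready >= min_ready and consumers == 0:
            idle.append(queue["name"])
    return sorted(idle)


# Usage with made-up sample data (queue names are hypothetical).
sample = [
    {"name": "coj.signed", "messages_ready": 12, "consumers": 0},
    {"name": "form.updated", "messages_ready": 4, "consumers": 2},
    {"name": "empty.queue", "messages_ready": 0, "consumers": 0},
]
for name in find_idle_queues(sample):
    print(f"ALERT: queue {name} has messages waiting but no consumer")
```

A check like this would typically run on a schedule and feed an existing alerting channel, so that a backlog with no listener is noticed before users report missing data.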
...
Lessons Learned
We need to check and handle errors from ‘infrastructure’ more carefully.
The lack of alerting on failures and/or automated data consistency checks meant we were not aware of the problem until notified by users, which is poor.
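
An automated data consistency check of the kind mentioned above can be very small in principle. The sketch below is an illustration under stated assumptions, not a description of how the CoJ audit was actually performed: it assumes each signed CoJ can be reduced to a comparable identifier on both sides, and the function name `find_missing_cojs` and the identifiers in the example are hypothetical.

```python
from typing import Iterable, Set


def find_missing_cojs(
    signed_in_self_service: Iterable[str], loaded_in_tis: Iterable[str]
) -> Set[str]:
    """Return identifiers signed in Self-Service that are absent from TIS."""
    return set(signed_in_self_service) - set(loaded_in_tis)


# Usage with made-up identifiers.
signed = ["trainee-1", "trainee-2", "trainee-3"]
loaded = ["trainee-1", "trainee-3"]
missing = find_missing_cojs(signed, loaded)
if missing:
    print(f"{len(missing)} signed CoJ(s) not loaded into TIS: {sorted(missing)}")
```

Running a comparison like this periodically, and alerting on a non-empty result, would surface discrepancies without relying on user reports.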