Date |
|
Authors | |
Status | Done |
Summary | Some CoJs signed in TIS Self-Service between the evening of 23rd July and the morning of 24th July were not correctly loaded into TIS, and hence not visible to Local Offices. |
Impact | Local Offices saw some trainees as not having signed their CoJs when they had in fact signed them. |
...
All times in BST unless indicated
: ~23:50 - RabbitMQ instability begins within weekly maintenance window (10PM - Midnight UTC / 11PM - 1AM BST).
: 02:10 - Exceptions in services
: ~16:15 - RabbitMQ instability ends. Some services (MySQL CDC, ESR Reconciliation) which would have added message load were paused while messages in the broker were consumed and/or cleared out.
: 10:30 - Missing CoJ in TIS reported.
: 16:24 - Audit of CoJ messages completed and missing CoJ resent to TIS.
Root Cause(s)
RabbitMQ became unstable due to resource limits being reached (memory, due to excessive messages being held and not processed).
TIS Self-Service code to submit messages to RabbitMQ did not check for successful processing.
CoJs were successfully saved within TIS Self-Service, but not received by TIS due to messaging failure.
The lack of alerting of the failures, or of the resulting data discrepancy, meant that we relied on user reports to become aware of the issue.
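The missing success check described above can be illustrated with a minimal sketch. It is not the TIS Self-Service code: `ConfirmingPublisher` and its `send` callable are hypothetical names, and `send` is assumed to wrap a broker client that returns True only once the broker has confirmed the message (RabbitMQ offers publisher confirms for this purpose). Messages that are never confirmed are kept so they can be resent and alerted on, rather than being lost silently.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ConfirmingPublisher:
    # Hypothetical wrapper around the real broker client. `send` is assumed
    # to return True only after the broker has confirmed the message.
    send: Callable[[dict], bool]
    max_attempts: int = 3
    # Messages that were never confirmed, kept for resend and alerting.
    unconfirmed: List[dict] = field(default_factory=list)

    def publish(self, message: dict) -> bool:
        """Try to publish; return True only if the broker confirmed."""
        for _ in range(self.max_attempts):
            try:
                if self.send(message):
                    return True
            except (ConnectionError, TimeoutError):
                # A transport error counts as an unconfirmed attempt.
                pass
        # Every attempt failed: record the message instead of dropping it.
        self.unconfirmed.append(message)
        return False
```

A caller would treat a False result as a failure to surface (for example by raising an alert or scheduling a resend), not as a successful submission.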
...
Action Items | Owner |
---|---|
CoJ audit to identify data discrepancies | DONE |
Manual patch to restore data integrity | DONE |
Improve TIS Self-Service messaging code to detect failures | IN PROGRESS |
Add monitoring for idle queues (where messages are available, but not being consumed or without a listener) | TODO |
Add monitoring for Rabbit broker health | |
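The idle-queue check in the action items could look roughly like the sketch below. It assumes queue statistics have already been fetched as a list of dictionaries. The `name`, `messages` and `consumers` keys follow the queue objects returned by the RabbitMQ management API, while `ack_rate` is a hypothetical, pre-flattened field standing in for the acknowledgement rate. The function name and queue names are illustrative only.

```python
from typing import Iterable, List, Mapping


def find_idle_queues(queues: Iterable[Mapping], min_messages: int = 1) -> List[str]:
    """Return names of queues holding messages that nobody is consuming."""
    idle = []
    for q in queues:
        backlog = q.get("messages", 0)
        if backlog < min_messages:
            continue  # Nothing waiting, so the queue is not a concern.
        no_listener = q.get("consumers", 0) == 0
        # `ack_rate` is an assumed field: acknowledgements per second.
        not_draining = q.get("ack_rate", 0.0) == 0.0
        if no_listener or not_draining:
            idle.append(q["name"])
    return idle


# Hypothetical sample data for illustration.
sample = [
    {"name": "coj.signed", "messages": 42, "consumers": 0, "ack_rate": 0.0},
    {"name": "profile.sync", "messages": 10, "consumers": 2, "ack_rate": 5.0},
    {"name": "empty.queue", "messages": 0, "consumers": 0, "ack_rate": 0.0},
]
idle_queues = find_idle_queues(sample)
```

Run on a schedule, a non-empty result would be the trigger for an alert.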
...
Lessons Learned
We need to check and handle errors from ‘infrastructure’ more carefully.
The lack of alerting on failures and/or automated data consistency checks meant we were not aware of the problem until notified by users, which is poor.
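An automated consistency check of the kind mentioned above can be sketched as a comparison between the CoJs recorded in TIS Self-Service and those loaded into TIS. This is an illustration under assumptions: both systems are assumed to expose a mapping from a trainee identifier to a signing timestamp, and `find_missing_cojs` and the sample identifiers are hypothetical.

```python
from typing import Dict, List


def find_missing_cojs(signed: Dict[str, str], loaded: Dict[str, str]) -> List[str]:
    """Return trainee IDs whose signed CoJ is absent or out of date in TIS.

    `signed` holds what TIS Self-Service recorded, `loaded` what TIS holds.
    Both map a trainee ID to the signing timestamp (assumed format).
    """
    return sorted(
        trainee_id
        for trainee_id, signed_at in signed.items()
        if loaded.get(trainee_id) != signed_at
    )


# Hypothetical sample data for illustration.
self_service = {"t1": "2023-07-23T23:55", "t2": "2023-07-24T08:10", "t3": "2023-07-22T09:00"}
tis = {"t3": "2023-07-22T09:00"}
missing = find_missing_cojs(self_service, tis)
```

Any identifiers returned are candidates for resending, which mirrors the manual audit and patch carried out during this incident.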