
Date

Authors

Reuben Roberts, Andy Dingley

Status

Done

Summary

Some CoJs signed in TIS Self-Service between the evening of 23rd July and the morning of 24th July were not correctly loaded into TIS, and hence not visible to Local Offices.

Impact

Local Offices saw some trainees as not having signed their CoJs when they had in fact signed them.

...

All times in BST unless indicated

  • 23:50 - RabbitMQ instability begins within the weekly maintenance window (10PM - Midnight UTC / 11PM - 1AM BST).

    [Screenshot of table showing alert]
  • 02:10 - Exceptions in services.

  • 16:15 - RabbitMQ instability ends. Some services (MySQL CDC, ESR Reconciliation) which would have added message load were paused while messages in the broker were consumed and/or cleared out.

  • 10:30 - Missing CoJ in TIS reported.

  • 16:24 - Audit of CoJ messages completed and missing CoJ resent to TIS.

...

Action Items

Owner

CoJ audit to identify data discrepancies

Reuben Roberts

DONE: TIS21-4923

Manual patch to restore data integrity

Reuben Roberts

DONE: TIS21-4924

Improve TIS Self-Service messaging code to detect failures

Reuben Roberts

IN PROGRESS: TIS21-4925

Add monitoring for idle queues (where messages are available, but not being consumed or without a listener)

Joseph (Pepe) Kelly

TODO

Add monitoring for Rabbit broker health

Joseph (Pepe) Kelly

TIS21-4929
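
As a rough illustration of the idle-queue monitoring action above, the sketch below flags queues that hold ready messages but either have no listener or show no delivery activity. It assumes queue statistics shaped like the entries returned by the RabbitMQ management API's `/api/queues` endpoint (`name`, `messages_ready`, `consumers`, `message_stats.deliver_get_details.rate`); these field names should be checked against the broker version in use. The function name, the `min_ready` threshold and the sample queue names are hypothetical, and fetching the statistics and raising the alert are deliberately left out.

```python
from typing import Any, Iterable, List, Mapping


def find_idle_queues(queues: Iterable[Mapping[str, Any]], min_ready: int = 1) -> List[str]:
    """Return sorted names of queues with waiting messages that nothing is consuming."""
    idle = []
    for q in queues:
        # Ignore queues with fewer waiting messages than the threshold.
        if q.get("messages_ready", 0) < min_ready:
            continue
        no_listener = q.get("consumers", 0) == 0
        # A missing message_stats entry is treated as "no deliveries observed".
        rate = q.get("message_stats", {}).get("deliver_get_details", {}).get("rate", 0.0)
        if no_listener or rate <= 0:
            idle.append(q["name"])
    return sorted(idle)


# Hypothetical sample data, not taken from the incident.
sample_queues = [
    {"name": "coj.signed", "messages_ready": 12, "consumers": 0},
    {"name": "esr.reconciliation", "messages_ready": 0, "consumers": 0},
    {"name": "tis.sync", "messages_ready": 5, "consumers": 1},
    {
        "name": "cdc.events",
        "messages_ready": 40,
        "consumers": 2,
        "message_stats": {"deliver_get_details": {"rate": 15.2}},
    },
]

print(find_idle_queues(sample_queues))  # ['coj.signed', 'tis.sync']
```

A check like this would need to tolerate short gaps (for example, by alerting only when a queue stays idle across several consecutive polls) to avoid noise during planned maintenance windows.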

...

Lessons Learned

  • We need to check for and handle errors from ‘infrastructure’ more carefully.

  • The lack of alerting on failures and/or automated data consistency checks meant we were not aware of the problem until notified by users, which is poor.
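
One possible shape for the automated data consistency check mentioned above is a periodic comparison of the CoJs recorded as signed in TIS Self-Service against those loaded into TIS. The sketch below is illustrative only: the record shape, the `trainee_id` and `signed_at` field names, and the function name are assumptions, not the actual TIS data model.

```python
from typing import Any, Iterable, List, Mapping


def find_missing_cojs(
    self_service: Iterable[Mapping[str, Any]],
    tis: Iterable[Mapping[str, Any]],
) -> List[Mapping[str, Any]]:
    """Return Self-Service CoJ records that have no matching record in TIS."""
    # Index the TIS side by (trainee, signature time) for constant-time lookups.
    loaded = {(r["trainee_id"], r["signed_at"]) for r in tis}
    return [r for r in self_service if (r["trainee_id"], r["signed_at"]) not in loaded]


# Hypothetical sample records, not real trainee data.
signed_in_self_service = [
    {"trainee_id": "t-001", "signed_at": "2000-01-01T09:00:00Z"},
    {"trainee_id": "t-002", "signed_at": "2000-01-01T23:55:00Z"},
]
loaded_into_tis = [
    {"trainee_id": "t-001", "signed_at": "2000-01-01T09:00:00Z"},
]

missing = find_missing_cojs(signed_in_self_service, loaded_into_tis)
print([r["trainee_id"] for r in missing])  # ['t-002']
```

Any non-empty result could both trigger an alert and feed a resend, which would surface this kind of discrepancy before users report it.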