
Date

Authors

Reuben Roberts Andy Dingley

Status

In progress

Summary

Some CoJs signed in TIS Self-Service between the evening of 23rd July and the morning of 24th July were not correctly loaded into TIS, and hence not visible to Local Offices.

Impact

Local Offices saw some trainees as not having signed their CoJs when they had in fact signed them.

Non-technical Description

The connection between TIS Self-Service and TIS was intermittently broken on 23-24 July 2023. As a result, CoJs signed in TIS Self-Service during that period were not loaded into TIS, so several programme memberships did not show the correct CoJ status in Admins UI and Tableau.


Trigger

  • RabbitMQ instability in the time period mentioned.


Detection

  • Users reported CoJs missing in TIS (see Timeline); no automated alert flagged the failure or the resulting data discrepancy.


Resolution

  • An audit of CoJs signed during 23-24 July 2023 uncovered 10 that had not been processed correctly and had not been received by TIS. These were manually resent and their correct processing was verified.
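
An audit of this kind amounts to comparing the CoJs recorded as signed in TIS Self-Service with those received by TIS. The following is a minimal sketch only, assuming both sides can be exported as lists of records keyed by a programme membership ID. The field names `programme_membership_id` and `signed_at`, the function `find_missing_cojs` and the sample data are all hypothetical and are not taken from the TIS codebase.

```python
from datetime import datetime


def find_missing_cojs(signed_in_tss, received_in_tis, start, end):
    """Return CoJ records signed within [start, end] that TIS never received.

    All field names are illustrative assumptions, not the real TIS schema.
    """
    received_ids = {r["programme_membership_id"] for r in received_in_tis}
    return [
        r for r in signed_in_tss
        if start <= r["signed_at"] <= end
        and r["programme_membership_id"] not in received_ids
    ]


# Made-up example data: pm-1 was received, pm-3 is outside the audit window.
signed = [
    {"programme_membership_id": "pm-1", "signed_at": datetime(2023, 7, 23, 22, 0)},
    {"programme_membership_id": "pm-2", "signed_at": datetime(2023, 7, 24, 1, 30)},
    {"programme_membership_id": "pm-3", "signed_at": datetime(2023, 7, 20, 9, 0)},
]
received = [{"programme_membership_id": "pm-1"}]

missing = find_missing_cojs(
    signed,
    received,
    start=datetime(2023, 7, 23, 0, 0),
    end=datetime(2023, 7, 24, 23, 59),
)
missing_ids = [r["programme_membership_id"] for r in missing]
print(missing_ids)
```

Run on a schedule rather than by hand, the same comparison could serve as an automated data consistency check.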


Timeline

All times in BST unless indicated

  • : ~23:50 - RabbitMQ instability begins within weekly maintenance window (10PM - Midnight UTC / 11PM - 1AM BST).

    Screenshot of table showing alert
  • : 02:10 - Exceptions in services

  • : ~16:15 - RabbitMQ instability ends. Some services (MySQL CDC, ESR Reconciliation) which would have added message load were paused while messages in the broker were consumed and/or cleared out.

  • : 10:30 - Missing CoJ in TIS reported.

  • : 16:24 - Audit of CoJ messages completed and missing CoJ resent to TIS.

Root Cause(s)

  • RabbitMQ became unstable due to resource limits being reached (memory, due to excessive messages being held and not processed).

  • TIS Self-Service code to submit messages to RabbitMQ did not check for successful processing.

  • CoJs were successfully saved within TIS Self-Service, but not received by TIS due to messaging failure.

  • The lack of alerting of the failures, or of the resulting data discrepancy, meant that we relied on user reports to become aware of the issue.
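
The second root cause, submitting messages without checking the outcome, can be illustrated with a small sketch. It is not the TIS Self-Service implementation: `publish` is a hypothetical callable standing in for whatever sends the message and reports whether the broker acknowledged it (with RabbitMQ, that signal would typically come from publisher confirms). The names `publish_with_confirmation` and `PublishNotConfirmed` are likewise assumptions made for illustration.

```python
import time


class PublishNotConfirmed(Exception):
    """Raised when the broker never confirmed receipt of a message."""


def publish_with_confirmation(publish, message, attempts=3, delay_seconds=0.0):
    """Send `message`, returning only once the broker has confirmed it.

    `publish` is a hypothetical callable: it returns True when the broker
    acknowledges the message, and returns False or raises ConnectionError
    otherwise. Returns the number of attempts that were needed.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            if publish(message):
                return attempt
        except ConnectionError as exc:
            last_error = exc
        if attempt < attempts:
            time.sleep(delay_seconds)
    # Surface the failure so the caller can alert or store the message for resend.
    raise PublishNotConfirmed(
        f"no confirmation after {attempts} attempts"
    ) from last_error


# A fake broker that fails twice and then confirms (illustration only).
calls = {"n": 0}


def flaky_publish(message):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("broker unavailable")
    return True


attempts_needed = publish_with_confirmation(flaky_publish, {"coj": "pm-2"})
print(attempts_needed)
```

The point of raising rather than returning silently is that a failed send becomes visible to the caller, which can then log it, alert on it, or queue the CoJ for a later resend.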


Action Items

| Action Items | Owner | Status |
| CoJ audit to identify data discrepancies | Reuben Roberts | DONE: TIS21-4923 |
| Manual patch to restore data integrity | Reuben Roberts | DONE: TIS21-4924 |
| Improve TIS Self-Service messaging code to detect failures | Reuben Roberts | IN PROGRESS: TIS21-4925 |
| Add monitoring for idle queues (where messages are available, but not being consumed or without a listener) | Joseph (Pepe) Kelly | TODO |
| Add monitoring for Rabbit broker health | Joseph (Pepe) Kelly | TODO |
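
As a rough illustration of the idle-queue check, the sketch below flags queues that hold messages but have no consumer attached. It assumes queue statistics have already been fetched as a list of dictionaries with `name`, `messages` and `consumers` fields, as returned by the RabbitMQ management API's `/api/queues` endpoint. The function name `find_idle_queues` and the queue names are hypothetical. Detecting queues that have a listener but are still not draining would additionally need a comparison of successive samples, which is not shown here.

```python
def find_idle_queues(queues, min_messages=1):
    """Return names of queues holding messages with no consumer attached.

    `queues` is a list of dicts with 'name', 'messages' and 'consumers'
    keys, e.g. parsed from the management API's /api/queues response.
    """
    return [
        q["name"]
        for q in queues
        if q.get("messages", 0) >= min_messages and q.get("consumers", 0) == 0
    ]


# Made-up sample; the queue names are illustrative only.
sample = [
    {"name": "coj.signed", "messages": 42, "consumers": 0},
    {"name": "form.updated", "messages": 5, "consumers": 2},
    {"name": "empty.queue", "messages": 0, "consumers": 0},
]

idle = find_idle_queues(sample)
print(idle)
```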


Lessons Learned

  • We need to check and handle errors from ‘infrastructure’ more carefully

  • The lack of alerting on failures and/or automated data consistency checks meant we were not aware of the problem until notified by users, which is poor.
