2023-07-23 Some signed CoJ not visible to Local Offices

Date

Jul 23, 2023

Authors

@Reuben Roberts @Andy Dingley

Status

Done

Summary

Some CoJs signed in TIS Self-Service between the evening of 23rd July and the morning of 24th July were not correctly loaded into TIS, and hence not visible to Local Offices.

Impact

Local Offices saw some trainees as not having signed their CoJs when they had in fact signed them.

Non-technical Description

The connection between TIS Self-Service and TIS was intermittently broken on 23-24 July 2023. As a result, CoJs signed in TIS Self-Service during that period were not loaded into TIS, so several programme memberships did not show the correct CoJ status in the Admin UI and Tableau.


Trigger

  • RabbitMQ instability in the time period mentioned.


Detection

  • Not detected by monitoring or alerting. The missing CoJs were reported by users on Aug 4, 2023.


Resolution

  • An audit of CoJs signed during 23-24 July 2023 uncovered 10 that had not been processed correctly and so had not been received by TIS. These were manually resent, and their correct processing was verified.
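
An audit of this kind amounts to comparing the CoJs recorded as signed in TIS Self-Service with those received by TIS. The following is a minimal sketch of that comparison, not the audit that was actually run: it assumes each signed CoJ can be identified by a programme membership ID and a signing timestamp, and the record shape, field names and function name are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List


@dataclass(frozen=True)
class SignedCoj:
    # Hypothetical record shape; the real field names may differ.
    programme_membership_id: str
    signed_at: datetime


def find_missing_cojs(
    signed_in_self_service: Iterable[SignedCoj],
    received_ids: Iterable[str],
    start: datetime,
    end: datetime,
) -> List[SignedCoj]:
    """Return CoJs signed within [start, end] whose IDs were never received downstream."""
    received = set(received_ids)
    return [
        coj
        for coj in signed_in_self_service
        if start <= coj.signed_at <= end
        and coj.programme_membership_id not in received
    ]
```

Anything returned by `find_missing_cojs` is a candidate for resending; running the same comparison on a schedule would turn the one-off audit into an automated consistency check.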


Timeline

All times in BST unless indicated

  • Jul 23, 2023: ~23:50 - RabbitMQ instability begins within weekly maintenance window (10PM - Midnight UTC / 11PM - 1AM BST).

    Screenshot of table showing alert
  • Jul 24, 2023: 02:10 - Exceptions in services

  • Jul 24, 2023: ~16:15 - RabbitMQ instability ends. Some services (MySQL CDC, ESR Reconciliation) which would have added message load were paused while messages in the broker were consumed and/or cleared out.

  • Aug 4, 2023: 10:30 - Missing CoJ in TIS reported.

  • Aug 7, 2023: 16:24 - Audit of CoJ messages completed and missing CoJ resent to TIS.

Root Cause(s)

  • RabbitMQ became unstable due to resource limits being reached (memory, due to excessive messages being held and not processed).

  • The TIS Self-Service code that submits messages to RabbitMQ did not check that they had been processed successfully.

  • CoJs were successfully saved within TIS Self-Service, but not received by TIS due to messaging failure.

  • The lack of alerting of the failures, or of the resulting data discrepancy, meant that we relied on user reports to become aware of the issue.
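
The second and third root causes describe a publish step whose outcome was never checked. One common mitigation is to treat a message as sent only once the broker confirms it, and to keep a record of anything unconfirmed so that it can be resent. The sketch below illustrates that idea and is not the TIS Self-Service implementation: `publish` stands in for whatever client call sends the message and is assumed to return `True` only when the broker has confirmed receipt, and `PendingStore` and the function names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


class PendingStore:
    """Hypothetical in-memory outbox; a real one would be persisted alongside the CoJ."""

    def __init__(self) -> None:
        self._pending: Dict[str, dict] = {}

    def add(self, message_id: str, payload: dict) -> None:
        self._pending[message_id] = payload

    def remove(self, message_id: str) -> None:
        self._pending.pop(message_id, None)

    def items(self) -> List[Tuple[str, dict]]:
        # Return a copy so callers can modify the store while iterating.
        return list(self._pending.items())

    def pending_ids(self) -> List[str]:
        return sorted(self._pending)


def send_with_confirmation(
    message_id: str,
    payload: dict,
    publish: Callable[[dict], bool],
    store: PendingStore,
) -> bool:
    """Record the message first, then clear the record only if the publish is confirmed."""
    store.add(message_id, payload)
    try:
        confirmed = publish(payload)
    except Exception:
        # Any failure to publish is treated as unconfirmed; the message stays pending.
        confirmed = False
    if confirmed:
        store.remove(message_id)
    return confirmed


def resend_pending(publish: Callable[[dict], bool], store: PendingStore) -> List[str]:
    """Retry every unconfirmed message and return the IDs that are still unconfirmed."""
    for message_id, payload in store.items():
        send_with_confirmation(message_id, payload, publish, store)
    return store.pending_ids()
```

With this shape, a non-empty `pending_ids()` is also a natural signal to alert on, rather than waiting for a user report.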


Action Items

| Action Items | Owner | Status / Ticket |
|---|---|---|
| CoJ audit to identify data discrepancies | @Reuben Roberts | DONE: https://hee-tis.atlassian.net/browse/TIS21-4923 |
| Manual patch to restore data integrity | @Reuben Roberts | DONE: https://hee-tis.atlassian.net/browse/TIS21-4924 |
| Improve TIS Self-Service messaging code to detect failures | @Reuben Roberts | IN PROGRESS: https://hee-tis.atlassian.net/browse/TIS21-4925 |
| Add monitoring for idle queues (where messages are available, but not being consumed or without a listener) | @Joseph (Pepe) Kelly | TODO |
| Add monitoring for Rabbit broker health | @Joseph (Pepe) Kelly | https://hee-tis.atlassian.net/browse/TIS21-4929 |
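
The two monitoring items above could be approached by polling broker statistics. The following is a minimal sketch, assuming statistics shaped like the JSON returned by the RabbitMQ management HTTP API (queue objects with `name`, `messages` and `consumers`; node objects with `name`, `mem_used`, `mem_limit` and `mem_alarm`). The function names and the 0.8 threshold are illustrative assumptions, and fetching the data and raising the alert are left out.

```python
from typing import Iterable, List


def find_idle_queues(queues: Iterable[dict]) -> List[str]:
    """Return the names of queues that hold messages but have no consumer attached."""
    return [
        q["name"]
        for q in queues
        if q.get("messages", 0) > 0 and q.get("consumers", 0) == 0
    ]


def find_memory_pressure(nodes: Iterable[dict], threshold: float = 0.8) -> List[str]:
    """Return the names of nodes with a memory alarm or memory use at or above the threshold."""
    flagged = []
    for node in nodes:
        limit = node.get("mem_limit", 0)
        # Avoid dividing by zero if the limit is missing or reported as 0.
        ratio = node.get("mem_used", 0) / limit if limit else 0.0
        if node.get("mem_alarm", False) or ratio >= threshold:
            flagged.append(node["name"])
    return flagged
```

This only covers queues with no listener. Detecting a queue that has a consumer but is not draining would need a comparison of queue depth over successive polls.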


Lessons Learned

  • We need to check for and handle errors from ‘infrastructure’ more carefully.

  • The lack of alerting on failures and/or automated data consistency checks meant we were not aware of the problem until notified by users, which is poor.