2023-07-23 Some signed CoJ not visible to Local Offices
Date | Jul 23, 2023 |
Authors | @Reuben Roberts @Andy Dingley |
Status | Done |
Summary | Some CoJs signed in TIS Self-Service between the evening of 23rd July and the morning of 24th July were not correctly loaded into TIS, and hence not visible to Local Offices. |
Impact | Local Offices saw some trainees as not having signed their CoJs when they had in fact signed them. |
Non-technical Description
The connection between TIS Self-Service and TIS was intermittently broken on 23-24 July 2023. As a result, some CoJs signed in TIS Self-Service during that period were not loaded into TIS, so several programme memberships showed an incorrect CoJ status in Admins UI and Tableau.
Trigger
RabbitMQ instability between approximately 23:50 BST on 23 July and 16:15 BST on 24 July 2023.
Detection
Missing CoJ reported at 10:30 BST on 4 Aug 2023 in the self-service-support Slack channel (https://hee-nhs-tis.slack.com/archives/G0135LS4JVA/p1691141400259379), with another example reported shortly afterwards.
Resolution
An audit of CoJs signed during 23-24 July 2023 uncovered 10 that had not been correctly processed and received by TIS. These were manually resent, and their correct processing was verified.
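An audit of this kind amounts to comparing what was signed in TIS Self-Service against what TIS received. The sketch below is illustrative only and is not the script used during the incident: the function name, the use of programme membership IDs as keys, and the in-memory inputs are all assumptions.

```python
from datetime import datetime
from typing import Dict, List, Set


def find_missing_cojs(
    signed_in_tss: Dict[str, datetime],
    received_by_tis: Set[str],
    start: datetime,
    end: datetime,
) -> List[str]:
    """Return IDs signed in TIS Self-Service within [start, end] that TIS never received.

    Hypothetical inputs: `signed_in_tss` maps a programme membership ID to its
    signing time, and `received_by_tis` is the set of IDs TIS holds a CoJ for.
    """
    return sorted(
        pm_id
        for pm_id, signed_at in signed_in_tss.items()
        if start <= signed_at <= end and pm_id not in received_by_tis
    )
```

The returned IDs are the candidates for a manual resend. Running the same comparison on a schedule would be one way to catch discrepancies before users report them.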
Timeline
All times are in BST unless otherwise indicated.
Jul 23, 2023: ~23:50 - RabbitMQ instability begins within weekly maintenance window (10PM - Midnight UTC / 11PM - 1AM BST).
Jul 24, 2023: 02:10 - Exceptions in services
Jul 24, 2023: ~16:15 - RabbitMQ instability ends. Some services (MySQL CDC, ESR Reconciliation) which would have added message load were paused while messages in the broker were consumed and/or cleared out.
Aug 4, 2023: 10:30 - Missing CoJ in TIS reported.
Aug 7, 2023: 16:24 - Audit of CoJ messages completed and missing CoJ resent to TIS.
Root Cause(s)
RabbitMQ became unstable due to resource limits being reached (memory, due to excessive messages being held and not processed).
The TIS Self-Service code that submits messages to RabbitMQ did not check that they had been processed successfully.
CoJs were successfully saved within TIS Self-Service, but not received by TIS due to messaging failure.
The lack of alerting on the failures, or on the resulting data discrepancy, meant that we relied on user reports to become aware of the issue.
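The gap in the submission code can be illustrated with a minimal sketch of a publisher that checks the outcome of each send and keeps any unconfirmed message for alerting and resending. This is not the TIS Self-Service implementation: `ConfirmingPublisher` and its `send` callable are hypothetical stand-ins for a real broker client, and `send` is assumed to return True only once the broker has confirmed the message.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ConfirmingPublisher:
    """Wraps a broker publish call and records messages that were never confirmed.

    `send` is a hypothetical callable standing in for the real broker client.
    """

    send: Callable[[dict], bool]
    max_attempts: int = 3
    failed: List[dict] = field(default_factory=list)

    def publish(self, message: dict) -> bool:
        for _ in range(self.max_attempts):
            try:
                if self.send(message):
                    return True  # Broker confirmed receipt.
            except ConnectionError:
                # Broker unreachable: count this as an unconfirmed attempt.
                pass
        # Never confirmed: keep the message so it can be alerted on and resent.
        self.failed.append(message)
        return False
```

With this shape, a CoJ that is saved locally but never confirmed by the broker ends up in `failed` rather than being silently lost.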
Action Items
Action Items | Owner | Status |
---|---|---|
CoJ audit to identify data discrepancies | @Reuben Roberts | |
Manual patch to restore data integrity | @Reuben Roberts | |
Improve TIS Self-Service messaging code to detect failures | @Reuben Roberts | IN PROGRESS: https://hee-tis.atlassian.net/browse/TIS21-4925 |
Add monitoring for idle queues (where messages are available, but not being consumed or without a listener) | @Joseph (Pepe) Kelly | TODO |
Add monitoring for Rabbit broker health | @Joseph (Pepe) Kelly | |
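The idle-queue check described in the action items could look roughly like the sketch below, which compares two snapshots of queue statistics. It is an assumption-laden illustration: `find_idle_queues` is a hypothetical helper, the `messages` and `consumers` fields are assumed to mirror what the broker's management interface reports, and `delivered_total` is a hypothetical cumulative delivery counter.

```python
from typing import Dict, List


def find_idle_queues(previous: Dict[str, dict], current: Dict[str, dict]) -> List[str]:
    """Return names of queues that hold messages but are not being consumed.

    Each snapshot maps a queue name to a dict with the (assumed) keys
    `messages`, `consumers` and `delivered_total`.
    """
    idle = []
    for name, stats in current.items():
        if stats["messages"] == 0:
            continue  # Nothing waiting, so nothing to alert on.
        no_listener = stats["consumers"] == 0
        before = previous.get(name)
        # Messages are waiting but none were delivered since the last snapshot.
        not_draining = (
            before is not None
            and stats["delivered_total"] == before["delivered_total"]
        )
        if no_listener or not_draining:
            idle.append(name)
    return sorted(idle)
```

An alert raised when a queue appears in this list for several consecutive snapshots would avoid firing on brief pauses.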
Lessons Learned
We need to check for and handle errors from ‘infrastructure’ (such as the RabbitMQ broker) more carefully.
The lack of alerting on failures and/or automated data consistency checks meant we were not aware of the problem until notified by users, which is poor.
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213