2024-11-07 New trainee profiles not triggering a full TSS data sync
Date | Nov 7, 2024 |
Authors | @Reuben Roberts |
Status | Done |
Summary | The requests for a full data sync for new trainee profiles were not being processed, resulting in missing data (e.g. email address) in trainee profiles. |
Impact | Onboarding emails were not being sent (recorded as ‘FAILED’) since TSS had no record of some new trainee email addresses. |
Non-technical Description
When a trainee profile is created for the first time, TSS sends a “DataRequest” message to TIS to trigger a full data sync for the matching Person record. However, the TSS sync service has been failing to process those messages since August 2024. As a consequence, these new profiles may be missing key information that is needed to populate and send onboarding notifications, e.g. trainee email address, since TSS would otherwise only get this information if it happened to be updated in TIS after the trainee profile was created.
A large number of recent trainee onboarding notifications are recorded as ‘failed’ due to a missing email address.
Trigger
A version update of a library component used to send and receive messages (spring-cloud-aws upgraded from 2.4.x to 3.1.x, which we implemented in Dec 2023) introduced a deserialization message attribute with a class from the tis-trainee-details service (JavaType=uk.nhs.hee.trainee.details.event.ProfileCreateEvent), which the SQS listener in tis-trainee-sync then unsuccessfully attempted to use to deserialize the message, since it does not have access to that class.
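The failure mode can be illustrated with a minimal, self-contained sketch. It is not the spring-cloud-aws implementation: the class and method names (TypeHeaderDemo, resolveTargetType, ProfileCreatedMessage) are hypothetical, and the lookup logic is an assumed simplification. Only the attribute name and class name are taken from the incident. It shows a receiver that trusts a sender-supplied type attribute failing when the named class is not on its classpath, and falling back to its own type when the attribute is absent.

```java
import java.util.Map;

/** Simplified, hypothetical model of attribute-driven payload type resolution. */
public class TypeHeaderDemo {

    // Attribute name as seen in the incident; everything else here is illustrative.
    static final String TYPE_ATTRIBUTE = "JavaType";

    /** Hypothetical stand-in for the receiver's own view of the event. */
    static final class ProfileCreatedMessage {
    }

    /**
     * Picks the class to deserialize into: the class named by the type
     * attribute if one is present, otherwise the listener's default type.
     */
    public static Class<?> resolveTargetType(Map<String, String> attributes,
                                             Class<?> listenerDefault)
            throws ClassNotFoundException {
        String typeName = attributes.get(TYPE_ATTRIBUTE);
        if (typeName == null) {
            return listenerDefault;
        }
        // Throws if the named class is not available to the receiver.
        return Class.forName(typeName);
    }

    public static void main(String[] args) throws ClassNotFoundException {
        Map<String, String> withAttribute = Map.of(
                TYPE_ATTRIBUTE, "uk.nhs.hee.trainee.details.event.ProfileCreateEvent");
        try {
            resolveTargetType(withAttribute, ProfileCreatedMessage.class);
            System.out.println("Resolved sender-supplied type");
        } catch (ClassNotFoundException e) {
            // The receiver does not ship the sender's class, so it cannot deserialize.
            System.out.println("Cannot deserialize, unknown class: " + e.getMessage());
        }

        // With no type attribute, the receiver uses its own type.
        Class<?> fallback = resolveTargetType(Map.of(), ProfileCreatedMessage.class);
        System.out.println("No type attribute, using " + fallback.getSimpleName());
    }
}
```

The second call in main is the reason that omitting the attribute on the sending side is enough to let the receiver process the message again, assuming the receiver has a suitable type of its own to bind to.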
There was no DLQ for the profile-created queue, no AWS alarms existed for it, and the Sentry error-reporting service appears to have been misconfigured for the sync service. As a consequence, the error remained undetected.
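To show why a DLQ with an alarm would have surfaced the problem, here is a small in-memory sketch. It does not use the AWS SDK: DeadLetterDemo, process and alarm are hypothetical names, and the retry limit only loosely models an SQS redrive policy. Messages that fail every delivery attempt are collected in a dead-letter list, and any non-empty list counts as an alarm condition.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** In-memory illustration of dead-lettering after repeated failures. */
public class DeadLetterDemo {

    /**
     * Delivers each message to the handler up to maxReceiveCount times.
     * Messages that fail on every attempt are returned as dead letters.
     */
    public static List<String> process(List<String> messages,
                                       Consumer<String> handler,
                                       int maxReceiveCount) {
        List<String> deadLetters = new ArrayList<>();
        for (String message : messages) {
            boolean handled = false;
            for (int attempt = 1; attempt <= maxReceiveCount && !handled; attempt++) {
                try {
                    handler.accept(message);
                    handled = true;
                } catch (RuntimeException e) {
                    // Failed attempt: the message is retried until the limit is reached.
                }
            }
            if (!handled) {
                // Instead of being lost silently, the message is kept for inspection.
                deadLetters.add(message);
            }
        }
        return deadLetters;
    }

    /** Alarm condition: any dead-lettered message needs attention. */
    public static boolean alarm(List<String> deadLetters) {
        return !deadLetters.isEmpty();
    }

    public static void main(String[] args) {
        // Hypothetical handler that cannot deserialize some messages.
        Consumer<String> handler = message -> {
            if (message.startsWith("bad")) {
                throw new IllegalStateException("cannot deserialize " + message);
            }
        };
        List<String> deadLetters = process(List.of("ok-1", "bad-2", "ok-3"), handler, 3);
        System.out.println("Dead letters: " + deadLetters);
        System.out.println("Alarm raised: " + alarm(deadLetters));
    }
}
```

In this model every undeserializable message ends up somewhere visible, which is the property the profile-created queue lacked.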
Detection
Issue of large numbers of ‘failed’ onboarding emails reported to TSS team 7 Nov 2024.
5 Whys (or other analysis of Root Cause)
Onboarding notification emails were not sent because no trainee email address was available in TSS.
No trainee email address was available in TSS because it had not been sync’d from TIS.
Data had not been sync’d from TIS because no full-data request had been made from TSS, and the records in question had not been updated in TIS since the profile was created (an update would have triggered a normal sync).
No full-data request had been made from TSS because the TSS sync service could not deserialize the messages instructing it to make the request.
The TSS sync service could not deserialize the messages because a new message attribute instructed it to deserialize to a class that was not available to it.
The new message attribute was included as part of a major version update to the library component that handles messaging.
The breaking change was not detected by automated tests within individual service components, because it only impacts messaging between components, for which we have no automated tests.
The subsequent failures in production were not detected because of a lack of error detection and reporting for the specific message queue and the sync component.
Resolution
tis-trainee-details configured to omit the JavaType attribute in messages it sends.
Timeline
All times GMT unless otherwise indicated.
Nov 7, 2024 10:30 As part of the discussion around https://hee-tis.atlassian.net/browse/TIS21-6673 , it was mentioned that unusually large numbers of emails for trainee onboarding were being recorded as ‘failed’ due to no email address, but that the trainees in question had email addresses in TIS.
Nov 7, 2024 ~21:00 All production data redriven to ensure trainees have email addresses and other profile data populated.
Nov 12, 2024 ~13:00 Fix made to tis-trainee-details to remove the JavaType attribute from profile-created messages.
Nov 13, 2024 ~13:00 DLQ and CloudWatch alarm added for profile-created queues.
Nov 15, 2024 ~11:00 Sentry configuration corrected.
Action Items
Action Items | Owner | |
---|---|---|
As per ticket | Various | Done |
Lessons Learned
Integration tests over microservices may be a bit of an anti-pattern, but alerts and DLQs are essential to catch unobserved errors.
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213