
Date

Authors

Reuben Roberts

Status

Done

Summary

Some TISSS trainee profiles have incomplete Programme Membership data

Impact

TISSS trainee profile data missing for some trainees

...

Trainee profile data is copied into TIS Self-Service from TIS. This happens whenever data is added or updated in TIS. Our monitoring tool alerted us that some TIS programme membership data was repeatedly failing to load into TIS Self-Service. The issue arose on 28 Dec 2022, and was resolved on 12 Jan 2023.

...

Trigger

  • Database writes were compromised by excess optimistic locking errors.

...

Detection

  • CloudWatch Alarm in the #monitoring Slack channel for messages in the tis-trainee-sync-prod-dlq (Since 2022-12-28 10:23:34)

...


Timeline

All times in GMT unless indicated

  • : 09:00 - First CloudWatch Alarm in the #monitoring Slack channel for messages in the tis-trainee-sync-prod-dlq

  • : 12:44 - Investigation into the messages on the DLQ

  • 11:25 - Alerted users on the TIS Self-Service Teams channel that we would be doing maintenance, and proceeded with the following:

    • Clear down the curriculumMembership table in the sync database

    • Clear down the tis-trainee-sync-prod-dlq

    • Reload the TCS CurriculumMembership data in the DMS

  • 11:00 (approx) Redrive all messages in the DLQ

  • 11:20 Reworked MongoDB cluster deployed for tis-trainee-sync (prod)

  • 12:27 Reworked MongoDB cluster deployed for tis-trainee-details (prod)

  • 16:30 No messages have arrived in tis-trainee-sync-prod-dlq

Root Cause(s)

  • Valid Curriculum Membership records were unexpectedly accumulating in the tis-trainee-sync-prod-dlq, and traineeProfile records were not being updated

  • tis-trainee-details was throwing lots of optimistic record locking exceptions for these updates, which was causing the tis-trainee-sync message processing to fail

  • Our hypothesis is that changes to the MongoDB cluster in early December 2022 are implicated in this, since no other obvious changes to infrastructure or processing have been made recently

    • The use of cluster replicas for record reads may have increased our processing speed in tis-trainee-sync. Since the changes are pushed to tis-trainee-details using REST requests, there is no throttling of this, resulting in a much higher collision rate as the same records are updated almost simultaneously by different instances of tis-trainee-details.

    • The use of cluster replicas for record reads may retrieve stale data if the same record is written and then read in short order (10-20ms, before eventual consistency is reached). This could result in excess optimistic locking failures if a stale trainee profile is retrieved from a cluster replica (that is, an update to that profile document has been made on the master node, but the change has not yet been distributed to the other nodes). Any update to a document retrieved this way will always fail, since it was not the latest version when it was retrieved.
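
The failure mode hypothesised above can be illustrated with a minimal sketch. It is not the tis-trainee-details implementation: `VersionedStore`, `OptimisticLockError` and the document fields are hypothetical names, and an in-memory dictionary stands in for the database. It only shows why a write based on an out-of-date version is rejected, and why a retry succeeds once the latest version has been re-read.

```python
class OptimisticLockError(Exception):
    """Raised when a write is based on an out-of-date document version."""


class VersionedStore:
    """Hypothetical in-memory stand-in for a store with a version field."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, data)

    def read(self, doc_id):
        version, data = self._docs.get(doc_id, (0, {}))
        return version, dict(data)  # copy, so callers cannot mutate the store

    def write(self, doc_id, expected_version, data):
        current_version, _ = self._docs.get(doc_id, (0, {}))
        if current_version != expected_version:
            # The caller read an older version than the one now stored.
            raise OptimisticLockError(
                f"expected v{expected_version}, found v{current_version}"
            )
        self._docs[doc_id] = (current_version + 1, dict(data))
        return current_version + 1


store = VersionedStore()
store.write("trainee-1", 0, {"programme": "A"})

# Two writers obtain the same version, either because they read almost
# simultaneously or because one of them read a stale replica copy.
v_first, doc_first = store.read("trainee-1")
v_second, doc_second = store.read("trainee-1")

doc_first["curriculum"] = "X"
store.write("trainee-1", v_first, doc_first)  # accepted, version is now 2

doc_second["curriculum"] = "Y"
try:
    store.write("trainee-1", v_second, doc_second)
    second_write_failed = False
except OptimisticLockError:
    second_write_failed = True  # the stale version is rejected

# Re-reading the latest version and retrying succeeds.
v_retry, doc_retry = store.read("trainee-1")
doc_retry["curriculum"] = "Y"
final_version = store.write("trainee-1", v_retry, doc_retry)
```

If the retry itself reads from a replica that has not yet caught up, it would fail again in the same way, which is consistent with the repeated failures described above.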

...

Action Items

Owner

Clear and resync curriculumMembership data (NO-TICKET)

Reuben Roberts

DONE

TIS21-4064

John Simmons (Deactivated)

Add queue between tis-trainee-sync and tis-trainee-details

DONE

TIS21-4076

Reuben Roberts

TICKETED

...

Lessons Learned

  • We need to investigate Production DLQ messages more promptly, since they indicate data issues that will affect PGDiTs