...
: 09:00 - First CloudWatch Alarm in the #monitoring Slack channel for messages in the tis-trainee-sync-prod-dlq
: 12:44 - Investigation into the messages on the DLQ
Root Cause(s)
Valid Curriculum Membership records were unexpectedly accumulating in the tis-trainee-sync-prod-dlq, and traineeProfile records were not being updated
tis-trainee-details was throwing a large number of optimistic locking exceptions for these updates, which caused message processing in tis-trainee-sync to fail
Our hypothesis is that the changes made to the MongoDB cluster in early December 2022 are implicated, since no other obvious changes to infrastructure or processing have been made recently
Reading records from cluster replicas may have increased processing speed in tis-trainee-sync. Because changes are pushed to tis-trainee-details through REST requests with no throttling, different instances of tis-trainee-details would then update the same records almost simultaneously, producing a much higher collision rate.
Reading records from cluster replicas may also return stale data if the same record is written and then read in short order (10-20ms, before eventual consistency is reached). If a stale trainee profile is retrieved from a replica (that is, an update to that profile document has been made on the primary node, but the change has not yet been distributed to the other nodes), any update based on that document will always fail, since it was not the latest version when it was retrieved. This could account for the excess optimistic locking failures.
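The stale-read hypothesis can be illustrated with a minimal, self-contained sketch. It is not the tis-trainee-details implementation: `Primary`, `Replica`, `Profile`, and `OptimisticLockError` are hypothetical in-memory stand-ins, and replication lag is simulated by only copying data to the replica when `sync()` is called.

```python
from dataclasses import dataclass, replace


class OptimisticLockError(Exception):
    """Raised when a write is based on an out-of-date document version."""


@dataclass(frozen=True)
class Profile:
    trainee_id: str
    version: int
    curriculum: str


class Primary:
    """Hypothetical stand-in for the primary node; accepts all writes."""

    def __init__(self):
        self.docs = {}

    def insert(self, trainee_id, curriculum):
        self.docs[trainee_id] = Profile(trainee_id, 0, curriculum)

    def read(self, trainee_id):
        return self.docs[trainee_id]

    def save(self, doc):
        # Optimistic locking: accept the write only if the caller's copy
        # carries the version that is currently stored.
        current = self.docs[doc.trainee_id]
        if current.version != doc.version:
            raise OptimisticLockError(
                f"expected version {current.version}, got {doc.version}"
            )
        self.docs[doc.trainee_id] = replace(doc, version=doc.version + 1)
        return self.docs[doc.trainee_id]


class Replica:
    """Hypothetical stand-in for a read replica that lags until sync()."""

    def __init__(self, primary):
        self.primary = primary
        self.docs = {}

    def sync(self):
        self.docs = dict(self.primary.docs)

    def read(self, trainee_id):
        return self.docs[trainee_id]


primary = Primary()
replica = Replica(primary)
primary.insert("t1", "A")
replica.sync()

# First writer updates the profile on the primary (version 0 -> 1).
primary.save(replace(primary.read("t1"), curriculum="B"))

# Second writer reads from the replica before it has caught up,
# so it receives the old version 0 document.
stale = replica.read("t1")
try:
    primary.save(replace(stale, curriculum="C"))
    stale_write_failed = False
except OptimisticLockError:
    stale_write_failed = True
```

In this sketch the second write cannot succeed however quickly it is sent, because the document was already out of date when it was read. Only a fresh read, taken after the replica has caught up or directly from the primary, produces a version that the save will accept.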
...
Action Items
Action Items | Owner |
---|---|
Clear and resync curriculumMembership data (NO-TICKET) | DONE |
Add queue between tis-trainee-sync and tis-trainee-details | |
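
As a rough illustration of why a queue between the two services might reduce collisions, the sketch below applies a burst of updates for the same trainee through a single consumer, so each read-modify-write sees the result of the previous one. It rests on assumptions: Python's in-process `queue.Queue` stands in for whichever message queue is chosen, and `consume`, `profiles`, and the update tuples are hypothetical names, not part of either service.

```python
import queue
import threading


def consume(updates, profiles):
    """Apply queued updates one at a time until the sentinel arrives."""
    while True:
        item = updates.get()
        if item is None:  # sentinel: no more work
            break
        trainee_id, field, value = item
        doc = profiles[trainee_id]
        # Read-modify-write happens sequentially, so the version read here
        # is always the latest one and no update is rejected.
        profiles[trainee_id] = {**doc, field: value, "version": doc["version"] + 1}


updates = queue.Queue()
profiles = {"t1": {"version": 0}}

# Hypothetical burst of five updates for the same trainee profile.
for i in range(5):
    updates.put(("t1", f"membership_{i}", "updated"))
updates.put(None)

worker = threading.Thread(target=consume, args=(updates, profiles))
worker.start()
worker.join()
```

With direct, unthrottled REST calls the same five updates could be handled concurrently by different instances and collide; here they are serialized, at the cost of throughput being bounded by the single consumer.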
...
Lessons Learned
We need to investigate Production DLQ messages more promptly, since they indicate data issues that will affect PGDiTs