2025-04-04 HLE - Reval gets out of sync
Background
The Reval Recommendation and Connection lists show data from Elasticsearch (indices: recommendationindex & masterdoctorindex).
A full sync syncs the current data and inserts it into the ES indices.
After the full sync:
When there’s an update on TIS, TIS syncs the single doctor's info to Reval.
Every minute, the CDC process picks up the changes in the Reval DB (DoctorsForDB & Recommendation collections) and propagates them to Elasticsearch.
Data sources are:
TIS data
Revalidation data
Current issue
Reval data sometimes seems unreliable to admin users.
Why?
[Cai’s notes:
1. Reval is “passive”, so if an update is missed or a message fails, there’s no mechanism to refetch
2. Some unknown issue is resulting in duplicates - probably some bug in the upsert logic
]
Is this mostly a TIS issue? How do we know?
Temporary fix
Trigger a full resync
Data flow is as below (embedded from Elastic Search Rebuild Sync Job):
Issues and risks
How long does it take?
Taking 19/03/2025 for example, it took 1 hour and 45 mins (16:50 - 18:35) to clear the messages in reval.queue.connection.syncdata, and it took about 4 hours and 40 mins (18:37 - 23:17) to clear the messages in the SQS queue tis-revalidation-sync-gmc-queue-prod.
Monitoring:
https://monitoring.tis.nhs.uk/grafana/d/Upe4ssXMk/rabbitmq-metrics-from-rabbitmq-exporter-granular?from=2025-03-19T16:00:00.000Z&to=2025-03-19T23:59:59.000Z&timezone=browser&var-datasource=e7UUtnuMk&var-environment=PROD&var-service=reval&refresh=5s
https://eu-west-2.console.aws.amazon.com/sqs/v3/home?region=eu-west-2#/queues/https%3A%2F%2Fsqs.eu-west-2.amazonaws.com%2F430723991443%2Ftis-revalidation-sync-gmc-queue-prod
The discrepancies alias still needs to be added manually
So we need to wait for a few hours for the sync process to finish and then add the alias.
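For reference, aliases are added through Elasticsearch's `_aliases` endpoint. The sketch below only builds the request body for `POST /_aliases`. It assumes the alias is named `discrepancies` and sits on `masterdoctorindex`; this page does not state the target index or any filter the alias needs, so both are placeholders to check against the real setup.

```python
import json
from typing import Optional


def build_add_alias_request(index: str, alias: str,
                            filter_query: Optional[dict] = None) -> dict:
    """Build the JSON body for POST /_aliases that adds a single alias."""
    action = {"index": index, "alias": alias}
    if filter_query is not None:
        # Optional query restricting which documents the alias exposes.
        action["filter"] = filter_query
    return {"actions": [{"add": action}]}


# Assumed names: the index behind the discrepancies alias is not confirmed here.
body = build_add_alias_request("masterdoctorindex", "discrepancies")
print(json.dumps(body, indent=2))
```

Generating the body from code would make it straightforward to run this step automatically once the sync queues have drained, rather than waiting a few hours and adding it by hand.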
Long term fix
Making the sync faster and more reliable as a recovery system / reducing the complexity of syncing in general
[Cai’s notes: it’s worth noting that the “full” sync system shares a lot of components with the CDC sync system, so improving efficiency in these areas will also improve the general running of the system]
Periodically run Reval resync as a scheduled job
On TIS, every night we do a sync from TIS DB to ES person index to make sure TIS person list is updated. If we can reduce the sync time for Reval, we probably will be able to do Reval sync periodically.
Challenge: how to reduce Reval sync time? - Batch processing [Cai’s notes: SQS processing takes the majority of the time; if we introduced batch processing it may cut this down a huge amount, as seen in ESR]
Get rid of recommendationindex and use another alias on masterdoctorindex for the recommendation list (https://hee-tis.atlassian.net/browse/TIS21-5482)
Advantage: no need to maintain 2 indices of duplicate data
[Cai’s notes: Also this contributes a lot to the amount of time it takes to run the sync, and introduces additional complexity and points of failure for messages]
Disadvantage: loss of flexibility
Support single doctor resyncing from TIS? - is this already supported? [Cai’s notes: Yes, in that the query used to fetch TIS data already allows for gmcnumber(s) to be specified]
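The batch-processing idea under "Challenge" above can be sketched as follows. This is a minimal illustration, not the current implementation: it groups items into batches of at most 10, which is the limit on entries in a single SQS batch call, and shapes them as `Id`/`MessageBody` entries. Using the GMC number as the message body is an assumption made for the example, and whether batching would apply to sending, receiving, or both is not decided on this page.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

SQS_MAX_BATCH_SIZE = 10  # SQS batch calls accept at most 10 entries


def chunked(items: Iterable[T], size: int = SQS_MAX_BATCH_SIZE) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        # Emit the final, partially filled batch.
        yield batch


def to_batch_entries(gmc_numbers: List[str]) -> List[dict]:
    """Shape one batch as SQS entries; Id only needs to be unique within the batch."""
    return [{"Id": str(i), "MessageBody": gmc} for i, gmc in enumerate(gmc_numbers)]


# Hypothetical input: 25 made-up GMC numbers become 3 requests instead of 25.
gmc_numbers = [str(7000000 + n) for n in range(25)]
batches = [to_batch_entries(chunk) for chunk in chunked(gmc_numbers)]
print([len(b) for b in batches])
```

With one message per request, the number of round trips grows linearly with the number of doctors; batching divides it by up to ten, which is where most of the saving would be expected to come from.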
Making Reval pipelines more robust
https://hee-tis.atlassian.net/browse/TIS21-6275 - this will hopefully remove the perception of “lag”
Refactoring the upsert logic in integration service (logic for matching tis id vs gmc id, null and empty checks, use of transactions)
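As a starting point for the upsert refactoring discussion, here is a minimal sketch of an upsert keyed on a normalised GMC number, with explicit null and empty checks. It is not the integration service's actual logic: the field names `gmcNumber` and `tisPersonId`, the in-memory dict standing in for the collection, and the rule of keeping an existing TIS ID when the incoming one is blank are all assumptions for illustration. Transactions are not covered.

```python
from typing import Dict, Optional


def normalise_id(value: Optional[str]) -> Optional[str]:
    """Return a trimmed ID, or None for null, empty, or whitespace-only input."""
    if value is None:
        return None
    value = value.strip()
    return value or None


def upsert_doctor(store: Dict[str, dict], incoming: dict) -> str:
    """Insert or update one doctor; returns 'inserted', 'updated' or 'skipped'."""
    gmc = normalise_id(incoming.get("gmcNumber"))
    if gmc is None:
        # Without a usable GMC number there is no key to match on.
        return "skipped"

    existing = store.get(gmc)
    record = {**(existing or {}), **incoming, "gmcNumber": gmc}

    tis_id = normalise_id(incoming.get("tisPersonId"))
    if tis_id is None and existing is not None:
        # Assumed rule: a blank incoming TIS ID must not wipe a known one.
        record["tisPersonId"] = existing.get("tisPersonId")
    else:
        record["tisPersonId"] = tis_id

    store[gmc] = record
    return "inserted" if existing is None else "updated"


store: Dict[str, dict] = {}
print(upsert_doctor(store, {"gmcNumber": " 1234567 ", "tisPersonId": "42"}))
print(upsert_doctor(store, {"gmcNumber": "1234567", "tisPersonId": ""}))
print(len(store))
```

Normalising the key before matching is what prevents the same doctor arriving as `" 1234567 "` and `"1234567"` from producing two records, which is one plausible route to the duplicates noted under "Current issue".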
Related content
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213