2023-02-03 Elasticsearch Indexes and Structure Revisited - Proposal 1
Previous Discussions
https://hee-tis.atlassian.net/wiki/spaces/NTCS/pages/3696689344
https://hee-tis.atlassian.net/wiki/spaces/NTCS/pages/3702784003
Summary of findings
There is an existing Sync Process:
The existing Elasticsearch resync job approach for connections is impractical as it takes over ~6 hours (unknown what total time actually is) to complete.
There are two stages to the job:
Data is synchronised and aggregated from GMC and TIS (approx 1.5-2 hours)
Aggregated data is then sorted into Recommendation and Connection indexes (unknown duration but last attempt aborted at over ~4 hours)
We solved the speed issue in b. with an reindex operation for the Recommendation index - this takes minutes, at most. This was possible in Recommendations as it’s just a straight 1:1 copy of the data from masterdoctorindex to recommendationindex
We cannot do this for the existing Connection indexes, as is it not a 1:1 copy operation. There are currently 3 connections indexes that contain subsets of masterdoctorindex according to business logic.
Queries that need to be supported:
That business logic was not well understood in terms of actual code implementation, so we were unable to agree on an architecture solution for connections.
Here is the resultant investigation
Data Model:
The current implementation creates a number of additional fields (Are these actually necessary for the queries that need to be supported)
exceptionReason (not sure about this one, feel like there’s a better implementation)
membershipType for inactive doctors
Funny TIS Ids and GMC Ids coming from TCS
Proposal 1 - Single Index - some ahead-of-time field calculations
This is a hybrid approach of approach 1 and 3 from the initial discussion.
From our initial discussion, it’s clear that the original bottleneck for this job was the “ahead of time updates” to sort things into the right index.
From our initial discussion, it’s also not necessarily clear why recommendations and connections need their own indexes if we’re not sub-setting the data or fields.
From the investigation, it’s clear that most of the logic can be done at “query” time.
Given an elasticsearch index of all doctors, we can find all doctors with discrepancies (according to the MVP) using the following query:
WHERE designatedBody != tcsDesignatedBody
ORWHERE gmcReferenceNumber == null
ORWHERE tisPersonId == null
ORWhere membershipType = INACTIVE && existsInGmc == true
ORWHERE exceptionReason != null
This could be achieved fairly trivially using an Elasticsearch query instead of sorting ahead of time, except for the queries marked with (these could technically be achieved using Runtime Fields, but we’d have to upgrade our Elasticsearch/Opensearch cluster, which is probably too risky a job and might spin off other problems)
The proposal for a hybrid approach is the following:
Firstly, include the following fields in the data sent to reval services from TIS:
membershipType
We bin-off the separate indexes and use an alias on
masterdoctorindex
for it to be accessed bytis-connection-service
(read only)We create two new flag fields, “
dbcDiscrepancy
" and "gmcNumberDiscrepancy
"When a data comes through from either the ES resync Job or the CDC pipelines, we calculate the new fields using
designatedBody != tcsDesignatedBody
.Otherwise, when clicking the discrepancies tab, we filter doctors into those with discrepancies based on the parameters above. Example Elasticsearch query implementation shown in appendix of this page.
In the Front End, we can direct admins towards the general issue with some simple field-value tests to show the categories defined at the top of this document.
Advantages
Much simpler architecture
Much faster ES-Sync-Job (~2 hours total)
Logic maintained in one place*
Disadvantages
“Architecture Smell?” - some dilution of separation of concerns?
Doing the field calculations feels like it might be “connections work” that would be implemented in Integration service*
*Could be argued both ways, technically all the integration service is doing is checking if two fields are equal and setting a flag - feels quite integrationy
Appendix 1: Elasticsearch query example
"bool": {
"should": [
{ "match" : { "dbcDiscrepancy" : true } },
{"bool": {"must_not": {"exists": {"field": "gmcReferenceNumber"}}}},
{"bool": {"must_not": {"exists": {"field": "tcsPersonId"}}}},
{"bool": {"must" : [{"match" : { "membershipType" : "INACTIVE" }},{"match" : { "existsInGmc" : true }}]}},
{"exists": {"field": "exceptionReason"} }
],
"minimum_should_match": 1
}
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213