2023-02-03 Elasticsearch Indexes and Structure Revisited - Proposal 1

Previous Discussions

https://hee-tis.atlassian.net/wiki/spaces/NTCS/pages/3696689344

https://hee-tis.atlassian.net/wiki/spaces/NTCS/pages/3702784003


Summary of findings

  • There is an existing Sync Process:

    • The existing Elasticsearch resync job approach for connections is impractical as it takes over ~6 hours (unknown what total time actually is) to complete.

    • There are two stages to the job:

      1. Data is synchronised and aggregated from GMC and TIS (approx 1.5-2 hours)

      2. Aggregated data is then sorted into Recommendation and Connection indexes (unknown duration but last attempt aborted at over ~4 hours)

    • We solved the speed issue in b. with an reindex operation for the Recommendation index - this takes minutes, at most. This was possible in Recommendations as it’s just a straight 1:1 copy of the data from masterdoctorindex to recommendationindex

    • We cannot do this for the existing Connection indexes, as is it not a 1:1 copy operation. There are currently 3 connections indexes that contain subsets of masterdoctorindex according to business logic.

  • Queries that need to be supported:

  • Data Model:

    • The current implementation creates a number of additional fields (Are these actually necessary for the queries that need to be supported)

      • exceptionReason (not sure about this one, feel like there’s a better implementation)

      • membershipType for inactive doctors

      • Funny TIS Ids and GMC Ids coming from TCS

Proposal 1 - Single Index - some ahead-of-time field calculations

This is a hybrid approach of approach 1 and 3 from the initial discussion.

From our initial discussion, it’s clear that the original bottleneck for this job was the “ahead of time updates” to sort things into the right index.

From our initial discussion, it’s also not necessarily clear why recommendations and connections need their own indexes if we’re not sub-setting the data or fields.

From the investigation, it’s clear that most of the logic can be done at “query” time.

 

Given an elasticsearch index of all doctors, we can find all doctors with discrepancies (according to the MVP) using the following query:

WHERE designatedBody != tcsDesignatedBody
OR
WHERE gmcReferenceNumber == null
OR
WHERE tisPersonId == null
OR
Where membershipType = INACTIVE && existsInGmc == true
OR
WHERE exceptionReason != null

This could be achieved fairly trivially using an Elasticsearch query instead of sorting ahead of time, except for the queries marked with (these could technically be achieved using Runtime Fields, but we’d have to upgrade our Elasticsearch/Opensearch cluster, which is probably too risky a job and might spin off other problems)

The proposal for a hybrid approach is the following:

  1. Firstly, include the following fields in the data sent to reval services from TIS: membershipType

  2. We bin-off the separate indexes and use an alias on masterdoctorindex for it to be accessed by tis-connection-service (read only)

  3. We create two new flag fields, “dbcDiscrepancy" and "gmcNumberDiscrepancy"

  4. When a data comes through from either the ES resync Job or the CDC pipelines, we calculate the new fields using designatedBody != tcsDesignatedBody.

  5. Otherwise, when clicking the discrepancies tab, we filter doctors into those with discrepancies based on the parameters above. Example Elasticsearch query implementation shown in appendix of this page.

  6. In the Front End, we can direct admins towards the general issue with some simple field-value tests to show the categories defined at the top of this document.

 

 

Advantages

  • Much simpler architecture

  • Much faster ES-Sync-Job (~2 hours total)

  • Logic maintained in one place*

Disadvantages

  • “Architecture Smell?” - some dilution of separation of concerns?

  • Doing the field calculations feels like it might be “connections work” that would be implemented in Integration service*

    • *Could be argued both ways, technically all the integration service is doing is checking if two fields are equal and setting a flag - feels quite integrationy

 

Appendix 1: Elasticsearch query example

"bool": { "should": [ { "match" : { "dbcDiscrepancy" : true } }, {"bool": {"must_not": {"exists": {"field": "gmcReferenceNumber"}}}}, {"bool": {"must_not": {"exists": {"field": "tcsPersonId"}}}}, {"bool": {"must" : [{"match" : { "membershipType" : "INACTIVE" }},{"match" : { "existsInGmc" : true }}]}}, {"exists": {"field": "exceptionReason"} } ], "minimum_should_match": 1 }