Initial Refinement informing Design work a.k.a. Connections (... and Reval) ~Syncing~ Cache Design

Problems:
Assumption: this is the same work as Recommendations, so we SHOULD do it the same way

  • CDC and the "separate indexes"

  • Level of change: we could do minimal updates to what is implemented now

  • Implementing “logic” in different languages / at different times

    • Principle: Do it once!

    • Logic in writing? "just the mapping"? Mapping is logic... still have 2 processes

    • Logic

  • Filters would need to be duplicated if using different indexes:

    • Implementation-dependent

  • Hypothesis: Performance of querying a single index vs. separate indexes

  • Hypothesis: Performance of writing to several indexes

• Hypothesis: It would all be much simpler if we were copying all programme memberships separately

  • What level of mapping is required between the base (master index) and connections?

    • Next to none... probably not actually necessary

Principles:

  • Write it (code) once - Questions over this. How much does this run counter to Microservice Design?

  • Unnecessary redundancy. Data stored in several places.

    • When should we have copies of the data?:

Solutions:

  • Assumption:

  • Fields available in the Base (Master) Index similar enough to the fields needed across all connections screens.

  • Data Design

    • Steve's query on schema design

    • Specifically, when do we need to persist:

      • connection status field, given we have gmcDesignatedBody

      • Programme Memberships & Whether a discrepancy is hidden

Three ways of caching data, the first might not be viable because of connection specific info:

  • Single Index across REVAL? (Probably not, given the likely need of holding connection specific info?)

  • Single Index for Connections

  • Single Index for each tab

...

The approach of “pre-sorting” the data was also fine before as the exact same code was used for CDC and the ES Resync job. However, in order to repeat the massive time saving we achieved in TIS21-3416 for the Connections service, we have to use the Elasticsearch “reindex” operation, which means we would have to duplicate the logic we have now written in Java in ES query language as part of the reindex request - and then maintain both separately.
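To make the duplication concrete, here is a minimal sketch of what a reindex request body could look like once a pre-sort condition has to be restated in Elasticsearch query DSL. Only `masterdoctorindex` and the field name come from this page; the destination index name `discrepanciesindex`, the condition `connectionStatus = "No"`, and the helper itself are assumptions for illustration, and the `term` filter assumes the field is mapped as a keyword.

```java
// Sketch only: builds the JSON body for an Elasticsearch `POST _reindex` request.
// Index names and the filter condition passed in main() are hypothetical.
final class ReindexRequestSketch {

    /**
     * Copies documents from a source index to a destination index, keeping only
     * those that match a term filter. The filter is where logic currently
     * written in Java would have to be written a second time.
     */
    static String reindexBody(String sourceIndex, String destIndex,
                              String field, String value) {
        return "{"
            + "\"source\":{"
            +   "\"index\":\"" + sourceIndex + "\","
            +   "\"query\":{\"bool\":{\"filter\":[{\"term\":{\"" + field + "\":\"" + value + "\"}}]}}"
            + "},"
            + "\"dest\":{\"index\":\"" + destIndex + "\"}"
            + "}";
    }

    public static void main(String[] args) {
        // Hypothetical example: copy "not connected" doctors into a per-tab index.
        System.out.println(
            reindexBody("masterdoctorindex", "discrepanciesindex", "connectionStatus", "No"));
    }
}
```

Any change to the Java condition would need a matching change to the filter in this body, which is the maintenance cost described above.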

Summary

  • (plus) Having multiple indexes makes GET requests simpler

    • performance has been raised as a potential benefit, but when more complex queries on large data sets take less than a second it’s questionable how much benefit this would really give.

  • (plus) It’s what we’ve got already 🤷‍♀️

  • (minus) Multiple indexes mean duplicating data

  • (minus) Multiple indexes require multiple updates for a single data change

  • (minus) Because we have separate CDC and Resync processes, and because the Java approach is prohibitively slow for the Resync process, we would have to write and maintain the business logic in separate places in separate languages

  • (minus) Every time we make a change to the business logic, we would have to do a full resync!

Tasks to complete TIS21-3774 with this approach

...

masterdoctorindex Fields

Required by Recommendations

Required by Connections

id

tcsPersonId

gmcReferenceNumber

doctorFirstName

doctorLastName

submissionDate

ProgrammeName

membershipType

designatedBody

gmcStatus

tisStatus

admin

lastupdatedDate

underNotice

tcsDesignatedBody

programmeOwner

curriculumEndDate

connectionStatus

membershipStartDate

membershipEndDate

existsInGmc

exceptionReason*

*This field is currently in the Connections code but doesn’t exist in masterdoctorindex; it appears to have been overlooked.

As we can see, both services share a lot of fields, so this could be motivation for either:

...

This should be a fairly straightforward conversion. For example, where we currently pre-sort in Java with if(<conditions for discrepancy>), we would instead GET with a “where <fieldValue> = <condition for discrepancy>” filter in Elasticsearch.
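As a rough sketch of that conversion, the example below puts a Java pre-sort predicate next to an equivalent Elasticsearch query body. The real conditions for a discrepancy are not spelled out on this page, so the rule used here (`existsInGmc` is true and `connectionStatus` is "No") is purely hypothetical, as are the class and method names; the field names are taken from the masterdoctorindex list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: the discrepancy rule below is a made-up stand-in for the real one.
final class DiscrepancyQuerySketch {

    /** Minimal stand-in for an index document, with just the fields used here. */
    static final class Doctor {
        final String gmcReferenceNumber;
        final boolean existsInGmc;
        final String connectionStatus;

        Doctor(String gmcReferenceNumber, boolean existsInGmc, String connectionStatus) {
            this.gmcReferenceNumber = gmcReferenceNumber;
            this.existsInGmc = existsInGmc;
            this.connectionStatus = connectionStatus;
        }
    }

    /** Pre-sort style: the condition is evaluated in Java at write time. */
    static boolean isDiscrepancy(Doctor d) {
        return d.existsInGmc && "No".equals(d.connectionStatus);
    }

    static List<Doctor> preSort(List<Doctor> all) {
        List<Doctor> discrepancies = new ArrayList<>();
        for (Doctor d : all) {
            if (isDiscrepancy(d)) {
                discrepancies.add(d);
            }
        }
        return discrepancies;
    }

    /** Query style: the same condition expressed as a filter at read time. */
    static String discrepancyQuery() {
        return "{\"query\":{\"bool\":{\"filter\":["
            + "{\"term\":{\"existsInGmc\":true}},"
            + "{\"term\":{\"connectionStatus\":\"No\"}}"
            + "]}}}";
    }

    public static void main(String[] args) {
        List<Doctor> all = Arrays.asList(
            new Doctor("1111111", true, "No"),
            new Doctor("2222222", true, "Yes"),
            new Doctor("3333333", false, "No"));
        System.out.println(preSort(all).size() + " discrepancy found by the Java pre-sort");
        System.out.println(discrepancyQuery());
    }
}
```

With the query style, the condition lives only in the GET request, so neither CDC nor the Resync job needs to know about it.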

Summary

  • (plus) Generally simplifies the system architecture

  • (plus) This approach means we only need to implement the business logic in one place, in one language

  • (plus) A single index means we’re not duplicating data unnecessarily and simplifies the update process

  • (plus) Removing the “pre-sort” step greatly simplifies the CDC and Resync processes and makes it more consistent with how we do Recommendations

  • (minus) GET requests become more complicated than in the current approach

    • Although no more complicated than what we have on Recommendations, and implementing filters becomes more consistent and straightforward

  • (minus) Is there a business logic case we couldn’t replicate using a query language as opposed to Java?

Tasks to complete TIS21-3774 with this approach

...

(all the advantages and disadvantages of the single connection index approach, and:)

  • (plus) Massively simplifies the system architecture

  • (plus) A single index means we’re not duplicating data unnecessarily and simplifies the update process

  • (minus) Doesn’t make any significant difference to the speed of the sync process (reindex is really quick!)

  • (minus) Awkward request design, either having to implement API call methods for different services in the Integration service, or having to make extra “back and forth” requests between services - less “separation of concerns”?

  • (minus) Less flexible if we have different filtering requirements for the same fields in different services (when calling reindex, we can specify field mapping metadata that enables different search behaviour e.g. wildcard)
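To illustrate the flexibility point in the last bullet, here is a minimal sketch of a per-service mapping body. It assumes the mapping is applied when the destination index is created, before the reindex runs, and that the cluster version supports the `wildcard` field type. The choice of `doctorLastName` as the field and the helper name are assumptions for illustration.

```java
// Sketch only: builds a create-index body that sets the type of a single field.
final class DestinationMappingSketch {

    /** Returns a mappings body declaring one field with the given type. */
    static String mappingBody(String field, String type) {
        return "{\"mappings\":{\"properties\":{\""
            + field + "\":{\"type\":\"" + type + "\"}}}}";
    }

    public static void main(String[] args) {
        // One service could index the field for wildcard search...
        System.out.println(mappingBody("doctorLastName", "wildcard"));
        // ...while another keeps exact-match behaviour for the same field.
        System.out.println(mappingBody("doctorLastName", "keyword"));
    }
}
```

With a single shared index, both services would have to agree on one mapping per field, or index the field more than once.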

Tasks to complete TIS21-3774 with this approach

...