2023-01 Elastic Search Index(es) Quantity and Shape - TIS21-3774
Aims
It’s worth remembering what it is we are trying to achieve here.
For each of Revalidation Recommendations and Connections (and presumably in future – Concerns) we need to provide users with a relevant, fast-loading, accurate and filterable list of doctors.
This list of doctors needs to combine data from GMC, TIS and any other fields used by the service in question (e.g. recommendation workflow status for Recommendations)
We have already made the decision to use Elasticsearch to facilitate this. We use it almost like a cache – we load all of the data from all three sources ahead of time and keep it updated, but it is not used as a “source of truth”. The main benefit of Elasticsearch is that it is able to perform complicated queries ludicrously fast. We were able to cut the loading time of the “under notice” page from >2 minutes to 750 milliseconds compared to traditional database interaction and aggregation!
This discussion is fundamentally about looking at how we have structured the indexes (roughly equivalent to tables in a database) for Revalidation, and whether or not it is fit for purpose going forwards.
What do we have now?
For both Recommendations and Connections we have a list of doctors with fields aggregated from TIS, GMC and our Revalidation applications:
This data in this list is stored in Elasticsearch. For Recommendations, everything is run from a single index - recommendationsIndex (Note that there are two tabs available, “All Doctors” and “Under Notice”).
For connections this is slightly different.
Each tab has it’s own dedicated index (remember this is equivalent to having a separate table per tab in a MySQL database, for example) - in future the will be at least one additional tab. The indexes we currently have for connections are connectedindex, exceptionsindex and disconnectedindex.
We also have a fifth index that is manged by the Integration service. This service runs “behind the scenes” to manage all the incoming data and requests for Revalidation and ultimately routes everything to the correct destination (e.g. Connections). This fifth index is the masterdoctorindex and is currently used as the first step in the data aggregation process - this index is supposed to be the source for all other indexes, which are intended as subsets - though this isn’t really the case at the moment. More on this later.
All Revalidation data operations that need to be reflected in these lists first get applied to the masterdoctorindex - then any changes are propagated to the other indexes managed by the other services.
In the current implementation all the data in each of the indexes is effectively the same subset of fields (one or two exceptions) - meaning there is a lot of duplication.
There are three broad categories of data operation that affect the content of these indexes:
The ES Resync Job - This is a seldom used but very important operation in which clears out all of the Elasticsearch indexes and re-fetches all of the data from TIS and the GMC to re-populate Elasticsearch. Typical use cases for this are first-time data loading, recovery from catastrophic pipeline failure, mass population of a new field, etc. In terms of the current connections work, we need this working as all the data in connections is stale and needs re-populating.
Change Data Capture (CDC) - Great so we have a way of populating Elasticsearch, how do we keep it up to date? We listen for changes in TIS (using event listeners), GMC (using cron jobs) and our own Revalidation Applications (using Database changelogs) and send messages to the Integration service. The Integration service updates the masterdoctorindex, and these updates are propagated (copied) to the other indexes managed by the different services.
GET Requests - The data is in Elasticsearch and up to date! How do I see it? Load the web page! When you go on revalidation.tis.nhs.uk/recommendations or revalidation.tis.nhs.uk/connections the application sends a GET request to the relevant service which queries Elasticsearch to get the correct data. Elasticsearch is fast. In recommendations a fairly complex query with wildcard matching (pseudocode shown below) run against over 300,000 doctors takes between 500 and 750 milliseconds to complete!
underNotice = "Yes" AND designatedBody matches any from parameter 1 AND existsInGmc = true AND ( doctorFirstName = parameter 0 (including partial) OR doctorLastName = parameter 0 (including partial) OR gmcNumber = parameter 0 (including partial) OR programmeName = parameter 0 (including partial) OR return everything (i.e if no parameter provided) )
So what’s the problem?
The process we have for the ES Resync Job was incredibly slow - taking over 6 hours to complete. As the list page of the revalidation application is unavailable during this sync process, this means that running this on prod would result in a whole day’s worth of “downtime” (admins do most of their work based on this list)!
We’ve managed to solve the biggest bottleneck - reducing the overall time to just over 2 hours - by replacing the final stage of this process with a “reindex” operation. This “reindex” is essentially a “bulk copy”. Previously we were processing each of the several hundred thousand doctors one-by-one.
This has been implemented for the recommendation service, we now need to do the same for connections.
The problem is the structure of the Connection service’s indexes, and the fact that each of these indexes requires business logic to be applied at the point where we insert the data. For Recommendations the reindex is very straightforward - just a 1:1 copy of the contents of masterdoctorindex with some mapping configuration (terminology note: “mapping” in this case is more like metadata attached to specific fields that enable certain search features in Elasticsearch, not mapping as you might think - though it can also be like that . . . anyway . . .).
As mentioned previously, each of tabs on the Connections list page loads its data from a separate index, and each of these indexes is a subset of masterdoctorindex, sorted according to business logic. The original design was born out of the fact that at the time there was very little knowledge about how to write Elasticsearch queries within the team. To make the GET queries simpler we implemented the business logic about which tab each doctor should appear in “ahead of time”, i.e. we calculate which doctor should be shown in which tab during the CDC or Resync operations, not when we fetch the data. This has worked fine, but is roughly the equivalent of making 3 tables with identical schemas in a MySQL database in order avoid writing a WHERE clause.
The approach of “pre-sorting” the data was also fine before as the exact same code was used for CDC and the ES Resync job. However, in order to repeat the massive time saving we achieved in https://hee-tis.atlassian.net/browse/TIS21-3416 for the Connections service, we have to use the Elasticsearch “reindex” operation, which means we would have to duplicate the logic we have written in Java now in ES query language as part of the reindex request - and then maintain both separately.
Summary
Having multiple indexes makes GET requests simpler
performance has been raised as a potential benefit, but when more complex queries on large data sets take less than a second it’s questionable how much benefit this would really give.
It’s what we’ve got already
Multiple indexes means duplicating data
Multiple indexes makes requires multiple updates for a single data change
Because we have separate CDC and Resync processes, and because the Java approach is prohibitively slow for the Resync process, we would have to write and maintain the business logic in separate places in separate languages
Every time we make a change to the business logic, we would have to do a full resync!
Tasks to complete TIS21-3774 with this approach
Copy implementation for reindexing used by recommendations
Implement elasticsearch query to match java logic in CDC process for use as part of reindex api call
Clearly document that any changes made to the CDC logic in Connections must be repeated in Integration (is there something fancy we can do in github?)
BONUS: figure out if we can run the reindex for connections and recommendation separately (might be impractical)?
Interlude - Schema Design
The following table shows the fields currently stored in Elasticsearch, and their use within the two services we’re currently concerned with (no pun intended).
Note: The Connections fields are based on a cursory glance at the code in its current state, and may not be accurate, but will at least illustrate that not all fields are needed in each application, though the current implementation essentially provides all fields for all services.
masterdoctorindex Fields | Required by Recommendations | Required by Connections |
---|---|---|
id |
|
|
tcsPersonId |
|
|
gmcReferenceNumber |
|
|
doctorFirstName |
|
|
doctorLastName |
|
|
submissionDate |
|
|
ProgrammeName |
|
|
membershipType |
|
|
designatedBody |
|
|
gmcStatus |
|
|
tisStatus |
|
|
admin |
|
|
lastupdatedDate |
|
|
underNotice |
|
|
tcsDesignatedBody |
|
|
programmeOwner |
|
|
curriculumEndDate |
|
|
connectionStatus |
|
|
membershipStartDate |
|
|
membershipEndDate |
|
|
existsInGmc |
|
|
exceptionReason* *this field is currently in the code in connections but doesn’t exist in masterdoctorindex, appears to have been overlooked |
|
|
As we can see, both services share a lot of fields, so this could be motivation for either:
Take the redesign even further, having both just use a single shared index
Using an Ingest pipeline to only copy across relevant fields to the relevant service
Other things to note:
There’s no “isException” flag. Whether or not something appears in discrepanceiss is calculated by the values of certain fields, e.g. membershipType = “visitor” or programmeMembershipEndDate > now
Why do we even bother with a masterdoctorindex? - As both services generally need the same data, a shared main schema allows us to use exactly the same pipelines for getting GMC and TIS data loaded and aggregated into the revalidation system
Alternative Approach 1 - Single Connection index
For this approach, in the ES Resync Job we remove the “pre-sorting” step featured in the current approach and just perform a straightforward 1:1 copy (excluding schema discussion for now) from masterdoctorindex to a single connectionIndex, in the same way we do things for recommendationindex,
We also remove the “pre-sorting” from the CDC process and once again do the same as in recommendations - a simple 1:1 update.
Instead of “pre-sorting”, we now implemented the business logic as part of an Elasticsearch query on the GET side of things.
This should be a fairly straightforward conversion, for example where we currently pre-sort with Java if(<conditions for discrepancy>)
we would instead GET with a Where <fieldValue> = <condition for discrepancy>
in Elasticsearch.
Summary
Generally simplifies the system architecture
This approach means only need to implement the business logic in one place in one language
We can make changes to the business logic without having to resync all the data!
A single index means we’re not duplicating data unnecessarily and simplifies the update process
Removing the “pre-sort” step greatly simplifies the CDC and Resync processes and makes it more consistent with how we do Recommendations
GET requests become more complicated than in the current approach
Although no more complicated than what we have on Recommendations, and implementing filters becomes more consistent and straightforward
Is there a business logic case we couldn’t replicate using a query language as opposed to Java?
Tasks to complete TIS21-3774 with this approach
Copy implementation for reindexing used by recommendations
Replace Connected, Disconnected and Exception repositories with single Connection repositiory
Delete code for pre-sorting the doctors (ElasticsearchIndexUpdateHelper)
Have CDC listener update new index directly (see step 3)
Write ES query to GET Connected Doctors (based on currently implemented logic only)
Write ES query to GET Discrepancies (based on currently implemented logic only)
Alternative Approach 2 - While we’re at it, single Reval index?
Summary
(all the advantages and disadvantages of the single connection index approach, and:)
Massively simplifies the system architecture
A single index means we’re not duplicating data unnecessarily and simplifies the update process
Doesn’t save any significant impact on the speed of the sync process (reindex is really quick!)
Awkward request design, either having to implement API call methods for different services in the Integration service, or having to make extra “back and forth” requests between services - less “separation of concerns”?
Less flexible if we have different filtering requirements for the same fields in different services (when calling reindex, we can specify field mapping metadata that enables different search behaviour e.g. wildcard)
Tasks to complete TIS21-3774 with this approach
Remove reindex steps completely
Replace all ES implementation in Connections and Recommendations with a new implementation ported to Integration
Delete code for the above from affected service
Write all new ES queries for each requirement of each service
Related pages
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213