2024-01-08 Trusts unable to find Trainees
Date | Jan 8, 2024 |
Authors | @Joseph (Pepe) Kelly |
Status | Documenting |
Summary | Trusts were unable to find a number of trainees in their search results. We narrowed the problem down to the copy of the information that gets searched (the search index) and reran the job that builds it. |
Impact | It wasn’t immediately obvious that some records were not showing in the person search |
Non-technical Description
Users reported that several trusts were unable to search for trainees. In the example given, the test account (TestTrust.South@gmail.com), associated with University Hospital Southampton (RHM), was unable to find Ramy Samia (GMC 7996933) in the search, but could find his post WES/RHM01/021/F2/003 and reach his placement that way.
Trigger
Detection
User reports in Teams
Resolution
Re-ran person sync job
Timeline
All times in GMT unless indicated
Jan 8, 2024 01: - Other jobs ran for longer than usual and were still running after the start of the Person ?ES? job.
Jan 8, 2024 01:29 - The job ran for 42 minutes; it usually completes in 15-20 minutes.
Jan 8, 2024 12:01 - Message on Teams about Trust users not finding their trainees in the search.
Jan 8, 2024 13:37 - Started debugging and confirming the cause, and that there were no other data-related issues.
Jan 8, 2024 14:15 - Confirmed that other regions were affected. A reindex was scheduled.
Jan 8, 2024 15:45 & 16:00 - Confirmed that records were visible as expected.
While building the timeline, we didn’t identify an earlier occurrence of this defect, so we have not sought to extensively reproduce and remedy this issue.
Root Cause(s)
N.B. We have developed a reasonable but not definitive explanation of what has happened.
Users in more than one region/Local Office couldn’t find trainees they were expecting because the search index was missing some of the records it should have contained. We believe it still held many, if not most, of the trainees they expected.
The ElasticSearch job completed but ran for longer than expected, as did other jobs.
The ElasticSearch job depends on other jobs having run successfully, roughly before it starts.
The ElasticSearch job and other jobs work through pages of ids, so where jobs overlap, partial information can be used instead of complete information.
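The overlap described above can be checked with a small guard before the sync starts. The following is a minimal sketch only: `JobRun`, `overlapping_upstream`, the job names and the times are all hypothetical and are not taken from the TIS codebase or the incident logs. It assumes that a start time and an optional finish time are recorded for each upstream run.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class JobRun:
    name: str
    started: datetime
    finished: Optional[datetime]  # None means the job is still running


def overlapping_upstream(sync_start: datetime, upstream: List[JobRun]) -> List[str]:
    """Return names of upstream jobs that had not finished when the sync started."""
    return [
        run.name
        for run in upstream
        if run.started <= sync_start
        and (run.finished is None or run.finished > sync_start)
    ]


# Illustrative values only; not taken from the incident logs.
runs = [
    JobRun("upstream-a", datetime(2024, 1, 8, 0, 30), datetime(2024, 1, 8, 0, 50)),
    JobRun("upstream-b", datetime(2024, 1, 8, 0, 40), datetime(2024, 1, 8, 1, 10)),
    JobRun("upstream-c", datetime(2024, 1, 8, 0, 45), None),
]
blocking = overlapping_upstream(datetime(2024, 1, 8, 1, 0), runs)
```

If `blocking` is non-empty, the sync could be delayed or flagged rather than reading pages of ids that an upstream job is still updating.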
Action Items
Action Items | Owner | |
---|---|---|
Alert when jobs (or just this job) run outside the “normal”/“expected”/“acceptable” bounds, e.g. | @Joseph (Pepe) Kelly | |
Space jobs out more to allow more time for each to run | | |
We could: rebuild as a batch job but won’t right now as it would be a significant piece of work | | |
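The first action item above (alerting when a job runs outside its expected bounds) could be sketched roughly as follows. This is an illustration only: `duration_alert` and the `person-sync` label are hypothetical, and the bounds simply reuse the "usually 15-20 minutes" figure from the timeline. Real thresholds would need to be agreed.

```python
from datetime import timedelta
from typing import Optional


def duration_alert(
    job_name: str,
    duration: timedelta,
    expected_min: timedelta,
    expected_max: timedelta,
) -> Optional[str]:
    """Return an alert message if the run time is outside the expected bounds, else None."""
    minutes = duration.total_seconds() / 60
    if duration > expected_max:
        return f"{job_name} ran for {minutes:.0f} min, longer than the expected maximum"
    if duration < expected_min:
        return f"{job_name} ran for {minutes:.0f} min, shorter than the expected minimum"
    return None


# Assumed bounds of 15 to 20 minutes, taken from the "usual" run time in the timeline.
lower, upper = timedelta(minutes=15), timedelta(minutes=20)
slow = duration_alert("person-sync", timedelta(minutes=42), lower, upper)
normal = duration_alert("person-sync", timedelta(minutes=18), lower, upper)
```

With these assumed bounds, the 42-minute run from the timeline would produce an alert message, while a run inside the usual range would return `None`. A lower bound is included because an unusually short run may also indicate that records were skipped.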
Lessons Learned
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213