2021-04-28 PersonOwner sync job failed affecting Person Search
Date | Apr 28, 2021 |
Authors | @Joseph (Pepe) Kelly @Marcello Fabbri @Reuben Roberts |
Status | Done |
Summary | https://hee-tis.atlassian.net/browse/TIS21-1522 Also see: 2020-05-04 PersonOwner job failed affecting local office filters |
Impact | Users had an inaccurate list of People on Admins-UI |
Non-technical Description
The PersonOwnerJob failed to run successfully.
Trigger
Error while rebuilding the PersonOwner table due to a duplicate entry
Detection
Slack notification in #monitoring-prod
A user noticed the problem on the Person Search page while the jobs were being rerun
Resolution
Rerun of PersonOwnerRebuildJob and PersonElasticSearchSyncJob.
Timeline
Apr 28, 2021: 01:06 BST - PersonOwnerJob fails
Apr 28, 2021: 01:06 BST - Slack notification reporting the failure
Apr 28, 2021: 06:15 BST - PersonOwnerJob re-triggered manually
Apr 28, 2021: 06:20 BST - PersonOwnerJob finished successfully
Apr 28, 2021: 06:58 BST - PersonElasticSearchSyncJob re-triggered
Apr 28, 2021: 07:09 BST - PersonElasticSearchSyncJob finished successfully
Root Cause(s)
Job failed due to a duplicate entry (60021) for key ‘person_owner_pk’
Log from TCS:
2021-04-28 00:06:50.446 WARN 1 --- [onPool-worker-2] o.h.engine.jdbc.spi.SqlExceptionHelper : SQL Error: 1062, SQLState: 23000
2021-04-28 00:06:50.447 ERROR 1 --- [onPool-worker-2] o.h.engine.jdbc.spi.SqlExceptionHelper : Duplicate entry '60021' for key 'person_owner_pk'
2021-04-28 00:06:50.453 ERROR 1 --- [onPool-worker-2] u.n.tis.sync.job.PersonOwnerRebuildJob : Error calling CallableStatement.getMoreResults; SQL [build_person_localoffice]; constraint [person_owner_pk]; nested exception is org.hibernate.exception.ConstraintViolationException: Error calling CallableStatement.getMoreResults
The rebuild procedure does not prevent users from interacting with the table while it is being dropped and repopulated; it assumes that this will not happen.
Logs show that a placement was amended while the PersonOwner sync job was running, which caused the duplicate entry in the PersonOwner table.
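The failure mode described above can be reproduced in miniature. The sketch below is an illustration only, not the TIS procedure: it uses Python with an in-memory SQLite database in place of MySQL, a hypothetical two-column person_owner table, and made-up owner values. It also assumes that the concurrent amendment resulted in a row for the same person being written between the clear and repopulate steps; the exact mechanism is not recorded in this report.

```python
import sqlite3


def rebuild_person_owner(conn, source_rows, concurrent_write=None):
    """Clear and repopulate person_owner.

    concurrent_write is a hypothetical stand-in for a user change that
    lands in the table after the clear but before the repopulate.
    """
    conn.execute("DELETE FROM person_owner")
    if concurrent_write is not None:
        conn.execute(
            "INSERT INTO person_owner (id, owner) VALUES (?, ?)",
            concurrent_write,
        )
    # Plain INSERT: any key that already exists violates the primary key.
    conn.executemany(
        "INSERT INTO person_owner (id, owner) VALUES (?, ?)",
        source_rows,
    )


conn = sqlite3.connect(":memory:")
# Hypothetical schema; only the primary key matters for this sketch.
conn.execute("CREATE TABLE person_owner (id INTEGER PRIMARY KEY, owner TEXT)")
rows = [(60020, "Office A"), (60021, "Office B")]  # owner values are made up

# An undisturbed rebuild succeeds.
rebuild_person_owner(conn, rows)

# A rebuild with an interleaved write for id 60021 hits the primary key.
failed = False
try:
    rebuild_person_owner(conn, rows, concurrent_write=(60021, "Office B"))
except sqlite3.IntegrityError as exc:
    failed = True
    print("rebuild failed:", exc)
```

The constraint violation raised here is SQLite's counterpart of the MySQL error 1062 seen in the TCS log.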
Action Items
Action Items | Owner |
---|---|
Make the PersonOwner rebuild process more robust against possible duplicate keys, e.g. https://hee-tis.atlassian.net/browse/TIS21-1532 (similar to https://hee-tis.atlassian.net/browse/TISNEW-4459 ) | |
Review other sync jobs to see if they might face a similar issue (they may have their own primary key e.g. PersonTrust, and so not have the same impact, but may have the same underlying assumptions). | |
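One possible way to approach the first action item is to make the repopulate step tolerate keys that already exist. The following is a sketch under assumptions, not the change tracked in TIS21-1532: it uses Python with in-memory SQLite, a hypothetical person_owner schema and made-up owner values. SQLite's INSERT OR IGNORE stands in for MySQL's INSERT IGNORE or INSERT ... ON DUPLICATE KEY UPDATE.

```python
import sqlite3


def rebuild_person_owner_tolerant(conn, source_rows, concurrent_write=None):
    """Clear and repopulate person_owner, skipping keys that already exist.

    concurrent_write is a hypothetical stand-in for a user change that
    lands in the table after the clear but before the repopulate.
    """
    with conn:  # commit on success, roll back if anything raises
        conn.execute("DELETE FROM person_owner")
        if concurrent_write is not None:
            conn.execute(
                "INSERT INTO person_owner (id, owner) VALUES (?, ?)",
                concurrent_write,
            )
        # OR IGNORE skips rows whose primary key is already present
        # instead of aborting the whole rebuild.
        conn.executemany(
            "INSERT OR IGNORE INTO person_owner (id, owner) VALUES (?, ?)",
            source_rows,
        )


conn = sqlite3.connect(":memory:")
# Hypothetical schema; only the primary key matters for this sketch.
conn.execute("CREATE TABLE person_owner (id INTEGER PRIMARY KEY, owner TEXT)")
rows = [(60020, "Office A"), (60021, "Office B")]  # owner values are made up

rebuild_person_owner_tolerant(conn, rows, concurrent_write=(60021, "Office B"))
final_rows = conn.execute(
    "SELECT id, owner FROM person_owner ORDER BY id"
).fetchall()
print(final_rows)
```

With an ignore-style insert, whichever row arrives first is kept, so a concurrent write takes precedence over the rebuilt value. An upsert would let the rebuilt value win instead. Which behaviour is right for PersonOwner is a design decision this sketch does not settle.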
Lessons Learned
Be aware of the risks of jobs that delete and attempt to recreate data.
Be aware of dependencies between sync jobs: when one job fails, other jobs that depend on it may also need to be re-run.
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213