2021-04-28 PersonOwner sync job failed affecting Person Search

Date

Apr 28, 2021

Authors

@Joseph (Pepe) Kelly @Marcello Fabbri (Unlicensed) @Reuben Roberts

Status

Done

Summary

 

https://hee-tis.atlassian.net/browse/TIS21-1522

Also see: 2020-05-04 PersonOwner job failed affecting local office filters

Impact

Users saw an inaccurate list of people in Person Search on Admins-UI.

Non-technical Description

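The PersonOwnerJob, which rebuilds the PersonOwner table, failed part-way through. Person Search showed inaccurate results until that job and the PersonElasticSearchSyncJob had been rerun.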


Trigger

  • Error while rebuilding the PersonOwner table due to a duplicate entry


Detection

  • Slack notification in #monitoring-prod

     

  • A user noticed the problem through the Person Search page while the jobs were being rerun




Resolution

  • Reran PersonOwnerRebuildJob, followed by PersonElasticSearchSyncJob.


Timeline

  • Apr 28, 2021: 01:06 BST - PersonOwnerJob fails

  • Apr 28, 2021: 01:06 BST - Slack notification reporting the failure

  • Apr 28, 2021: 06:15 BST - PersonOwnerJob re-triggered manually

  • Apr 28, 2021: 06:20 BST - PersonOwnerJob finished successfully

  • Apr 28, 2021: 06:58 BST - PersonElasticSearchSyncJob re-triggered

  • Apr 28, 2021: 07:09 BST - PersonElasticSearchSyncJob finished successfully

Root Cause(s)

  • Job failed due to a duplicate entry (60021) for key ‘person_owner_pk’
    Log from TCS:
    2021-04-28 00:06:50.446  WARN 1 --- [onPool-worker-2] o.h.engine.jdbc.spi.SqlExceptionHelper   : SQL Error: 1062, SQLState: 23000
    2021-04-28 00:06:50.447 ERROR 1 --- [onPool-worker-2] o.h.engine.jdbc.spi.SqlExceptionHelper   : Duplicate entry '60021' for key 'person_owner_pk'
    2021-04-28 00:06:50.453 ERROR 1 --- [onPool-worker-2] u.n.tis.sync.job.PersonOwnerRebuildJob   : Error calling CallableStatement.getMoreResults; SQL [build_person_localoffice]; constraint [person_owner_pk]; nested exception is org.hibernate.exception.ConstraintViolationException: Error calling CallableStatement.getMoreResults

  • The rebuild procedure (build_person_localoffice in the log above) does not prevent users from interacting with the table while it is being dropped and repopulated; it simply assumes this will not happen.

  • Logs show that a placement was amended while the PersonOwner sync job was running, which caused the duplicate entry in the PersonOwner table.
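
The failure mode above, and one possible way to tolerate it, can be sketched as follows. This is an illustration only, not the actual build_person_localoffice procedure: it uses Python with an in-memory SQLite database in place of MySQL, and the two-column schema, the function names and the sample rows are all hypothetical. In MySQL a comparable effect could be had with INSERT ... ON DUPLICATE KEY UPDATE or REPLACE INTO. Whether the later row should win is a business decision that this sketch simply assumes.

```python
import sqlite3

# Hypothetical, simplified schema: the real PersonOwner table has more columns.
SCHEMA = "CREATE TABLE person_owner (id INTEGER PRIMARY KEY, owner TEXT NOT NULL)"


def naive_rebuild(conn, rows):
    """Plain INSERT: a repeated id aborts the whole rebuild."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM person_owner")
        conn.executemany(
            "INSERT INTO person_owner (id, owner) VALUES (?, ?)", rows
        )


def tolerant_rebuild(conn, rows):
    """Upsert: when an id repeats, the later row replaces the earlier one."""
    with conn:
        conn.execute("DELETE FROM person_owner")
        conn.executemany(
            "INSERT OR REPLACE INTO person_owner (id, owner) VALUES (?, ?)", rows
        )


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    # Person 60021 appears twice, as if a placement was amended mid-rebuild.
    rows = [(60020, "Office A"), (60021, "Office A"), (60021, "Office B")]
    try:
        naive_rebuild(conn, rows)
        naive_failed = False
    except sqlite3.IntegrityError:
        naive_failed = True
    tolerant_rebuild(conn, rows)
    result = conn.execute(
        "SELECT id, owner FROM person_owner ORDER BY id"
    ).fetchall()
    conn.close()
    return naive_failed, result
```

Calling demo() shows the plain insert failing on the repeated key while the upsert version completes. Running the delete and the inserts in a single transaction also means a failed rebuild rolls back rather than leaving the table half-populated.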


Action Items

  • Make the PersonOwner rebuild process more robust against possible duplicate keys, e.g. https://hee-tis.atlassian.net/browse/TIS21-1532 (similar to https://hee-tis.atlassian.net/browse/TISNEW-4459)

  • Review other sync jobs to see if they might face a similar issue (they may have their own primary key e.g. PersonTrust, and so not have the same impact, but may have the same underlying assumptions).

  • https://hee-tis.atlassian.net/browse/TISNEW-3124


Lessons Learned

  • Be aware of risks of jobs that delete and attempt to recreate data.

  • Be aware of dependencies between sync jobs: jobs other than the one that failed may also need to be re-run.
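
The dependency point can be made concrete with a small sketch that works out which jobs to rerun, and in what order, after a failure. The dependency map below is an assumption drawn only from this incident (PersonElasticSearchSyncJob was rerun after PersonOwnerRebuildJob); the real sync service may not record dependencies this way, and rerun_order is a hypothetical helper.

```python
# Hypothetical dependency map: each job lists the jobs that consume its output.
DOWNSTREAM = {
    "PersonOwnerRebuildJob": ["PersonElasticSearchSyncJob"],
    "PersonElasticSearchSyncJob": [],
}


def rerun_order(failed_job, downstream=DOWNSTREAM):
    """Return the failed job plus every job downstream of it, upstream first."""
    order = []
    seen = set()

    def visit(job):
        if job in seen:
            return
        seen.add(job)
        for child in downstream.get(job, []):
            visit(child)
        order.append(job)  # appended only after all of its dependants

    visit(failed_job)
    return order[::-1]  # reversed post-order is a topological order
```

For this incident, rerun_order("PersonOwnerRebuildJob") gives the PersonOwner rebuild first and the Elasticsearch sync second, which matches the order used in the resolution.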