Date	18 Feb 2022
Authors	Joseph (Pepe) Kelly, John Simmons (Deactivated), Yafang Deng, Reuben Roberts, Jayanta Saha, Edward Barclay
Status	Resolved
Summary	Person search sync job failed
Impact	Up to 1,000 person records (out of 290,000) were not findable on the Person search page from 01:41 to 03:24

Non-technical Description

We run a number of sync jobs overnight. This one failed (see ‘Impact’, above) because another process was running at the same time and prevented it from completing successfully.

We re-ran the job shortly afterwards and it completed successfully.

We investigated what tripped it up and will work to prevent a recurrence.

Trigger

Garbage collection activity took longer than expected and ate into the sync job schedule.

Detection

monitoring-prod Slack alert.

Resolution

Re-running the job as soon as the failure was noticed.

Timeline

2022-02-18|01:41: “Sync [Person sync job] failed with exception…” message in Slack monitoring-prod channel.
2022-02-18|03:13: Team member restarted the job when they noticed the issue.
2022-02-18|03:24: Rerun job completed successfully.

Root Cause(s)

See TIS21-2697.

Action Items

Action Items	Owner
Refactor the sync job to be robust enough to retry on error. Raise a spike ticket to look at the options? Candidates: a task executor, as we use elsewhere in TIS; REST client retry; a configurable Spring component for retrying method calls (example in Reval, thanks Uzair).	Reuben Roberts
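
To make the retry option in the action item above concrete, here is a minimal sketch of a retry wrapper with a doubling delay between attempts. It is an illustration only, not the TIS or Reval implementation: the class name `RetrySketch`, the method `runWithRetry`, and its parameters are hypothetical, and it uses plain Java rather than any Spring component.

```java
import java.util.concurrent.Callable;

// Hypothetical helper for illustration; not taken from the TIS codebase.
final class RetrySketch {

    private RetrySketch() {
    }

    /**
     * Runs the task and retries on any exception, up to maxAttempts attempts in total.
     * The wait starts at initialDelayMillis and doubles after each failed attempt.
     * If every attempt fails, the exception from the last attempt is rethrown.
     */
    static <T> T runWithRetry(Callable<T> task, int maxAttempts, long initialDelayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = initialDelayMillis;
        Exception lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (InterruptedException e) {
                // Do not retry if the thread was interrupted; restore the flag and stop.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                lastFailure = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delay);
                    delay *= 2;
                }
            }
        }
        throw lastFailure;
    }
}
```

A sync job wrapped this way would have had further attempts after the 01:41 failure instead of waiting for a manual rerun. Whether that would have been enough depends on how long the competing process ran, so the attempt count and delays would need to be chosen with that in mind.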

Lessons Learned

It’s good to retry when you fail!
Even highly available systems have issues.
Task-based components could do with a bit more defensive development (around retries, consider things other than the ‘happy path’).
Our monitoring works nicely (for anyone who’s an insomniac).

2022-02-18 Person Search List failed to refresh