Date	Feb 18, 2022
Authors	@Joseph (Pepe) Kelly @John Simmons (Deactivated) @Yafang Deng @Reuben Roberts @Jayanta Saha @Edward Barclay
Status	Resolved
Summary	Person search sync job failed
Impact	Up to 1,000 person records (out of 290,000) could not be found on the Person search page from 01:41 to 03:24

Non-technical Description

We run a number of sync jobs overnight. This one failed (see ‘Impact’, above) because another process was running at the same time and prevented it from completing.

We re-ran the job shortly afterwards and it completed successfully.

We investigated what tripped it up and will work to reduce the risk of a recurrence.

Trigger

Garbage collection activity took longer than expected and ran into the sync job's scheduled window.

Detection

Slack alert in the monitoring-prod channel.

Resolution

The job was re-run as soon as the failure was noticed.

Timeline

2022-02-18|01:41: “Sync [Person sync job] failed with exception…” message in Slack monitoring-prod channel.
2022-02-18|03:13: A team member noticed the issue and restarted the job.
2022-02-18|03:24: Rerun job completed successfully.

Root Cause(s)

See TIS21-2697.

Action Items

Action Items	Owner

Refactor the sync job so that it retries on error. Possibly a spike ticket to look at the options: a task executor, as we use elsewhere in TIS; REST client retry; or a configurable Spring component for retrying method calls (example in Reval, thanks Uzair).	@Reuben Roberts https://hee-tis.atlassian.net/browse/TIS21-2698
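
To illustrate the retry idea in the action item above, here is a minimal sketch in plain Java. It is not the TIS implementation and does not represent any of the listed options specifically: the class `RetryingRunner`, the method `runWithRetry` and its parameters are hypothetical names chosen for this example. It assumes that a failed attempt surfaces as an exception and that waiting a fixed pause before trying again is acceptable.

```java
import java.time.Duration;
import java.util.concurrent.Callable;

/** Hypothetical helper (not part of TIS): retries a task a fixed number of times. */
public final class RetryingRunner {

    private RetryingRunner() {
    }

    /**
     * Runs the task and returns its result. If the task throws, waits for
     * the given pause and tries again, up to maxAttempts attempts in total.
     * If every attempt fails, the exception from the last attempt is rethrown.
     */
    public static <T> T runWithRetry(Callable<T> task, int maxAttempts, Duration pause)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                lastFailure = e;
                // Only pause if there is another attempt still to come.
                if (attempt < maxAttempts) {
                    Thread.sleep(pause.toMillis());
                }
            }
        }
        throw lastFailure;
    }
}
```

A sync job could then be wrapped as, for example, `RetryingRunner.runWithRetry(() -> personSync.run(), 3, Duration.ofMinutes(5))`, where `personSync.run()` is a placeholder for whatever the job's entry point actually is. The attempt count and pause would need to be tuned so that a slow overnight process has time to finish before the next attempt.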

Lessons Learned

It's good to retry when you fail!
Even highly available systems have issues.
Task-based components could do with a bit more defensive development (around retries, consider things other than the ‘happy path’).
Our monitoring works nicely (for anyone who’s an insomniac).
