Date

Authors

Joseph (Pepe) Kelly, John Simmons (Deactivated), Yafang Deng, Reuben Roberts, Jayanta Saha, Edward Barclay

Status

Resolved

Summary

Person search sync job failed

Impact

Up to 1,000 person records (out of 290,000) could not be found on the Person search page between 01:41 and 03:24.

Non-technical Description

We run a number of sync jobs overnight. This one failed (see ‘Impact’, above) because another process was running at the same time and prevented it from completing.

We re-ran the job shortly afterwards and it completed successfully.

We investigated what tripped it up and will work to reduce the risk of a recurrence.


Trigger

  • Garbage collection activity took longer than expected and ate into the sync job's scheduled window.

Detection


Resolution

  • The job was re-run as soon as the failure was noticed.


Timeline

  • 2022-02-18|01:41: “Sync [Person sync job] failed with exception…” message posted in the Slack monitoring-prod channel.

  • 2022-02-18|03:13: A team member restarted the job when they noticed the issue.

  • 2022-02-18|03:24: Rerun job completed successfully.


Root Cause(s)


Action Items

  • Refactor the sync job so that it is robust enough to retry on error. Raise a spike ticket to look at the options:
    - a task executor, as we use elsewhere in TIS?
    - REST client retry?
    - a Spring component for retrying method calls (configurable)? There is an example in Reval (thanks Uzair).

    Owner: Reuben Roberts
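
As an illustration of the retry idea in the action item above, here is a minimal sketch in plain Java. It assumes a fixed delay between attempts and a small attempt limit. `RetrySketch`, `withRetry`, and their parameters are hypothetical names for illustration, not existing TIS code, and the simulated failure in `main` stands in for whatever exception the real sync job raises. The spike may well conclude that an existing library (for example Spring Retry) is a better fit than a hand-rolled loop.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper for illustration only; not part of TIS. */
final class RetrySketch {

    private RetrySketch() {}

    /**
     * Runs the task, retrying after a failure until it succeeds or
     * maxAttempts attempts have been made. Waits delayMillis between
     * attempts and rethrows the last failure if every attempt fails.
     */
    static <T> T withRetry(Callable<T> task, int maxAttempts, long delayMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (InterruptedException e) {
                // Do not retry if the thread was asked to stop.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMillis);
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Assumption: a stand-in job that fails twice, then succeeds.
        String outcome = withRetry(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("simulated transient failure");
            }
            return "synced";
        }, 3, 10L);
        System.out.println(outcome + " after " + calls[0] + " attempts");
    }
}
```

A fixed delay keeps the sketch short. If the blocking process (such as a long garbage collection run) can last several minutes, a longer or increasing delay between attempts would probably be needed for a retry to help.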


Lessons Learned

  • It's good to retry when you fail!

  • Even highly available systems have issues.

  • Task-based components could do with a bit more defensive development (around retries, consider things other than the ‘happy path’).

  • Our monitoring works nicely (for anyone who’s an insomniac).
