Date

Authors

Joseph (Pepe) Kelly, John Simmons, Yafang Deng, Reuben Roberts, Jayanta Saha, Edward Barclay

Status

Resolved

Summary

Person search sync job failed

Impact

Up to 1,000 person records (out of 290,000) were not findable on the Person search page between 01:41 and 03:24.

Non-technical Description

We run a number of sync jobs overnight. This one failed (see 'Impact', above) because another process was running at the same time and prevented it from completing successfully.

...

  • garbage collection activity taking longer than expected and eating into the sync job schedule.

Detection

...

Resolution

  • re-running the job as soon as it was noticed.

...

Action Items

Action Items

Owner

  • Refactor the sync job to be robust enough to retry on error. Raise a spike ticket to look at the options:
    - a task executor, as we use elsewhere in TIS?
    - REST client retry?
    - a Spring component for retrying method calls (configurable), as in the example in Reval (thanks Uzair)?

Reuben Roberts

Jira: TIS21-2698
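
As a starting point for the spike, the retry-on-error idea in the action item could be sketched roughly as follows. This is plain Java with no framework dependency, not the Spring component used in Reval and not existing TIS code; the names `RetrySketch` and `withRetry`, and the choice to double the delay after each failure, are assumptions made purely for illustration.

```java
import java.util.concurrent.Callable;

// Hypothetical helper: not part of TIS, shown only to illustrate the retry idea.
final class RetrySketch {

    private RetrySketch() {
    }

    /**
     * Runs the task and returns its result. If the task throws, waits and tries
     * again, up to maxAttempts attempts in total. The wait starts at delayMillis
     * and doubles after each failed attempt. If every attempt fails, the last
     * exception is rethrown so the failure is still visible to monitoring.
     */
    static <T> T withRetry(Callable<T> task, int maxAttempts, long delayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = delayMillis;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e;
                // Only wait if another attempt is still to come.
                if (attempt < maxAttempts) {
                    Thread.sleep(delay);
                    delay *= 2;
                }
            }
        }
        throw last;
    }
}
```

A sync job could then wrap its main step, for example `RetrySketch.withRetry(() -> runPersonSync(), 3, 60_000)`, where `runPersonSync` is a placeholder for whatever the job actually calls. Whether a hand-rolled helper, a task executor or a configurable Spring component is the better fit is exactly what the spike would need to decide.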

...

Lessons Learned

  • It's good to retry when you fail!

  • Even highly available systems have issues.

  • Task-based components could do with a bit more defensive development (around retries, consider things other than the ‘happy path’).

  • Our monitoring works nicely (for anyone who’s an insomniac).