Date | |
Authors | Joseph (Pepe) Kelly, John Simmons, Yafang Deng, Reuben Roberts, Jayanta Saha, Edward Barclay |
Status | Resolved |
Summary | Person search sync job failed |
Impact | Up to 1,000 person records (out of 290,000) were not findable on the Person search page from 01.41 to 03.24 |
Non-technical Description
We run a number of sync jobs overnight. This one failed (see ‘Impact’ above) because another process was running at the same time and prevented it from completing.
...
Garbage collection activity took longer than expected and ate into the sync job schedule.
Detection
monitoring-prod Slack alert.
...
Resolution
Re-ran the job as soon as the failure was noticed.
...
Action Items
Action Items | Owner |
---|---|
|
...
Lessons Learned
It’s good to retry when you fail!
Even highly available systems have issues.
Task-based components could do with a bit more defensive development (for example around retries, and handling cases other than the ‘happy path’).
Our monitoring works nicely (for anyone who’s an insomniac).
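As an illustration of the retry lesson above, here is a minimal sketch of a task wrapper that retries a failed job with an increasing delay. It is not the sync job's actual code: `run_with_retries`, its parameters, and the default delay are hypothetical names and values chosen for the example, and the job is assumed to be safe to run more than once.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(
    job: Callable[[], T],
    max_attempts: int = 3,
    base_delay_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `job`, retrying on any exception with exponential backoff.

    Hypothetical helper: the delay doubles after each failed attempt
    (base, 2 * base, 4 * base, ...). The last failure is re-raised so
    that monitoring still sees it.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception:
            if attempt == max_attempts:
                # Out of attempts: surface the original error.
                raise
            # Wait before the next attempt, giving a competing
            # process (e.g. a long-running cleanup) time to finish.
            sleep(base_delay_seconds * 2 ** (attempt - 1))

    raise AssertionError("unreachable")
```

With a wrapper like this, a job blocked by a process that overruns its window would get another attempt a few minutes later instead of waiting for someone to notice the alert. The `sleep` parameter is injectable only so the behaviour can be tested without real waiting.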