2020-05-04 PersonOwner job failed affecting local office filters

Date

May 4, 2020

Authors

@Joseph (Pepe) Kelly

Status

 

Summary

PersonOwner job failed overnight.

Impact

Users couldn’t get a list filtered by LocalOffice until 10:30.

 

Root Cause(s)

  • Job failed due to a ‘duplicate key’.

  • Someone updated placement 1819582 for person 112107 at 01:06 am, which (Pepe is speculating) caused the PersonOwner to be recalculated.

  • Re-runs were attempted while other users were logged on, and they hit the same concurrency issue.

  • The sync job assumes that no other updates will be taking place and does nothing to try and lock the table during the update.
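
The root causes above suggest that a plain insert into the table collides with rows written by concurrent updates. A minimal sketch of an idempotent alternative follows, assuming a hypothetical `person_owner` table keyed on `person_id` and using SQLite purely for illustration; the real schema, database and job code are not shown in this report, so every name here is an assumption. The upsert tolerates rows that already exist, and stale rows are deleted in the same transaction, so no manual truncate is needed before a re-run.

```python
import sqlite3

# Hypothetical schema: the real PersonOwner table is not described in this report.
SCHEMA = "CREATE TABLE IF NOT EXISTS person_owner (person_id INTEGER PRIMARY KEY, owner TEXT)"


def sync_person_owner(conn, computed):
    """Bring person_owner in line with `computed`, a {person_id: owner} dict.

    Safe to re-run: existing rows are updated rather than re-inserted,
    so a row written by someone else does not raise a duplicate key error.
    """
    with conn:  # one transaction: commits on success, rolls back on error
        # Upsert (requires SQLite 3.24 or later).
        conn.executemany(
            "INSERT INTO person_owner (person_id, owner) VALUES (?, ?) "
            "ON CONFLICT(person_id) DO UPDATE SET owner = excluded.owner",
            list(computed.items()),
        )
        # Delete rows that are no longer part of the computed result.
        existing = {row[0] for row in conn.execute("SELECT person_id FROM person_owner")}
        stale = existing.difference(computed)
        conn.executemany(
            "DELETE FROM person_owner WHERE person_id = ?",
            [(person_id,) for person_id in stale],
        )
```

Calling `sync_person_owner` twice with the same input leaves the table unchanged the second time, which is the property a failed-and-retried overnight run would rely on. Whether the real job also needs an explicit lock depends on the database and isolation level in use, which this sketch does not address.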

Trigger

  • Failed overnight ETL

Resolution

  • Manually truncated the table and re-ran the job. We were possibly lucky that no one made a relevant update in that time window.

Detection

  • Slack alerting

  • Shortly afterwards, reports from users on Teams.

Action Items

Action Item

Type

Owner

Issue

Rewrite the job:

  • Idempotency

  • Update & Delete stale?

Prevention

 

https://hee-tis.atlassian.net/browse/TISNEW-4459

Improve sync-service alerting and/or resilience

Prevention

 

https://hee-tis.atlassian.net/browse/TISNEW-3124

Timeline

  • 07:55 - Dev team noticed the job had failed and attempted a re-run. It failed again and we began investigating.

  • 08:13 - Chris Nowak - User noticed that TIS was missing data in person search

  • 08:18 - Feedback to users from Simon: "We are looking into it"

  • 08:27 - Firefire channel created by Simon

  • 09:01 - TIS Sync started again

  • 09:02 - Service Status updated by Phil

  • 09:57 - Sync [PersonPlacementEmployingBodyTrustJob] completed successfully.

  • 10:25 - Last job reported as finished in our notifications channel

  • 10:29 - NDW ETL Started again

  • 10:35 - Service Status - changed back to green / all OK