Root Cause(s)

  • The job failed with a ‘duplicate key’ error.

  • Someone updated placement 1819582 for person 112107 at 01:06 am, which (Pepe speculates) caused the PersonOwner to be recalculated.

  • Re-runs were attempted while other users were logged on, and they hit the same concurrency issue.

  • The sync job assumes that no other updates will take place while it runs and makes no attempt to lock the table during the update.
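
The suspected race described above can be reproduced in miniature. This is a minimal sketch only, not the actual sync job: it assumes the job empties the table and then re-inserts every row without locking, and it uses an in-memory SQLite database with a hypothetical `person_owner` table. The `user_update` function stands in for the speculated PersonOwner recalculation; all names here are illustrative.

```python
import sqlite3

INSERT_SQL = "INSERT INTO person_owner (person_id, owner) VALUES (?, ?)"

def naive_rebuild(conn, rows, concurrent_write=None):
    """Truncate-then-insert rebuild with no locking (hypothetical model of the job)."""
    conn.execute("DELETE FROM person_owner")
    conn.commit()  # the empty table is now visible to every other writer
    if concurrent_write is not None:
        concurrent_write(conn)  # simulates a write landing inside the unprotected window
    try:
        conn.executemany(INSERT_SQL, rows)
        conn.commit()
        return "ok"
    except sqlite3.IntegrityError:
        conn.rollback()
        return "duplicate key"

def user_update(conn):
    """Stand-in for a user edit that causes one row to be recalculated and written."""
    conn.execute(INSERT_SQL, (112107, "recalculated"))
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person_owner (person_id INTEGER PRIMARY KEY, owner TEXT)")
rows = [(112107, "owner-a"), (112108, "owner-b")]

quiet_result = naive_rebuild(conn, rows)                                # no other writers
raced_result = naive_rebuild(conn, rows, concurrent_write=user_update)  # write mid-job
remaining = conn.execute("SELECT COUNT(*) FROM person_owner").fetchone()[0]
```

In this model the quiet run succeeds, while the raced run fails on the primary key and leaves the table holding only the concurrently written row. That would be consistent with a re-run failing for as long as users are active, though it does not confirm the real root cause.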

Trigger

  • Failed overnight ETL

Resolution

  • Manually truncated the table and re-ran the job. We were possibly lucky that no one made a relevant update during that window.

Detection

  • Slack alerting

  • Shortly afterwards, users on Teams.

Action Items

Action Item | Type | Owner | Issue
Rewrite the job: idempotency; update & delete stale? | Prevention | |
Improve sync-service alerting and/or resilience | Prevention | |
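
One possible shape for the "idempotency" and "update & delete stale" ideas is sketched here, under assumptions: the freshly computed rows are loaded into a staging table, then written over existing rows and pruned of stale ones inside a single transaction. The `person_owner` and `staging` tables and the `idempotent_sync` function are hypothetical, and SQLite's `INSERT OR REPLACE` stands in for whatever upsert syntax the production database offers.

```python
import sqlite3

def idempotent_sync(conn, rows):
    """Bring person_owner in line with `rows`; safe to run repeatedly."""
    conn.execute(
        "CREATE TEMP TABLE IF NOT EXISTS staging "
        "(person_id INTEGER PRIMARY KEY, owner TEXT)"
    )
    with conn:  # one transaction: commits on success, rolls back on any error
        conn.execute("DELETE FROM staging")
        conn.executemany("INSERT INTO staging (person_id, owner) VALUES (?, ?)", rows)
        # Overwrite rows that already exist instead of failing on the key.
        conn.execute(
            "INSERT OR REPLACE INTO person_owner (person_id, owner) "
            "SELECT person_id, owner FROM staging"
        )
        # Remove rows that are no longer part of the desired state.
        conn.execute(
            "DELETE FROM person_owner "
            "WHERE person_id NOT IN (SELECT person_id FROM staging)"
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person_owner (person_id INTEGER PRIMARY KEY, owner TEXT)")
# Pre-existing state: one row written by someone else, one stale row.
conn.executemany(
    "INSERT INTO person_owner VALUES (?, ?)",
    [(112107, "recalculated"), (999, "stale")],
)
conn.commit()

desired = [(112107, "owner-a"), (112108, "owner-b")]
idempotent_sync(conn, desired)
idempotent_sync(conn, desired)  # second run leaves the table unchanged
final_rows = conn.execute(
    "SELECT person_id, owner FROM person_owner ORDER BY person_id"
).fetchall()
```

Because a pre-existing row is overwritten rather than rejected, a write that lands before the job no longer produces a duplicate key, and a failed run can simply be repeated. Whether the job's row should win over a concurrently recalculated one is a design decision this sketch does not settle.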

Timeline

  • 07:55 - Dev team noticed the job had failed and attempted a re-run. It failed again and we began investigating.

  • 08:13 - Chris Nowak: user noticed that TIS was missing data (person search)

  • 08:18 - Feedback to users from Simon: we are looking into it

  • 08:27 - Firefire channel created by Simon

  • 09:01 - TIS Sync started again

  • 09:02 - Service Status updated by Phil

  • 09:57 - Sync [PersonPlacementEmployingBodyTrustJob] completed successfully.

  • 10:25 - Last job reported as finished in our notifications channel

  • 10:29 - NDW ETL Started again

  • 10:35 - Service Status - changed back to green / all OK