2020-05-04 PersonOwner job failed affecting local office filters
Date | May 4, 2020 |
Authors | … @Joseph (Pepe) Kelly |
Status |
|
Summary | PersonOwner job failed overnight. |
Impact | People couldn’t get a filtered list by LocalOffice until 10:30 |
Root Cause(s)
Job failed due to a ‘duplicate key’.
Someone updated placement 1819582 for person 112107 at 01:06, which (Pepe is speculating) caused the PersonOwner to be recalculated.
Re-runs were attempted after other users had logged on, and suffered from the same concurrency issue.
The sync job assumes that no other updates will be taking place and does nothing to try and lock the table during the update.
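A minimal sketch of the failure mode described above, and of one way a rewritten job might tolerate it. It assumes a hypothetical `person_owner` table keyed by person id and uses an in-memory SQLite database purely for illustration; the table name, columns, and owner values are not taken from the real job. SQLite's `INSERT OR REPLACE` stands in for whatever upsert the production database offers (MySQL, for example, has `INSERT ... ON DUPLICATE KEY UPDATE`).

```python
import sqlite3

def make_db():
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: one owner row per person.
    conn.execute(
        "CREATE TABLE person_owner ("
        "person_id INTEGER PRIMARY KEY, owner TEXT NOT NULL)"
    )
    return conn

def naive_sync(conn, rows):
    # Plain INSERT: raises IntegrityError if a row for the person already exists.
    conn.executemany(
        "INSERT INTO person_owner (person_id, owner) VALUES (?, ?)", rows
    )

def tolerant_sync(conn, rows):
    # Upsert: an existing row for the same person is overwritten, not rejected.
    conn.executemany(
        "INSERT OR REPLACE INTO person_owner (person_id, owner) VALUES (?, ?)",
        rows,
    )

conn = make_db()

# Simulate another update writing a row for person 112107
# just before the job tries to insert the same key.
conn.execute("INSERT INTO person_owner VALUES (112107, 'office-a')")

try:
    naive_sync(conn, [(112107, "office-b")])
    naive_failed = False
except sqlite3.IntegrityError:
    naive_failed = True  # the 'duplicate key' failure

tolerant_sync(conn, [(112107, "office-b"), (1, "office-c")])
result = conn.execute(
    "SELECT person_id, owner FROM person_owner ORDER BY person_id"
).fetchall()
```

An upsert only removes the hard failure; whether the job's value or the concurrent update's value should win is a separate design decision for the rewrite.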
Trigger
Failed overnight ETL
Resolution
Manually truncated the table and re-ran the job; we were possibly lucky that no one made a relevant update in that time window.
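The manual fix depended on nobody writing to the table between the truncate and the re-run. As a hedged sketch only, the clear-and-reload could instead run inside a single transaction that takes the write lock up front, so a failed reload rolls back to the previous contents. The `person_owner` schema and the `rebuild_person_owner` helper are hypothetical, and SQLite is used only to keep the example self-contained. `DELETE` is used rather than `TRUNCATE` because on some databases (MySQL, for example) `TRUNCATE` implicitly commits.

```python
import sqlite3

def rebuild_person_owner(conn, rows):
    """Replace the table contents in one transaction (hypothetical helper)."""
    # BEGIN IMMEDIATE acquires SQLite's write lock before any change is made,
    # so other writers wait or fail fast instead of interleaving with the reload.
    conn.execute("BEGIN IMMEDIATE")
    try:
        conn.execute("DELETE FROM person_owner")
        conn.executemany(
            "INSERT INTO person_owner (person_id, owner) VALUES (?, ?)", rows
        )
        conn.execute("COMMIT")
    except Exception:
        # Any failure restores the rows that were there before the rebuild.
        conn.execute("ROLLBACK")
        raise

# isolation_level=None: the sqlite3 module does not open transactions
# implicitly, so the explicit BEGIN/COMMIT above are the only ones in play.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE person_owner (person_id INTEGER PRIMARY KEY, owner TEXT NOT NULL)"
)
conn.execute("INSERT INTO person_owner VALUES (1, 'stale')")

rebuild_person_owner(conn, [(1, "office-a"), (2, "office-b")])
after_ok = conn.execute(
    "SELECT person_id, owner FROM person_owner ORDER BY person_id"
).fetchall()

# A reload that hits a duplicate key leaves the previous contents untouched.
try:
    rebuild_person_owner(conn, [(3, "office-c"), (3, "office-d")])
except sqlite3.IntegrityError:
    pass
after_failed = conn.execute(
    "SELECT person_id, owner FROM person_owner ORDER BY person_id"
).fetchall()
```

With this shape, users filtering by LocalOffice would keep seeing the last good data after a failed run, rather than an empty table.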
Detection
Slack alerting
Shortly afterwards, users reported the problem on Teams.
Action Items
Action Item | Type | Owner | Issue |
---|---|---|---|
Rewrite the job: | Prevention | | |
Improve sync-service alerting and/or resilience | Prevention | | |
Timeline
07:55 - Dev team noticed the job had failed and attempted to re-run it. It failed again and we began investigating.
08:13 - Chris Nowak - User noticed that TIS was missing data (person search)
08:18 - Feedback to users from Simon: we are looking into it
08:27 - Firefire channel created by Simon
09:01 - TIS Sync started again
09:02 - Service Status updated by Phil
09:57 - Sync [PersonPlacementEmployingBodyTrustJob] completed successfully.
10:25 - Last job reported as finished in our notifications channel
10:29 - NDW ETL Started again
10:35 - Service Status - changed back to green / all OK
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213