...
14:45 Sync process accidentally triggered
14:45 Impact on production environment noticed by developer
16:41 Failure in last stage of process noticed
16:50 Cause of failure identified
16:57 Failure rectified, process resumed
20:33 Process completed, system restored
23:18 Recommendations bug discovered
07:15 Users notified on Teams of Recommendation issues
08:00 Patch to Recommendation UI deployed successfully, issue resolved
~08:00 Users notified of resolution
Root Cause(s)
Why was an unscheduled full production data-resynchronisation triggered?
Human Error
“preprod” and “prod” are not visually distinct enough?
Why was the sync not terminated?
No way of aborting the job or quickly reverting
...
Action Items
Action Items | Owner | |
---|---|---|
Work out why the “traineeInfo” → “recommendationInfo” thing broke, and revert any workarounds added in the front end (FE) to compensate | ||
Some mitigation for “accidental” prod triggers - what would this be? | ||
Automated backups or similar “built-in” to the sync process so that it can be aborted and restored as required | ||
Introduce batch messaging to speed up the biggest bottleneck - judging by the work on the overnight doctor sync, this could reduce the whole process to a couple of hours | ||
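
As a starting point for discussion, two of the action items above (a guard against accidental prod triggers, and batch messaging) could be sketched roughly as follows. This is only an illustration and not how the sync is currently built: `confirm_target`, `batched`, `run_sync`, the `send_batch` callback and the default batch size of 100 are all hypothetical names and values, and it assumes the environment is passed in as the string “prod” or “preprod”.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


class SyncNotConfirmed(Exception):
    """Raised when a prod sync is requested without explicit confirmation."""


def confirm_target(environment: str, typed_confirmation: str) -> None:
    """Refuse to run against prod unless the operator typed "prod".

    Hypothetical guard: other environments pass without confirmation.
    """
    if environment == "prod" and typed_confirmation != "prod":
        raise SyncNotConfirmed(
            f"Refusing to sync {environment!r}: type the environment name to confirm"
        )


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def run_sync(
    environment: str,
    typed_confirmation: str,
    records: Iterable[T],
    send_batch: Callable[[List[T]], None],
    batch_size: int = 100,  # assumed value, to be tuned against the real bottleneck
) -> int:
    """Check the target first, then send records in batches.

    Returns the number of batches sent. `send_batch` stands in for
    whatever actually publishes messages in the real sync.
    """
    confirm_target(environment, typed_confirmation)
    sent = 0
    for batch in batched(records, batch_size):
        send_batch(batch)  # one message per batch instead of one per record
        sent += 1
    return sent
```

The guard runs before any record is read, so a mistyped or missing confirmation stops the job before it touches production. It does not address aborting or restoring a sync that is already running; that would still need the backup work listed above.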
...