Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

  • 14:45 Sync process accidentally triggered

  • 14:45 Impact on production environment noticed by developer

  • 16:41 Failure in last stage of process noticed

  • 16:50 Cause of failure identified

  • 16:57 Failure rectified, process resumed

  • 20:33 Process completed, system restored

  • 23:18 Recommendations Bug Discovered

  • 07:15 Users notified on Teams of Recommendation issues

  • 08:00 Patch to Recommendation UI deployed successfully, issue resolved ~ 08:00 Users notified of resolution

Root Cause(s)

Why was an unscheduled full production data-resynchronisation triggered?

  • Human Error

  • “preprod” and “prod” are not visually distinct enough?

Why was the sync not terminated?

  • No way of aborting the job or quickly reverting

...

Action Items

Action Items

Owner

Work out why the “traineeInfo” → “recommendationInfo” thing broke, and revert any patchwork done in the FE to compensate

Some mitigation for “accidental” prod triggers - what would this be?

Automated backups or similar “built-in” to the sync process so that it can be aborted and restored as required

Introduce batch messaging to speed up biggest bottleneck - judging by the work on the overnight doctor sync this could reduce the whole process down to a couple of hours

...