
Date

Authors

Cai Willis

Status

Resolved (Documenting)

Summary

Impact

Revalidation application was functionally unavailable

Non-technical Description

A process that refreshes all of the data in the revalidation system was accidentally triggered on the production (“live”) environment; it was intended to run only on the staging (“testing”) environment. Once this process has started, it must be allowed to complete before the data is restored, and it takes a very long time to complete.

Trigger

Accidental triggering of full production data sync

Detection

Noticed by developer


Resolution

Allowed process to complete naturally


Timeline

All times in GMT unless otherwise indicated

  • 14:45 Sync process accidentally triggered

  • 14:45 Impact on production environment noticed by developer

  • 16:41 Failure in last stage of process noticed

  • 16:50 Cause of failure identified

  • 16:57 Failure rectified, process resumed

  • 20:33 Process completed, system restored

  • 23:18 Recommendations bug discovered

  • 07:15 Users notified on Teams of Recommendation issues

  • 08:00 Patch to Recommendation UI deployed successfully, issue resolved

  • ~08:00 Users notified of resolution

Root Cause(s)

Why was an unscheduled full production data-resynchronisation triggered?

  • Human Error

  • “preprod” and “prod” are not visually distinct enough?
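
One possible guard against this kind of mix-up, sketched under assumptions: before a sync starts against a protected environment, the operator must re-type that environment's name. The function name `confirm_environment` and the environment label `prod` as a protected value are hypothetical and are not taken from the revalidation codebase.

```python
def confirm_environment(target: str, typed: str, protected=("prod",)) -> bool:
    """Return True only if the run may proceed.

    Hypothetical guard: for protected environments the operator must
    re-type the environment name exactly; other environments pass through.
    """
    if target not in protected:
        return True
    # Surrounding whitespace is ignored, but "preprod" never matches "prod".
    return typed.strip() == target
```

A trigger script could call `confirm_environment(env, input("Type the environment name to continue: "))` and exit if it returns `False`.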

Why was the sync not terminated?

  • No way of aborting the job or quickly reverting
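
To illustrate what an abortable sync could look like, here is a minimal sketch. It assumes the sync clears the existing data and rebuilds it record by record, which is an inference from the description above rather than a confirmed detail. The in-memory `dict` stands in for the real data store, and `run_sync` is a hypothetical name.

```python
import copy
import threading


def run_sync(store: dict, new_records: dict, abort: threading.Event) -> bool:
    """Rebuild `store` from `new_records`; restore the snapshot if aborted.

    Returns True if the sync completed, False if it was aborted and rolled back.
    """
    snapshot = copy.deepcopy(store)  # backup taken before any data is touched
    store.clear()                    # assumed destructive step of a full resync
    for key, value in new_records.items():
        if abort.is_set():           # abort flag is checked between records
            store.clear()
            store.update(snapshot)   # revert to the pre-sync state
            return False
        store[key] = value
    return True
```

Setting the `abort` event from another thread (or an admin endpoint) would stop the job at the next record and leave the store as it was before the sync began.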


Action Items

  • Investigate why “traineeInfo” → “recommendationInfo” broke, and revert any patches made in the front end (FE) to compensate

  • Add a mitigation for accidental prod triggers (the form of this mitigation is still to be decided)

  • Build automated backups (or similar) into the sync process so that it can be aborted and the data restored as required

  • Introduce batch messaging to speed up the biggest bottleneck; judging by the work on the overnight doctor sync, this could reduce the whole process to a couple of hours

No owners are recorded against these items.
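
As a rough illustration of the batch messaging idea, the sketch below groups records into fixed-size batches so that one message can carry many records instead of one. The helper name `batched` and the batch size are assumptions; the real message format and broker used by the sync are not described here.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of up to `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # final, possibly smaller, batch
        yield batch
```

A publisher could then loop with `for batch in batched(records, 100): publish(batch)`, where `publish` is a hypothetical function that sends one message per batch.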


Lessons Learned
