Fix In Progress

Date

Authors

Cai Willis

Status

Resolved

Summary

Impact

Revalidation application was functionally unavailable

...

Non-technical Description

A process that refreshes all of the data in the revalidation system was accidentally triggered on the production (“live”) environment; it was intended to run only on the staging (“testing”) environment. Once this process has started, it must be allowed to complete before the data is restored, and unfortunately it takes a very long time to complete.

Trigger

Accidental triggering of full production data sync

...

  • 14:45 Sync process accidentally triggered

  • 14:45 Impact on production environment noticed by developer

  • 16:41 Failure in last stage of process noticed

  • 16:50 Cause of failure identified

  • 16:57 Failure rectified, process resumed

  • 20:33 Process completed, system restored

Root Cause(s)

Human Error

Possible contributing factor: “preprod” and “prod” are not visually distinct enough?

...

Action Items

Action Items

Owner

Add a mitigation against accidental prod triggers (what form this should take is still to be decided)

Build automated backups (or similar) into the sync process so that it can be aborted and the data restored as required

Introduce batch messaging to speed up the biggest bottleneck; judging by the work on the overnight doctor sync, this could reduce the whole process to a couple of hours
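As one possible answer to the open question in the first action item, a "type the environment name to confirm" guard is sketched below. This is an illustration only, not the team's chosen mitigation: the function names, the exception, and the set of protected environments are all hypothetical, and the actual sync trigger is represented by a placeholder.

```python
from typing import Optional

# Assumption: environment names follow the "preprod" / "prod" labels used in this report.
PROTECTED_ENVIRONMENTS = {"prod"}


class SyncNotConfirmed(Exception):
    """Raised when a sync on a protected environment is not explicitly confirmed."""


def confirm_sync(environment: str, typed_confirmation: Optional[str] = None) -> None:
    """Allow the sync only if the target is unprotected or explicitly confirmed.

    For a protected environment the operator must type the environment name
    back exactly, so a mis-click or a "preprod"/"prod" mix-up is rejected.
    """
    if environment not in PROTECTED_ENVIRONMENTS:
        return
    if typed_confirmation != environment:
        raise SyncNotConfirmed(
            f"Refusing to start full sync on '{environment}': "
            f"type '{environment}' to confirm."
        )


def start_full_sync(environment: str, typed_confirmation: Optional[str] = None) -> str:
    """Hypothetical entry point for the full data sync."""
    confirm_sync(environment, typed_confirmation)
    # Placeholder: the real sync trigger would be invoked here.
    return f"sync started on {environment}"
```

With a guard of this kind, `start_full_sync("preprod")` proceeds as before, while `start_full_sync("prod")` raises unless the operator also passes `"prod"` as the confirmation.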

...

Lessons Learned