Date	11 Dec 2023
Authors	Cai Willis
Status	Resolved (Documenting)
Summary
Impact	Revalidation application was functionally unavailable

Non-technical Description

A process that refreshes all of the data in the revalidation system was accidentally triggered on the production (“live”) environment; it was only meant to be run on the staging (“testing”) environment. Once started, this process must be allowed to complete before the data is restored, and it takes a very long time to complete (just under six hours in this incident, from 14:45 to 20:33).

Trigger

Accidental triggering of full production data sync

Detection

Noticed by developer

Resolution

Allowed process to complete naturally

Timeline

All times are in GMT unless otherwise indicated

11 Dec 2023 14:45 Sync process accidentally triggered
11 Dec 2023 14:45 Impact on production environment noticed by developer
11 Dec 2023 16:41 Failure in last stage of process noticed
11 Dec 2023 16:50 Cause of failure identified
11 Dec 2023 16:57 Failure rectified, process resumed
11 Dec 2023 20:33 Process completed, system restored
11 Dec 2023 23:18 Recommendations bug discovered
12 Dec 2023 07:15 Users notified on Teams of Recommendations issues
12 Dec 2023 08:00 Patch to Recommendations UI deployed successfully, issue resolved
12 Dec 2023 ~08:00 Users notified of resolution

Root Cause(s)

Why did the connections list disappear on production?

An unscheduled full production data-resynchronisation was triggered

Why was an unscheduled full production data-resynchronisation triggered?

Human Error
“preprod” and “prod” are not visually distinct enough?

Why was the sync not terminated?

No way of aborting the job or quickly reverting

Action Items

Action Items	Owner
Work out what broke around “traineeInfo” → “recommendationInfo”, and revert any patchwork done in the front end (FE) to compensate
Add some mitigation for accidental prod triggers (what would this be?)
Build automated backups (or similar) into the sync process so that it can be aborted and the data restored as required
Introduce batch messaging to speed up the biggest bottleneck; judging by the work on the overnight doctor sync, this could reduce the whole process to a couple of hours
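
As a starting point for discussing the prod-trigger and batch-messaging items above, here is a minimal Python sketch of two possible building blocks. It is illustrative only and not taken from the revalidation codebase: the names `PROTECTED_ENVIRONMENTS`, `confirm_sync` and `batched` are hypothetical, and the real trigger mechanism and messaging layer may call for a different approach. The first function makes the operator type the environment name back before a full sync can start on a protected environment; the second groups records into fixed-size batches so that one message can carry many records.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")

# Assumption: "prod" is the only environment that needs extra confirmation.
# The names "prod" and "preprod" come from this report; the set is hypothetical.
PROTECTED_ENVIRONMENTS = {"prod"}


def confirm_sync(environment: str, ask: Callable[[str], str] = input) -> bool:
    """Return True only if a full sync may start on `environment`.

    Unprotected environments proceed without a prompt. For protected ones
    the operator must type the environment name back exactly.
    """
    if environment not in PROTECTED_ENVIRONMENTS:
        return True
    typed = ask(
        f"You are about to start a FULL sync on '{environment}'. "
        "Type the environment name to continue: "
    )
    return typed.strip() == environment


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `size` items from `items`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

With this sketch, `confirm_sync("preprod")` returns `True` without prompting, while `confirm_sync("prod")` returns `True` only if the operator types `prod`. `list(batched(range(5), 2))` gives `[[0, 1], [2, 3], [4]]`. Passing `ask` as a parameter is a design choice that keeps the guard testable without a terminal.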
