2023-12-11 Unscheduled full production data-resynchronisation accidentally triggered
Date | Dec 11, 2023 |
Authors | @Cai Willis |
Status | Resolved |
Summary | |
Impact | Revalidation application was functionally unavailable |
Non-technical Description
A process that refreshes all of the data in the revalidation system was accidentally triggered on the production (“live”) environment; it was intended to run only on the staging (“testing”) environment. Once this process has started, it must be allowed to complete before the data is restored, and it takes a very long time to complete (just under six hours in this incident).
Trigger
Accidental triggering of full production data sync
Detection
Noticed by developer
Resolution
Allowed process to complete naturally
Timeline
All times are in GMT unless otherwise indicated
Dec 11, 2023 14:45 Sync process accidentally triggered
Dec 11, 2023 14:45 Impact on production environment noticed by developer
Dec 11, 2023 16:41 Failure in last stage of process noticed
Dec 11, 2023 16:50 Cause of failure identified
Dec 11, 2023 16:57 Failure rectified, process resumed
Dec 11, 2023 20:33 Process completed, system restored
Dec 11, 2023 23:18 Recommendations bug discovered
Dec 12, 2023 07:15 Users notified on Teams of Recommendation issues
Dec 12, 2023 08:00 Patch to Recommendation UI deployed successfully, issue resolved
Dec 12, 2023 ~08:00 Users notified of resolution
Root Cause(s)
Why did the connections list disappear on production?
An unscheduled full production data-resynchronisation was triggered
Why was an unscheduled full production data-resynchronisation triggered?
Human Error
Possibly because “preprod” and “prod” are not visually distinct enough
Why was the sync not terminated?
There is no way to abort the job or to quickly revert its effects
Why did the recommendations list disappear?
Because the DTO naming conversion (“traineeInfo” → “recommendationInfo”) in the integration service stopped working
Why did the DTO naming conversion in the integration service stop working?
The commit that solved the datetime issue also appears to have changed the JSON mapping in the application, particularly for REST documentation, in such a way that it was no longer compatible with Camel
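A regression of this kind can be caught by a small contract check on the payload shape. The following is a minimal sketch only: it is written in Python rather than in the integration service's own stack, and `FIELD_RENAMES`, `rename_keys`, `assert_contract` and the sample payload are hypothetical names made up for illustration. Only the two field names (“traineeInfo” and “recommendationInfo”) come from this incident.

```python
# Hypothetical mapping; only these two field names come from the incident report.
FIELD_RENAMES = {"traineeInfo": "recommendationInfo"}


def rename_keys(payload, renames=FIELD_RENAMES):
    """Return a copy of payload with keys renamed at every nesting level."""
    if isinstance(payload, dict):
        return {renames.get(k, k): rename_keys(v, renames) for k, v in payload.items()}
    if isinstance(payload, list):
        return [rename_keys(item, renames) for item in payload]
    return payload


def assert_contract(payload):
    """Fail loudly if the expected field is missing or the old name leaks through."""
    if "recommendationInfo" not in payload:
        raise AssertionError("recommendationInfo missing from payload")
    if "traineeInfo" in payload:
        raise AssertionError("traineeInfo was not converted")


if __name__ == "__main__":
    # Made-up sample payload, for illustration only.
    upstream = {"traineeInfo": [{"id": "example"}], "totalResults": 1}
    converted = rename_keys(upstream)
    assert_contract(converted)
    print(converted)
```

Running a check like this against the service's real serialised output, as part of the build, would flag a change in JSON naming before it reaches the recommendations list.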
Action Items
Action Items | Owner | Status |
---|---|---|
Fix the JSON mapping issue in recommendations, and revert any workarounds made in the front end to compensate | @Cai Willis | Done |
Build automated backups (or similar) into the sync process so that it can be aborted and the data restored as required | @Cai Willis | |
Introduce batch messaging to speed up the biggest bottleneck; judging by the work on the overnight doctor sync, this could reduce the whole process to a couple of hours | | Work addressing this action is already lined up, e.g. the GMC sync work |
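The backup and batching actions above could be combined along the following lines. This is a hedged sketch, not the sync service's design: the in-memory `store` dictionary stands in for the real data store, `copy.deepcopy` stands in for a real backup, and `batches`, `resync` and `abort_event` are hypothetical names. It assumes the sync clears existing data before reloading it, which the report does not state explicitly.

```python
import copy
import threading


def batches(items, size):
    """Yield consecutive slices of items, each at most `size` long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def resync(store, source_records, batch_size, abort_event):
    """Rebuild `store` from `source_records` in batches.

    Returns True if the sync completed, or False if it was aborted and the
    pre-sync snapshot was restored.
    """
    snapshot = copy.deepcopy(store)  # stand-in for a real backup
    store.clear()                    # assumed destructive first step
    for batch in batches(source_records, batch_size):
        if abort_event.is_set():
            store.clear()
            store.update(snapshot)   # roll back to the pre-sync state
            return False
        for record in batch:         # records are handled one batch at a time
            store[record["id"]] = record
    return True


if __name__ == "__main__":
    store = {1: {"id": 1, "name": "old"}}
    abort = threading.Event()
    abort.set()  # simulate an operator aborting straight away
    completed = resync(store, [{"id": 2, "name": "new"}], batch_size=100, abort_event=abort)
    print(completed, store)
```

Checking the abort flag between batches keeps the check cheap while bounding how long an operator waits for the job to stop; the snapshot is what makes stopping safe.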
Lessons Learned
Mistakes happen
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213