2020-04-14 20,000 Post Specialities Missing
Date | Apr 14, 2020 |
Authors | @Philip Wilsdon (Unlicensed) |
Status | In Progress |
Summary | 20,000 Post Specialities Missing A load from a backup caused an additional issue We used the most recent backup including Apr 13, 2020 and copied 60,000 PostSpecialty records from data as of Apr 9, 2020. |
Impact | Users unable to modify data through Tuesday PM. Data was not complete between Thursday PM and Tuesday evening. |
Root Cause(s)
Existing Bug when updating a field in the specialty table whereby updating any of the fields e.g. specialty name, specialty group in that table results in an unexpected behaviour and deletes the PostSpecialty across all posts where the specialty is in use https://hee-tis.atlassian.net/browse/TISNEW-3854.
5 Whys
We didn't fix a bug as we didn't realise the impact / severity
We didn't prioritise / put it in a sprint
Some specialty groups using the Front end interface
Tech debt / bug with specialty groups (cascade type all) meant that the post specialities attached to specialities were updated to null
Built with the wrong logic and testing (TDD/Unhappy path)
Trigger
Updating a number of specialties via. Front-end
Timeline
SpecialtyGroup field in Specialty Table on FrontEnd updated on Thursday morning starting at 11:16 09/04/2020 and ending at 14:11 09/04/2020. New SpecialtyGroup values created and assigned as part of this work.
Reported on teams at 11:45 on Apr 14, 2020 that specialities are missing
Reported to dev team at 12:36 as a Fire Fire issue
Fire Fire channel created and work begins to compare data as of Apr 9, 2020 with production
Apr 14, 2020 13:31 - Restore redirected from comparison db to prod
Intended to take backup and restore to a new database so we could compare. What actually happened - replaced the existing database - Script used - database name was at the top.
Attempted to load transactions since restore but could not be done because of writes since the restore.
Apr 14, 2020 16:23 - Data up to Tuesday AM restored (and just under 60k PostSpecialties re-established)
Apr 15, 2020 11:29 - James and Ashley re-tried bulk uploads that took place after the morning of 14/04 backup till the end of the day. A list of both successes/and errored uploads communicated to users to correct and re-try where necessary.
DB backups failed Apr 15, 2020 PM.
Apr 16, 2020 Purged bin log
Apr 16, 2020 02:?? backup failed due to lack of storage and made the database unavailable - see 2020-04-16 Users not able to login - too many DB connections
Where we got lucky
Got away with a bug for ages
Where we were unfortunate
No one updated the specialty and reported the error
So when we did in bulk - massive problem
User found the bug
Resolution
Restore the PostSpecialty data using the backup db data, and inform users to stop updating Specialty
Fix the PostSpecialty deletion bug as soon as possible, and run script to fix the rest data
PostSpecialty bug Analysis:
Reproduce the bug in local, and found on Admin/Speicalty page, no matter what field is updated, the related PostSpecialty data would be removed in the DB.
Went through the codes, noticed the Specialty entity has a OneToMany relationship to PostSpecialty table, but the cascade level was set to CacadeType.ALL, which means when the Specialty is updated/saved/deleted, the PostSpecialty would be updated on cascade. However, when the Specialty is updated on frontend, it won’t send any PostSpecialty to the backend, so the PostSpecialty set would be set to empty.
When we update Specialty, acually we don’t need to update anything in PostSpecialty, so the cascade level should be limited.
Detection
Reported on teams
Post Mortum Call
Initial investigation updated Apr 16, 2020
Only been investigating for like half a day
Specilty JAVA class - cascade type all - means when we update specilty the post specilty is updated as well
When saving the post, saves the post specialty to an “empty set”
If we update anything on the specility - posts speciality will be removed
Lots of work that needs doing for reference - access, moving and refactoring etc
Cascade remove and refresh so when we update the specilty page - do not need to update the post specility
How to test - integration test and ask James to do what he did again on Thursday morning
What other areas of TIS has this cascade ideas
Specility groups are valid and should be stored
API being called - need to provide parent and child entities
Action Items
Action Item | Type | Owner | Issue |
---|---|---|---|
Fix and Test Code test and redo the same scenario on stage to make sure the same issue doesn't happen again |
Bugfix |
|
|
Search Codebase Find anywhere in that might have cascade all. Ticket up, Identify areas. Assess if critical and discuss actions | Prevent elsewhere | Vulgabee | |
Move Specialty to Reference Service |
| Vulgabee |
|
Restrict Access to Reference tables |
| Vulgabee | |
Run Automated ‘e2e’ tests Look at running automated e2e tests every night that reports any issues for the team to look at first thing in the morning. | Prevention | Vulgabee |
|
Use a different partition for temporary backup files | Prevention | All Teams / DevOps | |
Make database restores easier (Script it?) | Task | All Teams / DevOps | |
Resolve issue with DB server config not being used | Reliability | All Teams / DevOps |
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213