/
2020-04-14 20,000 Post Specialities Missing

2020-04-14 20,000 Post Specialities Missing

 

Date

Apr 14, 2020

Authors

@Philip Wilsdon (Unlicensed)

Status

In Progress

Summary

20,000 Post Specialities Missing

A load from a backup caused an additional issue

We used the most recent backup including Apr 13, 2020 and copied 60,000 PostSpecialty records from data as of Apr 9, 2020.



Impact

Users unable to modify data through Tuesday PM.

Data was not complete between Thursday PM and Tuesday evening.

 

 

Root Cause(s)

  • Existing Bug when updating a field in the specialty table whereby updating any of the fields e.g. specialty name, specialty group in that table results in an unexpected behaviour and deletes the PostSpecialty across all posts where the specialty is in use https://hee-tis.atlassian.net/browse/TISNEW-3854.

5 Whys

  1. We didn't fix a bug as we didn't realise the impact / severity

  2. We didn't prioritise / put it in a sprint

  3. Some specialty groups using the Front end interface

  4. Tech debt / bug with specialty groups (cascade type all) meant that the post specialities attached to specialities were updated to null

  5. Built with the wrong logic and testing (TDD/Unhappy path)

 

Trigger

Updating a number of specialties via. Front-end

 

Timeline

  • SpecialtyGroup field in Specialty Table on FrontEnd updated on Thursday morning starting at 11:16 09/04/2020 and ending at 14:11 09/04/2020. New SpecialtyGroup values created and assigned as part of this work.

  • Reported on teams at 11:45 on Apr 14, 2020 that specialities are missing

  • Reported to dev team at 12:36 as a Fire Fire issue

  • Fire Fire channel created and work begins to compare data as of Apr 9, 2020 with production

  • Apr 14, 2020 13:31 - Restore redirected from comparison db to prod

  • Intended to take backup and restore to a new database so we could compare. What actually happened - replaced the existing database - Script used - database name was at the top.

  • Attempted to load transactions since restore but could not be done because of writes since the restore.

  • Apr 14, 2020 16:23 - Data up to Tuesday AM restored (and just under 60k PostSpecialties re-established)

  • Apr 15, 2020 11:29 - James and Ashley re-tried bulk uploads that took place after the morning of 14/04 backup till the end of the day. A list of both successes/and errored uploads communicated to users to correct and re-try where necessary.

  • DB backups failed Apr 15, 2020 PM.

  • Apr 16, 2020 Purged bin log

  • Apr 16, 2020 02:?? backup failed due to lack of storage and made the database unavailable - see 2020-04-16 Users not able to login - too many DB connections

 

Where we got lucky

  • Got away with a bug for ages

Where we were unfortunate

  • No one updated the specialty and reported the error

  • So when we did in bulk - massive problem

  • User found the bug

Resolution

  • Restore the PostSpecialty data using the backup db data, and inform users to stop updating Specialty

  • Fix the PostSpecialty deletion bug as soon as possible, and run script to fix the rest data

PostSpecialty bug Analysis:

Reproduce the bug in local, and found on Admin/Speicalty page, no matter what field is updated, the related PostSpecialty data would be removed in the DB.

Went through the codes, noticed the Specialty entity has a OneToMany relationship to PostSpecialty table, but the cascade level was set to CacadeType.ALL, which means when the Specialty is updated/saved/deleted, the PostSpecialty would be updated on cascade. However, when the Specialty is updated on frontend, it won’t send any PostSpecialty to the backend, so the PostSpecialty set would be set to empty.

When we update Specialty, acually we don’t need to update anything in PostSpecialty, so the cascade level should be limited.

Detection

Reported on teams

 

Post Mortum Call

Initial investigation updated Apr 16, 2020

  • Specilty JAVA class - cascade type all - means when we update specilty the post specilty is updated as well

  • When saving the post, saves the post specialty to an “empty set”

  • If we update anything on the specility - posts speciality will be removed

  • Lots of work that needs doing for reference - access, moving and refactoring etc

  • Cascade remove and refresh so when we update the specilty page - do not need to update the post specility

  • How to test - integration test and ask James to do what he did again on Thursday morning

  • What other areas of TIS has this cascade ideas

  • Specility groups are valid and should be stored

  • API being called - need to provide parent and child entities

 

 

Action Items

Action Item

Type

Owner

Issue

Action Item

Type

Owner

Issue

Fix and Test

Code test and redo the same scenario on stage to make sure the same issue doesn't happen again

 

Bugfix


Vulgabee


https://hee-tis.atlassian.net/browse/TISNEW-3854

 

Search Codebase

Find anywhere in that might have cascade all.

Ticket up, Identify areas. Assess if critical and discuss actions

Prevent elsewhere

Vulgabee

https://hee-tis.atlassian.net/browse/TISNEW-4268

Move Specialty to Reference Service

 

Vulgabee

 

Restrict Access to Reference tables

 

Vulgabee

https://hee-tis.atlassian.net/browse/TISNEW-3887

Run Automated ‘e2e’ tests

Look at running automated e2e tests every night that reports any issues for the team to look at first thing in the morning.

Prevention

Vulgabee

 

Use a different partition for temporary backup files

Prevention

All Teams / DevOps

https://hee-tis.atlassian.net/browse/TISNEW-4251

Make database restores easier (Script it?)

Task

All Teams / DevOps

https://hee-tis.atlassian.net/browse/TISNEW-4266?searchSessionId=223caf6b-35e2-4947-b358-5fd28a4e9483&searchContainerId=10805&searchContentType=issue&searchObjectId=44956

Resolve issue with DB server config not being used

Reliability

All Teams / DevOps

https://hee-tis.atlassian.net/browse/TISNEW-4267?searchSessionId=9051ccf6-8d28-4fdf-b21b-2dd2d845c034&searchContainerId=10805&searchContentType=issue&searchObjectId=44957



 

Related content