2018-12-21 NDW ETL created duplicate entries on NDW Prod

Date: 2018-12-21
Authors: Jayanta Saha, John Simmons, Chris Mills
Status: Resolved
Summary: The TIS Prod → NDW Prod ETL created duplicate entries in the NDW database. A recent code change caused the ETL to run from both Blue and Green, which are not synced, so every record was transferred twice.
Impact: Every record in the NDW Prod database was duplicated.


Impact

  • When the TIS Prod → NDW Prod ETL started, data from both Blue and Green was transferred, resulting in a duplicate of each record in NDW. All reporting then required NDW post-processing to remove the duplicates.
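
The post-processing applied on the NDW side is not described in this report. As a rough illustration of the idea only, the sketch below drops exact duplicate rows while keeping the first occurrence of each. The row layout and values are hypothetical.

```python
from typing import Iterable, List, Tuple


def drop_exact_duplicates(rows: Iterable[Tuple]) -> List[Tuple]:
    """Keep the first occurrence of each row, preserving input order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


# Hypothetical data: every record arrives twice, once per ETL run.
rows = [(1, "Post A"), (2, "Post B"), (1, "Post A"), (2, "Post B")]
deduplicated = drop_exact_duplicates(rows)
```

A blanket approach like this would also collapse any duplicates that exist for other reasons, so it would need care wherever legacy duplicates have to be handled separately.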

Root Causes

  • Recent changes to the TIS-DEVOPS deployment code started the TIS-NDW-ETL on both the Blue and Green servers.

Trigger

  • Duplicates appeared in several NDW database tables. The NDW team, or one of their downstream users, noticed the problem and alerted TIS in the #tis-ndw-etl Slack channel.

Resolution

  • The TIS-DEVOPS deployment code was modified so that the ETL runs only on Green.
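
The actual change was made in the TIS-DEVOPS deployment code, which is not reproduced here. As a minimal sketch of the same idea, a job could check which colour it is deployed on and exit early unless that colour is Green. The environment variable name `DEPLOY_COLOUR` is hypothetical and not taken from the TIS code base.

```python
import os
from typing import Mapping, Optional

# Colour that is allowed to run the ETL (Green, per the resolution above).
ETL_COLOUR = "green"


def should_run_etl(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only when this host is deployed as the ETL colour."""
    env = os.environ if environ is None else environ
    # DEPLOY_COLOUR is a hypothetical variable name used for illustration.
    colour = env.get("DEPLOY_COLOUR", "")
    return colour.strip().lower() == ETL_COLOUR


if __name__ == "__main__":
    if should_run_etl():
        print("Running ETL on this host")
    else:
        print("Skipping ETL: this host is not the designated colour")
```

Defaulting to "do not run" when the variable is missing means a misconfigured host skips the ETL rather than duplicating it.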

Detection / Timeline

20-12-2018

  • A user queried the NDW team via the #tis-ndw-etl channel about duplicates in the NDW database. DevOps started looking at the Docker logs and saved them to a file (all Devs were on a visit to King's Hospital).

21-12-2018

  • Similar issues occurred today, so:
    1. The Dev team checked the Docker logs.
    2. They then ran queries to check whether the row counts matched between the TCS and NDW database tables. The row counts did not match.
    3. DevOps checked the tis_ndw_etl code version in Docker on both Stage and Prod (they were the same).
    4. They then checked the TIS-DEVOPS code base and found that the ETL is running on both Blue and Green (without locking or checking, so it just repeats everything twice).
    5. Checked whether this was also the case on Stage (because NDW UAT, which receives data from TIS Stage, was not suffering from the duplicate issue). It wasn't.
    6. Identified that major improvements to the ETL are needed in future (apart from the fixes for this specific issue). Jira ticket to follow.
    7. Found that it (what?) had not been removed from Apps when an array problem was resolved 10 days ago, indicating that the problem has been around for 10 days but was only picked up two days ago.
    8. Duplicates still exist, but these are legacy duplicates inherited from Intrepid, so they should be handled separately.
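
The queries used for the row-count check in step 2 are not recorded. The sketch below shows one way such a comparison could look, using in-memory SQLite databases as stand-ins for the TCS and NDW databases. The table name `Post` and the sample rows are hypothetical.

```python
import sqlite3
from typing import Dict, Iterable, Tuple


def row_count(conn: sqlite3.Connection, table: str) -> int:
    # Table names cannot be bound as SQL parameters, so pass trusted names only.
    return conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]


def compare_row_counts(
    source: sqlite3.Connection,
    target: sqlite3.Connection,
    tables: Iterable[str],
) -> Dict[str, Tuple[int, int]]:
    """Return {table: (source_count, target_count)} for tables that differ."""
    mismatches = {}
    for table in tables:
        source_count = row_count(source, table)
        target_count = row_count(target, table)
        if source_count != target_count:
            mismatches[table] = (source_count, target_count)
    return mismatches


# Stand-in databases: the target holds every source row twice.
tcs = sqlite3.connect(":memory:")
ndw = sqlite3.connect(":memory:")
for conn in (tcs, ndw):
    conn.execute("CREATE TABLE Post (id INTEGER)")
tcs.executemany("INSERT INTO Post VALUES (?)", [(1,), (2,)])
ndw.executemany("INSERT INTO Post VALUES (?)", [(1,), (2,), (1,), (2,)])

mismatches = compare_row_counts(tcs, ndw, ["Post"])
```

A target count of exactly double the source count, as in this stand-in data, is the pattern that would be expected if the ETL ran once from each of Blue and Green.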

Action Items

  • Removed from Apps, so the behaviour of the TIS Prod → NDW Prod ETL should match that of the TIS Stage → NDW UAT ETL and not duplicate entries.
  • Raise a ticket to fix the ETL properly and address the tech debt that makes it far from optimal.
  • Follow up with the NDW team to confirm that the problem was introduced 10 days ago, and to understand its knock-on implications in order to determine whether any further adjustments are required (e.g. for people who have put reports together based on double the records).
  • Understand precisely why TIS Stage → NDW UAT is not exhibiting the same problem: is Blue/Green synced correctly on Stage, and not on Prod?

Lessons Learned

What went well

  • Relatively quickly identified the problem.
  • Relatively quickly identified a quick fix to prevent problem occurring again this evening.
  • NDW also identified a quick fix at their end and applied it to fix problem today.

What went wrong

  • Speedy PR approval missed the underlying problem.
  • No alerts to indicate the problem had happened (or that it actually started happening 10 days ago).
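
One possible shape for such an alert is a check run after each ETL that looks for key values appearing more than once in a target table. The sketch below is illustrative only: it uses an in-memory SQLite database, and the table name `Person` and key column `id` are hypothetical.

```python
import sqlite3
from typing import List, Tuple


def find_duplicate_keys(
    conn: sqlite3.Connection, table: str, key_column: str
) -> List[Tuple[int, int]]:
    """Return (key, occurrences) for every key that appears more than once."""
    # Identifiers cannot be bound as SQL parameters, so pass trusted names only.
    query = (
        f'SELECT "{key_column}", COUNT(*) FROM "{table}" '
        f'GROUP BY "{key_column}" HAVING COUNT(*) > 1 '
        f'ORDER BY "{key_column}"'
    )
    return conn.execute(query).fetchall()


# Stand-in target table in which key 2 has been loaded twice.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Person (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO Person VALUES (?, ?)",
    [(1, "a"), (2, "b"), (2, "b")],
)

duplicates = find_duplicate_keys(conn, "Person", "id")
if duplicates:
    print(f"ALERT: {len(duplicates)} duplicated key(s) found in Person")
```

Because some legacy duplicates already exist, a real check would probably need to compare against a known baseline rather than alert on any duplicate at all.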

Where we got lucky

  • NDW were able to fix today's data.
  • Fix for future data was straightforward to implement.

Supporting information

  • To check the Docker logs, use the following command: docker logs ndw-etl_ndw-etl_1 --tail 3000

  • #incident-ndw-20181221 channel in Slack