Change Data Capture (CDC)

What

CDC, as the name suggests, captures data changes that happen in a database and sends them out to other systems, ready for those systems to consume and react to.

Why

In distributed computing, you often want to let other components know when something has occurred. You quickly face a choice between the orchestrated and choreographed models for doing this. Lessons learnt from other systems suggest that the orchestrated model scales poorly and becomes difficult to maintain as you scale out.

This leaves us with the choreographed model, which in practice means using some form of messaging system. Messaging systems can be difficult to work with if you have a large number of varying events: each flow runs application logic, saves to a database and then sends a message to the message bus, and that pattern is repeated in many places.

This approach leaves us with a number of issues. If you choose to send a message straight after the write to the database, you end up with the dual-write problem. What will you do if the message fails to send? Do you ignore the failure, potentially losing the message and everything that would have reacted to it? Or do you roll back the write to the DB?
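
The failure mode can be sketched in a few lines. This is only an illustration: the dict standing in for the database, the FlakyBus class and the place_order function are all hypothetical and do not come from any of our systems.

```python
class FlakyBus:
    """Hypothetical stand-in for a message bus that can be made to fail."""

    def __init__(self, fail=False):
        self.fail = fail
        self.sent = []

    def publish(self, message):
        if self.fail:
            raise ConnectionError("broker unavailable")
        self.sent.append(message)


def place_order(db, bus, order):
    """Dual write: two separate systems, no shared transaction."""
    db[order["id"]] = order  # write 1: committed to the "database"
    bus.publish({"type": "order_placed", "id": order["id"]})  # write 2: may fail


db = {}  # hypothetical in-memory database
bus = FlakyBus(fail=True)
try:
    place_order(db, bus, {"id": 1, "total": 42})
except ConnectionError:
    pass  # the caller now has to choose: ignore, retry, or undo the DB write

row_saved = 1 in db
message_sent = len(bus.sent) > 0
print(row_saved, message_sent)
```

The row is saved but no message was ever sent, so downstream systems never hear about the order.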

There is also the issue of consistency. What happens when the write to the database completes more slowly than the message is processed? Does the processor need to read strongly consistent data?

This is where CDC works well. If you want to react to many events that happen to your data, you no longer need to write effectively duplicated publishing code in every system flow.

Because you're reading atomically committed data, you're not running in the same transaction as the original write, so a failure causes no issues: you can simply retry. You're also not going to have timing issues, as the data is already committed by the time the processors run.
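
As a rough sketch of why retrying is safe here: the change has already been committed, so a processor can be called again without touching the original write. The process_with_retry helper and flaky_handler below are hypothetical names for illustration, and a real processor would also want to be idempotent.

```python
import time


def process_with_retry(change, handler, attempts=3, delay_seconds=0.0):
    """Call handler(change) until it succeeds or the attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return handler(change)
        except Exception as error:  # the committed row is unaffected by this
            last_error = error
            if attempt < attempts:
                time.sleep(delay_seconds)
    raise last_error


calls = {"count": 0}


def flaky_handler(change):
    """Hypothetical processor that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("downstream unavailable")
    return f"processed {change['table']}#{change['id']}"


result = process_with_retry({"table": "orders", "id": 1}, flaky_handler)
print(result, calls["count"])
```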

When

Systems that are becoming event-driven and beginning to scale are a good fit for CDC. Being a downstream component, it also doesn't require any changes to existing systems (with the exception of configuring the DB to output a bin log).

How

There are many CDC tools on the market that can capture changes to the DB. Our particular use case requires support for both MySQL and RabbitMQ.

We're building a proof of concept with Maxwell's Daemon, which supports reading MySQL bin logs and publishing to a number of messaging systems, RabbitMQ being one of them.

The actual work is to change the MySQL settings to output the bin log in 'row' format (binlog_format=ROW) and to spin up a Docker container that has access to both MySQL and the messaging system. This will then write messages to RabbitMQ in JSON form, showing what the new data looks like as well as the old. Each message also carries metadata such as database, table and timestamp, which processors can use if they wish.
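
To give a feel for what a processor might do with one of these messages, here is a minimal sketch. It assumes the field names Maxwell documents for its JSON output (database, table, type, ts, data, old), with old holding the previous values of the changed columns only; that should be checked against the version we deploy. The shop.orders table, its values and the changed_fields helper are made up for illustration.

```python
import json

# Sample modelled on the shape of a Maxwell update message.
# The database, table and values are invented for this example.
raw = """{"database": "shop", "table": "orders", "type": "update",
          "ts": 1449786341,
          "data": {"id": 1, "status": "paid"},
          "old": {"status": "pending"}}"""


def changed_fields(message):
    """Map each changed column to an (old value, new value) pair."""
    event = json.loads(message)
    new = event.get("data", {})
    old = event.get("old", {})  # absent on inserts, so default to empty
    return {column: (old[column], new.get(column)) for column in old}


event = json.loads(raw)
source = f'{event["database"]}.{event["table"]}'  # metadata a processor can filter on
changes = changed_fields(raw)
print(source, event["type"], changes)
```

A processor could use the database and table metadata to decide whether it cares about the message at all before looking at the changed columns.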

Overview of CDC with the rest of the TIS system