Replicating your production database is critical to the health of your company.
You use Redshift (or your data warehouse of choice) to back up data, run analytics, and deliver key business insights to the executive team.
There’s just one little problem to solve…
How do you actually move your data?
You have several options, including copying and exporting your entire database with each update.
But there’s a better option for data-driven companies without time to spare:
Change Data Capture (CDC).
# Why Use CDC to Replicate to Redshift?
Unlike other replication methods like Export & Import and Incremental SELECT & COPY, CDC is fast, comprehensive, and lightweight.
CDC is Fast. CDC only replicates changes made to your database since the last update. The Export & Import method replicates your entire database each time, which can take anywhere from 30 minutes to a few hours.
CDC is Comprehensive. Incremental SELECT & COPY is fast, but it doesn’t replicate changes to table schema or delete unused rows from your data warehouse. CDC replicates all changes to data and table schema.
CDC is Lightweight. Unlike Incremental SELECT & COPY, CDC does not impact the performance of your production database while replicating to your data warehouse.
However, there is one major drawback to using CDC — it’s a complex process that takes considerable resources to build and maintain.
But don’t worry — We’ll share a workaround to this challenge at the end of the article.
For now, let’s look at how to use CDC to replicate your database.
# How to Use Change Data Capture
CDC requires an application that reads, transforms, and imports streaming data from your production database to your data warehouse.
You can build this application using open source tools such as mysql-replication-listener (MySQL), kodama (Ruby) and python-mysql-replication (Python). Or, you can use an off-the-shelf ETL like FlyData.
To build your own CDC application, follow these steps:
Step 1: Capture Change Data
The majority of modern RDBMSs (like MySQL) offer two ways to capture change data: transaction logs and triggers. Let’s look at each:
Transaction logs were originally built to synchronize replicas of your production database. However, you can also use them to replicate to an outside data warehouse.
Each RDBMS has a proprietary transaction log, which makes it difficult to build and maintain this replication method. For example, MySQL’s transaction log is called the Binary Log (binlog), while PostgreSQL uses Write-Ahead Logging (WAL).
The other challenge with using transaction logs is that there is little documentation available. As we mentioned, they were originally built for internal use only.
Finally, transaction logs are almost always in a binary format, which limits the types of data you can replicate without transformation.
You can also generate a changelog of your own with triggers. Triggers are hooks that fire before or after certain events (e.g., “after insert” or “before update”).
The big advantage of triggers is that they can be managed at the SQL level. The downside is that triggers can have a major impact on the performance of your production database. Triggers also need to be defined separately for each table.
For these reasons, we recommend using transaction logs for capturing change data.
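As an illustration, here is a minimal sketch of capturing change data from MySQL's binlog with the python-mysql-replication library mentioned above. The connection settings and `server_id` are placeholder assumptions, so the live stream-reading loop is shown as comments; the helper that normalizes raw row events into flat change records can stand on its own.

```python
# Sketch: capturing change data from MySQL's binlog with
# python-mysql-replication (pip install mysql-replication).
# Connection details below are placeholder assumptions.

def normalize_row_event(event_type, schema, table, row):
    """Turn a raw binlog row event into a flat change record."""
    if event_type == "write":
        return {"op": "INSERT", "schema": schema, "table": table,
                "values": row["values"]}
    if event_type == "update":
        return {"op": "UPDATE", "schema": schema, "table": table,
                "values": row["after_values"]}
    if event_type == "delete":
        return {"op": "DELETE", "schema": schema, "table": table,
                "values": row["values"]}
    raise ValueError(f"unknown event type: {event_type}")

# Against a live MySQL server, the stream would look roughly like:
#
# from pymysqlreplication import BinLogStreamReader
# from pymysqlreplication.row_event import WriteRowsEvent
#
# stream = BinLogStreamReader(
#     connection_settings={"host": "127.0.0.1", "port": 3306,
#                          "user": "repl", "passwd": "secret"},
#     server_id=100,        # must be unique among replicas
#     blocking=True,
#     resume_stream=True)
#
# for binlog_event in stream:
#     if isinstance(binlog_event, WriteRowsEvent):
#         for row in binlog_event.rows:
#             change = normalize_row_event(
#                 "write", binlog_event.schema,
#                 binlog_event.table, row)
#             # ... queue `change` for transformation and upload
```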
Step 2: Transform Change Data
Captured change data then needs to be transformed into a format that can be uploaded to your data warehouse. The format will depend on what destination database you use. Some data warehouses will require multiple transformations to your original data.
Here are a few examples of differences you’ll come across when replicating MySQL to Amazon Redshift:
| Difference | MySQL | Redshift |
|---|---|---|
| Case-sensitive table/column names | Yes | No |
| TIME values beyond 23:59:59 | Yes | No |
| VARCHAR length | Character length | Byte length |
| Date '0000-00-00' | Allowed | Not allowed |
| BINARY, VARBINARY | Supported | Not supported |
| ENUM, SET | Supported | Not supported |
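A minimal sketch of what per-value transformation looks like for a few of these differences. The function name and rules are our own simplified illustration; a production pipeline needs many more rules (encodings, numeric precision, BINARY columns, and so on).

```python
# Sketch: normalizing MySQL values so Redshift's COPY will accept them.
# Rules below mirror a few rows of the comparison table; this is a
# simplified illustration, not a complete type map.

def mysql_value_to_redshift(column_type, value):
    if value is None:
        return None
    # Redshift rejects MySQL's zero date '0000-00-00'; load it as NULL.
    if column_type in ("date", "datetime") and str(value).startswith("0000-00-00"):
        return None
    # MySQL TIME can exceed 23:59:59 (it doubles as an interval type);
    # clamp it to the largest value Redshift accepts.
    if column_type == "time":
        hours = int(str(value).split(":")[0])
        if hours > 23:
            return "23:59:59"
    # Redshift has no ENUM/SET; store them as plain strings.
    if column_type in ("enum", "set"):
        return str(value)
    return value
```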
Step 3: Upload Change Data
Finally, it’s time to upload your change data to your destination. If your destination is another RDBMS, then uploading is as simple as running a few SQL statements.
However, if your destination is a data warehouse like Redshift, SQL statements are not an ideal option from a performance standpoint.
Here’s how to upload change data to Redshift and other BI data warehouses:
Use Special Upload Commands. Instead of using SQL statements, you’ll want to use special commands designed for data upload to that specific warehouse. For example, you’ll use the COPY command for Redshift.
Handle UPDATEs and DELETEs Separately. Upload commands like COPY only support INSERTs, not UPDATEs or DELETEs. That means you’ll have to handle UPDATE and DELETE changes separately. With Redshift, there’s a shortcut for this step: the merge (or “upsert”) pattern, which loads your changes into a staging table and then applies UPDATEs and INSERTs to the target table in one pass.
Table Schema Changes. In addition to capturing data changes, you’ll also need to deal with table schema changes, such as adding a new column. This requires capturing DDL statements (e.g., ALTER TABLE), translating them into the destination database’s SQL dialect, and executing them.
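To make the DDL translation step concrete, here is a deliberately narrow sketch that handles only `ALTER TABLE ... ADD COLUMN`. The type map is partial and assumed; real DDL translation must cover far more statement forms and types.

```python
import re

# Sketch: translating one narrow class of MySQL DDL (ADD COLUMN)
# into Redshift DDL. The type map is a partial, assumed mapping.

MYSQL_TO_REDSHIFT_TYPES = {
    "int": "INTEGER",
    "tinyint": "SMALLINT",
    "datetime": "TIMESTAMP",
    "text": "VARCHAR(65535)",
}

def translate_add_column(mysql_ddl):
    m = re.match(r"ALTER TABLE (\w+) ADD COLUMN (\w+) (\w+)",
                 mysql_ddl, re.IGNORECASE)
    if not m:
        raise ValueError("only ALTER TABLE ... ADD COLUMN is handled here")
    table, column, mysql_type = m.groups()
    # Fall back to upper-casing the type when it has no mapping.
    rs_type = MYSQL_TO_REDSHIFT_TYPES.get(mysql_type.lower(),
                                          mysql_type.upper())
    return f"ALTER TABLE {table} ADD COLUMN {column} {rs_type}"
```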
Error Handling. Error handling is crucial when using CDC. Any mistakes in applying data changes can destroy the integrity of your entire data set.
We can write a whole new post on error handling, but in short, you just need to be “aware” of any errors that occur.
For example, let’s say one INSERT fails due to some data type error. Let’s also assume that the very next change was an UPDATE to that record that just failed to be inserted.
If you’re unaware of the initial error, the next UPDATE will error out as well. You need to recognize errors the moment they happen, buffer all subsequent change data, wait until the initial error is corrected, and then automatically resume syncing the buffered changes in order.
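That pause-buffer-resume behavior can be sketched as a small state machine. Everything here is illustrative: `apply_fn` stands in for the real upload step, and the class name is our own.

```python
from collections import deque

# Sketch: a change applier that stops on the first error, buffers
# subsequent changes in order, and replays them once the failed
# change has been corrected. `apply_fn` stands in for the real upload.

class ChangeBuffer:
    def __init__(self, apply_fn):
        self.apply_fn = apply_fn
        self.pending = deque()   # changes held back while paused
        self.failed = None       # the change that errored, if any

    def submit(self, change):
        if self.failed is not None:
            self.pending.append(change)  # paused: buffer in order
            return
        try:
            self.apply_fn(change)
        except Exception:
            self.failed = change         # pause until corrected

    def resume(self, fixed_change):
        """Apply the corrected change, then replay the buffer."""
        self.apply_fn(fixed_change)
        self.failed = None
        while self.pending and self.failed is None:
            self.submit(self.pending.popleft())
```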
Now imagine doing this for billions of records across hundreds of tables! The task is truly daunting, but with the right tools, you can be confident in your CDC replication solution.
# Simplify CDC Replication with FlyData
Change Data Capture is the best way to replicate data to your data warehouse, but it’s also the most complex. That’s why we built FlyData.
FlyData uses the CDC method to replicate data to Redshift in near real-time.
Unlike a homegrown CDC pipeline, though, FlyData is a turnkey solution for replicating to Redshift. It manages the entire replication process, so you don’t have to. You can even synchronize multiple databases to Redshift at the same time.
In addition, FlyData automatically maps data types into formats used by Redshift. This eliminates the need for your team to do this manually.
Finally, FlyData’s world-class, 24/7 customer support will proactively monitor your pipeline to ensure you’re always up-and-running.
FlyData isn’t the only ETL-as-a-Service on the market, but it is the simplest and most reliable ETL.
Are you ready to solve your data replication problem for good?