MySQL is the most popular open source cloud database in the world, and for good reason. It’s powerful, flexible, and extremely reliable. Tens of thousands of companies use MySQL to power their web-based applications and services every day.
But when it comes to data analytics, it’s a different story. MySQL is quickly bogged down by even the smallest analytical queries, putting your entire application at risk of crashing. As one FlyData customer said to us, “I have nightmares about our MySQL production database going down.”
That’s why so many companies turn to Amazon Redshift to complement MySQL.
Redshift is built to handle petabytes of data and deliver analytics in a fraction of the time. When you pair MySQL and Redshift, you eliminate the risk of crashing your production database while running queries.
There are several ways to replicate your MySQL data to Redshift. But first, let’s dig a little deeper into why you should replicate your MySQL database to Redshift:
# Why Replicate Data from MySQL to Redshift
Many companies that use MySQL to power their web applications choose Redshift for data analytics. There are several reasons why you should do this, too:
Maintain application performance. As we’ve already mentioned, running analytical queries on your production MySQL database can have a severe impact on its performance. It could even cause it to crash. Analytical queries are incredibly resource-intensive and require dedicated computing power.
Analyze ALL of your data. As an OLTP database, MySQL is designed for transactional data like customer records and financial data. However, you want to draw insights from your entire set of data, including the non-transactional type. You can use Redshift to capture and analyze all of your data in one place.
Faster analytics. Redshift is a Massively Parallel Processing (MPP) data warehouse, which means it can crunch huge sets of data in a fraction of the time. MySQL, on the other hand, struggles to scale to the compute power needed for large, modern analytical queries. Even a MySQL replica database will struggle to achieve the same speed as Redshift.
Scalability. MySQL was designed to work on a single-node instance, not modern distributed cloud infrastructure. Therefore, scaling beyond a single node requires time- and resource-intensive techniques like sharding or master-node setup. All of this slows down the database even further.
# Four Ways to Replicate MySQL to Redshift
Because of MySQL’s inherent weaknesses, many companies replicate data to Redshift for their analytics needs. There are 4 ways to accomplish this:
- Import & Export
- Incremental SELECT & COPY
- Change Data Capture (CDC) with Binlog
- FlyData ETL
1. Import & Export
The simplest way to replicate to Redshift is to export your entire MySQL data. However, this is also the least efficient method. There are three steps:
To start, export data using MySQL’s mysqldump command. A typical mysqldump command looks like this:
$ mysqldump -h yourmysqlhost -u user mydatabase mytable1 mytable2 --result-file dump.sql
The output of this command is your MySQL SQL statement. You cannot run the SQL on Redshift as is — you’ll have to transform the statement into a format suitable for Redshift import.
For the best upload performance, convert your SQL statement into TSV (tab separated values) format. You can do this by using the Redshift COPY command.
The COPY command converts your SQL statement into TSV format. Then it batch uploads the files into a Redshift table in Amazon S3. For example, a row of data in your MySQL dump would look like this:
mysql> INSERT INTO `users` (`id`, `firstname`, `lastname`, `age`) VALUES (1923, ‘John’, ‘Smith’, 34),(1925,’Tommy’,’King’);
Using COPY, it will be transformed into this:
Note that values are separated by a tab character (\t)
You may also have to convert data values to be Redshift compatible. This is because MySQL and Redshift support different column and data types.
For example, the DATE value ‘0000-00-00’ is valid in MySQL, but will throw an error in Redshift. You have to convert the value into an acceptable Redshift format, like ‘0001-01-01.’
After you’ve transformed your MySQL statement, the last step is to import it from S3 to Redshift. To do this, simply run the COPY command:
COPY users FROM 's3://my_s3_bucket/unload-folder/users_' credentials 'aws_access_key_id=your_access_key;aws_secret_access_key=your_secret_key';
Downsides of Import & Export
Although Import and Export is the simplest way to replicate to Redshift, it is not ideal for frequent updates.
For example, it takes roughly 30 minutes to export 18GB of data from MySQL over a 100Mbps network. It takes another 30 minutes to import that data into Redshift. This assumes you experience zero connection issues during import or export, which would force you to start the process over.
A more efficient method of replicating MySQL to Redshift is Incremental SELECT & COPY.
2. Incremental SELECT & COPY
If the Import & Export is too slow for your needs, incremental SELECT & COPY might be your answer.
The SELECT & COPY method only updates the records that have changed since the last update. This takes considerably less time and bandwidth compared to importing and exporting your entire dataset. SELECT & COPY enables you to sync MySQL and Redshift much more frequently.
To use incremental SELECT & COPY, your MySQL table has to meet a couple of conditions:
Table must have an updated_at column, whose timestamp gets updated every time the row is changed. Table must have a unique key or keys
Like Import & Export, there are three steps to this method:
Incremental SELECT exports only the rows that have changed since the last update. The SELECT query you run on MySQL looks like this:
SELECT * FROM users WHERE updated_at >= ‘2016-08-12 20:00:00’;
Save the result to a file for transformation.
This transformation step is the same as the Export & Import method. Transform the MySQL data into TSV format for Redshift.
At this point, your MySQL TSV file includes both updated rows and newly inserted rows. You cannot simply run a COPY command straight to your destination Redshift table. This would cause the updated rows to be duplicated.
To avoid duplicate rows, use the DELSERT (DELete + inSERT) technique:
Create a temporary table on Redshift with the same definition as the destination table.
Run COPY command to upload data to the temporary table.
Delete rows from the destination table which also exist in the temporary table. It will look like this:
DELETE FROM users USING users_staging s WHERE users.id = s.id;
Where “id” is the unique key of the table.
- Finally, Insert rows from the temporary table to the destination table:
INSERT INTO users (id, firstname, lastname, updated_at) SELECT id, firstname, lastname, updated_at FROM users_staging s;
Downsides of SELECT & COPY
Incremental SELECT & COPY is more efficient than Export & Import, but it has limitations of its own.
The main problem is that rows deleted from your MySQL table stay in Redshift indefinitely. This isn’t an issue if you want to keep historical data on Redshift while purging old data from MySQL. Otherwise, deleted rows in Redshift can cause major headaches during data analysis.
Another drawback of this method is that it doesn’t replicate table schema changes. When a column is added or dropped from the MySQL table, you need to make the corresponding change on the Redshift table manually.
Finally, the query used to pull updated rows from a MySQL table can affect the performance of your MySQL database.
If any of these drawbacks are a dealbreaker, this next method is for you: Change Data Capture with Binlog.
3. Change Data Capture with Binlog
Change Data Capture (CDC) is a technique that captures changes made to data in MySQL and applies it to the destination Redshift table. It’s similar to incremental SELECT & COPY in that it only imports changed data, not the entire database.
Unlike Incremental SELECT & COPY, however, CDC allows you to achieve true replication of MySQL to Redshift.
To use the CDC method with a MySQL database, you must utilize the Binary Change Log (binlog). Binlog allows you to capture change data as a stream, enabling near real-time replication.
Binlog not only captures data changes (INSERT, UPDATE, DELETE) but also table schema changes such as ADD/DROP COLUMN. It also ensures that rows deleted from MySQL are also deleted in Redshift.
Getting Started with binlog
When you use CDC with binlog, you are actually writing an application that reads, transforms, and imports streaming data from MySQL to Redshift.
There is an open source library you can use to do this called mysql-replication-listener. This C++ library offers a streaming API to read data from MySQL binlog in real time. A high level API is also available for a couple of languages: kodama (Ruby) and python-mysql-replication (Python).
First, set the MySQL config parameters to enable binlog. The following is a list of parameters related to binlog:
log_bin = /file_path/mysql-bin.log
Parameter binlog_format sets the format how binlog events are stored in the binlog file. There are 3 supported formats: STATEMENT, MIXED and ROW.
STATEMENT format saves queries in the binlog files as is (e.g. UPDATE SET firstname=’Tom’ WHERE id=293;). Although it saves binlog file size, it has issues when used for replication.
For replication to Redshift, use ROW format.
ROW format saves changed values in the binlog files. It increases the binlog file size but ensures data consistency between MySQL and Amazon Redshift. log_bin sets the path where binlog files are stored. expire_logs_days determines how many days binlog files are kept.
Specify the tables you’d like to replicate in the replicate-wild-do-table parameter. Only those tables specified should go into the binlog files.
We recommend keeping binlog files for a couple days. This ensures you have time to address any issues that arise during replication.
If you use a MySQL replication slave server as the source, it’s important to specify the log-slave-updates to TRUE. Otherwise, data changes made on the replication master will not be logged in the binlog.
Also, your MySQL account needs to have the following privileges in order to perform replication related tasks:
- REPLICATION SLAVE
- REPLICATION CLIENT
- LOCK TABLES
2. Export & Transformation
When you use the binlog, “export” is really a real-time data stream of your MySQL binlog files. How the binlog data is delivered depends on the API you use.
For example, with Kodama, binlog data is delivered as a stream of binlog events.
Kodama lets you register event handlers for different event types (insert, update, delete, alter table, create table, etc). Your application will receive binlog events. It will then generate an output ready for Redshift import (for data changes) or schema change (for table schema changes).
The data change import is similar to the Transformation steps of our other replication methods. Unlike the others, however, binlog allows you to handle delete events. You need to handle delete events specifically to maintain Redshift Upload Performance.
Finally, it’s time to import your binlog data stream.
The problem is Redshift doesn’t have steaming upload functionality. Use the DELSERT import technique we outlined in the Incremental SELECT & COPY method.
Downsides of Binlog
Binlog is the ideal method of replication from MySQL to Redshift, but it still has downsides:
Building your CDC application requires serious development effort.
In addition to the data streaming flow we described above, you will have to build:
Transaction management. Track data streaming performance in case an error forces your application to stop while reading binlog data. Transaction management ensures you can pick up where you left off.
Data buffering and retry. Similarly, Redshift can become unavailable while your application is sending data. Your application needs to buffer unsent data until the Redshift cluster comes back online. If this step is done incorrectly, it can cause data loss or duplicate data.
Table schema change support. A table schema change binlog event (ALTER/ADD/DROP TABLE) comes as a native MySQL SQL statement which does not run on Redshift as is. To support table schema changes, you’ll have to convert MySQL statements to the corresponding Amazon Redshift statements.
If you don’t want to spend developer resources on your own CDC binlog application, there’s a fourth method: FlyData.
4. FlyData: MySQL to Redshift Replication in Minutes
FlyData’s ETL pipeline utilizes the MySQL binlog to replicate data to Redshift in near real-time.
Unlike the CDC method, FlyData manages the entire replication process, so you don’t have to. You can even synchronize multiple MySQL databases (and other types of databases as well) to Redshift at the same time.
Setting up FlyData is so simple, you can do it over your lunch break:
Enable binary logging on MySQL.
Provide FlyData with access information on your servers and with Amazon Redshift.
Recreate your MySQL tables on Amazon Redshift
When you’re ready, FlyData will replicate your MySQL data to Redshift.
Whenever a change is made in the binlog (data or schema change), FlyData will automatically replicate those changes to Redshift.
In addition, FlyData automatically maps MySQL data types into formats used by Redshift. This eliminates the need for your team to do this manually.
FlyData isn’t the only ETL-as-a-Service on the market. But it is the simplest and most reliable ETL.
Our world-class, 24/7 customer support will proactively monitor your pipeline to ensure you’re always up-and-running.
# Get more out of MySQL with Amazon Redshift
You rely on MySQL to power your business, but its limitations with data analytics are well-known. Redshift provides a simple, powerful solution to your BI needs. Together, MySQL and Redshift can push your business to the next level.
As you’ve seen, there are numerous ways to replicate data from MySQL to Redshift. Methods range from simple to complex, and painfully slow to near real-time. The method you choose depends on several factors:
- Replication frequency
- Size of your MySQL dataset
- Available developer resources
The fastest, truest replication method is Change Data Capture (CDC), which utilizes MySQL’s binlog. The downside is that it requires developer hours to build and maintain the application.
That’s where an off-the-shelf tool like FlyData comes in.
Use FlyData for fast, reliable replication from MySQL to Redshift, without the hassle and headache.