# FlyData Anatomy Series Part 1: The Data Extraction Process
Streaming data into Amazon Redshift can be a challenge. At FlyData, we strive to make data loading as easy as possible, so that your data is always available in your Amazon Redshift cluster for you to query. In this series, we'd like to walk you through the data journey, from data generation all the way to reporting, and how FlyData fits in at each step. We will also share some of the approaches we have taken and the options you have when choosing your data architecture with FlyData. Let's start with the very first step: the FlyData Agent.
# From Data Generation to Data Extraction
In the overall picture, getting data into Amazon Redshift follows the typical Extract-Transform-Load (ETL) process: you figure out what data you want to load, extract it from your data source, transform it to fit your data destination, and finally load it into that destination. On top of that, you'll have other processes, like parallelization, validation, and error handling, that are integral to ensuring accurate and efficient data transfer. These things are simple in theory but get quite complex in practice, especially when implementing a solution that works in the cloud, where issues like network errors are common.

The very first process is figuring out what data to load. FlyData takes the Change Data Capture (CDC) approach. CDC minimizes the amount of data to load, as well as the time associated with loading it, by identifying the changes made to your data source and uploading just that change data. This makes it a perfect fit for use cases where your application database(s) and data warehouse need to be kept in sync on a constant basis.

To identify those changes, we need a process that monitors the data source, so that anytime a change is made, the process picks up the change data and prepares it for the subsequent transfer. For this extraction process, we provide two options: (1) having FlyData directly extract from your data source; and (2) using the installable FlyData Agent. Let's take a look at each.
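To make the CDC idea concrete, here is a minimal sketch in Python. It diffs two snapshots of a table keyed by primary key and emits only the deltas; a production CDC pipeline (FlyData's included) typically reads the database's change log instead of diffing snapshots, and the `ChangeEvent` type and `capture_changes` function here are illustrative, not FlyData's actual API.

```python
from dataclasses import dataclass

@dataclass
class ChangeEvent:
    """One row-level change captured from the source database."""
    op: str        # "insert", "update", or "delete"
    table: str
    row: dict

def capture_changes(old_rows: dict, new_rows: dict, table: str) -> list:
    """Diff two snapshots keyed by primary key and emit only the deltas.

    Shipping just these events is far cheaper than re-loading the full
    table, which is the core appeal of the CDC approach.
    """
    events = []
    for pk, row in new_rows.items():
        if pk not in old_rows:
            events.append(ChangeEvent("insert", table, row))
        elif old_rows[pk] != row:
            events.append(ChangeEvent("update", table, row))
    for pk, row in old_rows.items():
        if pk not in new_rows:
            events.append(ChangeEvent("delete", table, row))
    return events
```

However the changes are detected, the output is the same shape: a stream of row-level change events, which is what the rest of the pipeline operates on.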
# The Data Extraction Process
Having FlyData Extract From Your Data Source
The default way to extract data from your data source is to have FlyData extract it directly. The advantage of this option is that there is less work on the customer's end, since FlyData manages everything related to the extraction process. The potential downside is that this option will not work for customers whose security policies don't allow external access to their MySQL servers.
Using the Installable FlyData Agent
The other option is to install and use the FlyData Agent. This is the only option we support for FlyData Autoload. Once configured and installed, the Agent monitors your data source and sends any changes made to that data source for subsequent processing. The downside is that you need to install the Agent, although it is a very lightweight piece of code. The Agent runs on Ubuntu/Debian 12.04+ and on CentOS/Red Hat 5+.

Illustration of Each Option
| Having FlyData Extract Directly | Using the FlyData Agent |
| --- | --- |
| No need to manage the Agent and its server | No need to expose a MySQL connection to FlyData |
| Any issues with the Agent are managed by FlyData | Need to manage the Agent process |
| Some savings on server costs related to running the Agent | Need to troubleshoot the Agent process if any client-side issues happen |
| Need to expose the MySQL port to FlyData | |
# Internals of the Extraction Process
When sending data across different applications over the internet, connection issues and application availability issues can happen at any time. When such issues happen, there is a risk of the data not actually reaching its intended destination. To prevent this, the sending process needs to halt and buffer any data that accumulates during that time. It also needs a mechanism to automatically resume sending once the connection is back up. And this needs to scale seamlessly. This is where Fluentd came into play.
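The buffer-and-resume behavior can be sketched as follows. This is a simplified model of what a log forwarder like Fluentd provides out of the box, not FlyData's implementation; `BufferedSender` and its parameters are hypothetical names, and `ConnectionError` stands in for whatever network failure the real transport raises.

```python
import time
from collections import deque

class BufferedSender:
    """Buffer outgoing chunks and retry with backoff when the endpoint is unreachable."""

    def __init__(self, send_fn, base_backoff=1.0, max_backoff=60.0):
        self.send_fn = send_fn        # callable that raises ConnectionError on failure
        self.base_backoff = base_backoff
        self.max_backoff = max_backoff
        self.buffer = deque()         # chunks waiting for a confirmed send

    def enqueue(self, chunk):
        self.buffer.append(chunk)

    def flush(self, max_attempts=10):
        """Drain the buffer in order; a chunk is discarded only after a
        confirmed send, so a failure mid-flush never loses buffered data."""
        backoff = self.base_backoff
        attempts = 0
        while self.buffer and attempts < max_attempts:
            try:
                self.send_fn(self.buffer[0])
                self.buffer.popleft()          # confirmed: safe to discard
                backoff = self.base_backoff    # reset backoff after a success
            except ConnectionError:
                attempts += 1
                time.sleep(min(backoff, self.max_backoff))
                backoff *= 2                   # exponential backoff before retrying
        return len(self.buffer) == 0           # True if fully drained
```

The key invariant is that a chunk leaves the buffer only after the send succeeds, which is what makes automatic resume safe after an outage.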
Fluentd is an open source project built for data collection, started by Treasure Data. Fluentd provided the bare bones of what we wanted to accomplish, so we built on it to create an optimized solution for continuous data sending and replication to Amazon Redshift. To start, we built a custom plugin that converts data into an Amazon Redshift compatible format so that we can load data into it.
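As a rough illustration of what such a conversion involves, here is a sketch that renders a record as a TSV line suitable for Redshift's COPY command. The exact escaping rules and NULL marker depend on the COPY options used (`\N` is a common convention); the `to_tsv_row` function is illustrative, not FlyData's plugin.

```python
def to_tsv_row(record: dict, columns: list) -> str:
    """Render one record as a TSV line for loading via Redshift COPY.

    Escapes characters that would break the TSV framing and marks
    NULL values with \\N (a common COPY NULL-marker convention).
    """
    fields = []
    for col in columns:
        value = record.get(col)
        if value is None:
            fields.append(r"\N")
        else:
            text = str(value)
            # Escape the escape character first, then the delimiters.
            text = text.replace("\\", "\\\\")
            text = text.replace("\t", "\\t").replace("\n", "\\n")
            fields.append(text)
    return "\t".join(fields)
```

A fixed column order matters here: COPY maps fields to table columns positionally, so every row must emit the same columns in the same order.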
FlyData Transport Format
Next, we created the FlyData Transport Format and a process to convert data to this format. The format adds some metadata so that we can accurately track whether or not particular data has been transferred to the destination successfully. Think of it as a FedEx tracking number, but applied to billions of records of data. Without this, no one can be certain whether the data actually made it to the destination. These are the areas where data loss or duplication typically occur, so we put extra focus on implementing a scalable solution that mitigates such issues. The FlyData Transport Format also provides a database-agnostic data format so that we can quickly add new data sources in the future.
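The tracking-number idea can be sketched as an envelope around each record. The field names below (`source`, `seq`, `checksum`, and so on) are invented for illustration and are not the actual FlyData Transport Format; the point is that a stable `(source, seq)` key lets the receiver confirm delivery and discard re-delivered duplicates.

```python
import hashlib
import json
import time

def wrap_record(payload: dict, source: str, seq: int) -> dict:
    """Wrap a change record in an envelope with tracking metadata.

    The (source, seq) pair acts like a tracking number: the receiver can
    tell exactly which records arrived and in what order they were produced.
    """
    body = json.dumps(payload, sort_keys=True)
    return {
        "source": source,     # which database/table produced this record
        "seq": seq,           # monotonically increasing sequence number
        "sent_at": time.time(),
        "checksum": hashlib.sha256(body.encode()).hexdigest(),  # detect corruption
        "payload": payload,
    }

def is_duplicate(envelope: dict, seen: set) -> bool:
    """Detect a re-delivered record by its (source, seq) tracking key."""
    key = (envelope["source"], envelope["seq"])
    if key in seen:
        return True
    seen.add(key)
    return False
```

With at-least-once delivery over an unreliable network, retries inevitably produce duplicates; a deterministic tracking key is what turns "maybe it arrived" into an answerable question.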
Sync MySQL Redshift Plugin
For our FlyData Sync product line in particular, we had the unique challenge of capturing change data made to a relational database. Fluentd only read from flat files, so we needed to build a custom solution around this. To implement data replication from MySQL to Amazon Redshift, we created a proprietary MySQL listener plugin, which captures changes recorded in the MySQL binary log (binlog). This plugin reads the binlog and converts its binary data into the FlyData Transport Format, so that it can be converted to an Amazon Redshift compatible format (we use TSV) down the line. Processing binary data, in particular, was not easy, but with our implementation we were able to achieve near-real-time data replication.

Another unique requirement of replication was the order of the data. In replication, the order in which the data arrives is of great importance. For example, if an UPDATE arrives at your data destination out of order, there will be an inconsistency between your data source and data destination. As Fluentd did not have resequencing logic built into it, we added functionality that reorders data to its original order before loading it into Amazon Redshift. We'll talk about this more in a future article in this series, as it relates heavily to the next step: the FlyData Cloud. For now, we will just mention that to achieve this resequencing, we utilize the metadata in the FlyData Transport Format.
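Resequencing by a per-record sequence number can be sketched like this. The `Resequencer` class is a hypothetical illustration, assuming each event carries a gapless sequence number (like the metadata described above); it buffers out-of-order arrivals and releases events only once every earlier sequence number has been seen.

```python
import heapq

class Resequencer:
    """Re-emit out-of-order events in their original order, holding back
    later events until every missing sequence number has arrived."""

    def __init__(self, first_seq: int = 0):
        self.next_seq = first_seq
        self.pending = []            # min-heap of (seq, event)

    def push(self, seq: int, event) -> list:
        """Accept one event; return whatever prefix is now contiguous."""
        heapq.heappush(self.pending, (seq, event))
        ready = []
        while self.pending and self.pending[0][0] == self.next_seq:
            _, ev = heapq.heappop(self.pending)
            ready.append(ev)
            self.next_seq += 1
        return ready
```

Because an UPDATE applied before the INSERT it modifies (or after a later UPDATE) corrupts the destination, the loader must only ever consume the contiguous, in-order prefix that a structure like this emits.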
With the above implementations, the FlyData Agent process was able to make quite notable achievements:
- FlyData Autoload: If you have a web application running across 100 different servers and want to perform log analysis, you can have all your log data in Amazon Redshift and query it in one place, in near real time.
- FlyData Sync: Using FlyData Sync, we can achieve near-real-time replication from MySQL to Amazon Redshift, including DDL statements.
Through this article, we hope we were able to shed some light on the internals of FlyData and how we manage data integration to Amazon Redshift. In subsequent parts of this series, we'll cover how the data collected by the extraction process eventually makes its way to Amazon Redshift.