This article is a sequel to an earlier article: FlyData Anatomy Series: FlyData Cloud, Part 1. In this series, we are walking through the data flow to show how FlyData processes your data into Amazon Redshift.
# From Data Extraction to Data Processing
In last week’s article, we briefly went over what happens in the FlyData Cloud:
- Allocation of workload
- Validation of data types
- Conversion of data to an Amazon Redshift compatible format (we use TSV)
- Saving data to S3
- Any tracking related to the upload and transformation process
- Handling and management of COPY commands
- Error Handling
Let’s take a look at each of them below.
# Receiving Data and Allocating Workload
The first thing that happens after data extraction is that the FlyData Cloud receives the data. The data is sent to the FlyData Cloud over SSL, and we use Round-Robin DNS to allocate the workload to proxy servers, which in turn allocate it to multiple data processing servers.
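To make the allocation idea concrete, here is a minimal sketch of round-robin rotation. It is not FlyData's implementation: real Round-Robin DNS rotates addresses at the DNS level, and the host names and the `make_allocator` helper below are hypothetical.

```python
from itertools import cycle


def make_allocator(servers):
    """Return a function that hands out servers in round-robin order."""
    ring = cycle(servers)
    return lambda: next(ring)


# Hypothetical proxy host names, for illustration only.
pick_proxy = make_allocator(["proxy-1", "proxy-2", "proxy-3"])

# Five incoming requests are spread evenly and wrap around after the third.
assignments = [pick_proxy() for _ in range(5)]
```

The same rotation could be applied a second time, from each proxy to its pool of data processing servers.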
# Validation and Transformation
After receiving data from the proxies, the servers begin transforming the data into an Amazon Redshift-compatible format. To do so, they first check the schema of the target Redshift table by running a query against the target Redshift cluster. Using this schema information, they validate the data (e.g., checking the data type) to minimize potential errors down the road. Once the data is validated, it is converted to TSV so that it can be loaded into Amazon Redshift.
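As a rough illustration of validating against a schema and then converting to TSV, consider the sketch below. The `SCHEMA` mapping is an assumption standing in for what a query against the target table would return, and `validate_row` and `rows_to_tsv` are hypothetical helper names. The sketch also assumes values contain no tabs or newlines.

```python
import csv
import io

# Hypothetical schema: column name -> Python type standing in for a Redshift type.
SCHEMA = {"id": int, "name": str, "price": float}


def validate_row(row, schema=SCHEMA):
    """Return a list of error messages; an empty list means the row is valid."""
    errors = []
    for column, expected in schema.items():
        if column not in row:
            errors.append(f"missing column: {column}")
            continue
        try:
            expected(row[column])  # check that the value converts to the expected type
        except (TypeError, ValueError):
            errors.append(
                f"{column}: cannot convert {row[column]!r} to {expected.__name__}"
            )
    return errors


def rows_to_tsv(rows, schema=SCHEMA):
    """Serialize rows as tab-separated text, with columns in schema order."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    for row in rows:
        writer.writerow([row[column] for column in schema])
    return buf.getvalue()


good = {"id": "1", "name": "pen", "price": "2.5"}
bad = {"id": "abc", "name": "pen", "price": "2.5"}
valid_rows = [r for r in (good, bad) if not validate_row(r)]
tsv = rows_to_tsv(valid_rows)
```

Only rows that pass validation reach the TSV step, which mirrors the goal of catching type problems before the load rather than during it.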
# Saving to S3
The fastest way to load data into Amazon Redshift is to run COPY commands. FlyData organizes the data, compresses the files, and then saves the TSV data to an S3 bucket. It then updates its internal tracking mechanism so that COPY commands can be run later, at the right time.
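The compress-and-track step might look something like the sketch below. It stops short of the actual S3 upload, and the key layout, the `pending_copies` list, and the function names are all assumptions made for illustration.

```python
import gzip


def compress_tsv(tsv_text):
    """Gzip-compress TSV text into bytes ready for upload."""
    return gzip.compress(tsv_text.encode("utf-8"))


def object_key(table, sequence):
    """Build a hypothetical S3 key; zero-padding keeps lexical and numeric order aligned."""
    return f"{table}/{sequence:012d}.tsv.gz"


# Hypothetical tracking structure: files saved but not yet loaded by COPY.
pending_copies = []


def save_and_track(table, sequence, tsv_text):
    """Compress the data and record it as pending; the upload itself is omitted here."""
    payload = compress_tsv(tsv_text)
    key = object_key(table, sequence)
    pending_copies.append({"table": table, "sequence": sequence, "key": key})
    return key, payload


key, payload = save_and_track("orders", 42, "1\tpen\t2.5\n")
```

In a real pipeline the tracking entry would be written only after the upload succeeds, so that a COPY is never scheduled for a file that is not in the bucket.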
# Managing COPY Commands
When performing replication, the order of the data is critical. To make sure that we load data into Amazon Redshift in the right order, FlyData checks the metadata that was added in the previous step (in the FlyData Transport Format) and runs COPY commands in that order. This preserves the sequence of incoming data all the way to your Redshift cluster. FlyData uses internal hooks to trigger COPY commands: whenever data is saved to S3 successfully, a hook triggers the corresponding COPY command, in order, and that final COPY step loads the data into Amazon Redshift.
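One way to picture "trigger on save, but load in order" is a small scheduler that holds back files that arrive ahead of their turn. This is a sketch under assumptions, not FlyData's hook mechanism: the integer sequence number stands in for the ordering metadata, and `CopyScheduler`, the bucket name, and the IAM role ARN are hypothetical. The generated statement uses standard Redshift COPY options (`IAM_ROLE`, `DELIMITER`, `GZIP`).

```python
class CopyScheduler:
    """Release COPY statements strictly in sequence order."""

    def __init__(self, table, bucket, iam_role, first_sequence=1):
        self.table = table
        self.bucket = bucket
        self.iam_role = iam_role
        self.next_sequence = first_sequence
        self.pending = {}  # sequence number -> S3 key, saved but not yet released

    def on_saved(self, sequence, key):
        """Hook to call after a successful save; returns COPY statements now ready to run."""
        self.pending[sequence] = key
        ready = []
        # Release every file that is next in line; later files wait for earlier ones.
        while self.next_sequence in self.pending:
            next_key = self.pending.pop(self.next_sequence)
            ready.append(self._copy_sql(next_key))
            self.next_sequence += 1
        return ready

    def _copy_sql(self, key):
        return (
            f"COPY {self.table} FROM 's3://{self.bucket}/{key}' "
            f"IAM_ROLE '{self.iam_role}' DELIMITER '\\t' GZIP;"
        )


scheduler = CopyScheduler(
    "orders", "example-bucket", "arn:aws:iam::123456789012:role/example-role"
)
early = scheduler.on_saved(2, "orders/000000000002.tsv.gz")    # arrives out of order
in_order = scheduler.on_saved(1, "orders/000000000001.tsv.gz")  # unblocks both files
```

Holding file 2 until file 1 has been saved is what lets processing run in parallel without changing the order in which rows reach the table.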
We hope this gave you a detailed picture of the FlyData Cloud process. The hardest parts of the process are retaining data integrity in an environment where networks can be unstable (here’s our write-up on Change Data Capture) and parallelization. If you have any questions regarding this process, please feel free to reach out to us via email@example.com. Thanks for reading!