Lately, quite a few organizations have decided to speed up query performance with Amazon Redshift. Writing a MapReduce job, testing it, fixing the bugs, and then waiting an hour or more while it runs against a Hadoop cluster can be an exercise in patience, and corporations large and small are opting for the high-performance results Redshift brings to the table.

Amazon Redshift is a petabyte-scale, cloud-based data warehouse solution. Many organizations are turning to it as a cost-effective, fully managed, powerful answer to their data warehousing needs. Not only is it a hassle-free alternative to competing solutions, but its query performance is optimized for extremely fast data analysis.

Redshift's fast and powerful performance rests on five key enablers in its architectural arsenal: massively parallel processing, columnar data storage, data compression, query optimization, and compiled code. What are these individual technologies, and how do they combine to achieve such accelerated query results? The key lies in the way each works in unison with the others to make Amazon Redshift the solution so many organizations are seriously considering.
The Five Pillars of Amazon Redshift

- Massively parallel processing
- Columnar data storage
- Data compression
- Query optimization
- Compiled code
# 1. Massively Parallel Processing
Massively parallel processing (MPP) is a technology that uses multiple, independently operating processors to work on datasets too large for a single processor to handle efficiently. Systems that use MPP greatly reduce the time a query against a large data warehouse would take if processed conventionally. In simplified terms, MPP works like this:

1. A very large dataset is broken up into smaller, manageable pieces.
2. Each piece is handed off to an individual processor in the system. There could be over a thousand processors, all working on different parts of the same large dataset.
3. The individual processors communicate with each other through a messaging system.
4. Once all processors have completed their assigned work, the separate results are combined into one large resultset.

MPP is also known as a "shared nothing" architecture: each processor uses its own memory and operating system, which is why each individual slice of data can be processed independently. This allows very fast, smooth processing without the bottleneck inherent in symmetric multiprocessing (SMP) systems, where memory is shared. Another performance advantage comes from a data optimization feature of MPP systems: the flow of data is monitored and coordinated, balancing the workload and speeding up processing further. All things considered, massively parallel processing is a critical component of the performance speed Amazon Redshift has become known for.
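The four steps above can be sketched in a few lines of Python, using a process pool as a stand-in for independent compute nodes. This is a minimal illustration of the MPP flow, not Redshift's internals; the dataset, chunking scheme, and aggregation function are all assumptions chosen for clarity.

```python
# Sketch of the MPP flow: split, fan out, communicate, combine.
from multiprocessing import Pool

def partial_sum(chunk):
    # Step 2: each worker processes its own slice independently,
    # with its own memory ("shared nothing").
    return sum(chunk)

def mpp_sum(data, workers=4):
    # Step 1: break the dataset into smaller, manageable pieces.
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # Steps 2-3: hand each piece to a separate process; the pool's
    # internal messaging collects the partial results.
    with Pool(workers) as pool:
        partials = pool.map(partial_sum, chunks)
    # Step 4: combine the separate results into one final resultset.
    return sum(partials)

if __name__ == "__main__":
    # Same answer as a serial sum, but the work was divided.
    print(mpp_sum(list(range(1_000_000))))
```

Each worker here owns its chunk outright, which is the essence of the "shared nothing" design: no worker ever waits on another's memory.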
# 2. Columnar Data Storage
In a relational database, data stored in tables is arranged in rows corresponding to records. On the storage medium where the data resides, it is physically arranged row after row. In a columnar database, the data is instead physically stored column by column, with the columns aligned so that elements at the same position belong to the same record. In other words, if we take the 10th element of the first column, the 10th element of the second column is part of the same record. This may be difficult to conceptualize, but the following tables may clear things up.
Table A: Simple Database Table

| ID  | First Name | Last Name |
|-----|------------|-----------|
| 101 | Sally      | Jones     |
| 102 | Pete       | Smith     |
| 103 | Julie      | Roy       |

Table B: How It Is Physically Stored

| Layout    | Physical Storage                                      |
|-----------|-------------------------------------------------------|
| Row-based | 101, Sally, Jones; 102, Pete, Smith; 103, Julie, Roy; |
| Columnar  | 101, 102, 103; Sally, Pete, Julie; Jones, Smith, Roy; |
In columnar data storage, each data block holds almost three times as many record values as row-based storage does. And because each column holds values of the same data type, a compression scheme complementing columnar data storage can be applied to it. Processing the same number of records therefore takes only about one-third the I/O, and much less disk space, than row-based storage would require. Increased speed and reduced data storage are the two main benefits of databases using columnar data storage. This type of database architecture is often implemented alongside complementary architectures and technologies like MPP and data compression, and is another reason behind Redshift's rapid query performance.
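A toy example makes the I/O difference concrete. This is an in-memory illustration, not Redshift's on-disk format: the table contents come from Table A above, and the point is that an aggregate over one column touches only that column's values, while a row store must scan every field of every row.

```python
# Row layout vs. columnar layout for the three records in Table A.
rows = [
    (101, "Sally", "Jones"),
    (102, "Pete", "Smith"),
    (103, "Julie", "Roy"),
]

# Row-based layout: values interleaved, record by record.
row_store = [value for record in rows for value in record]

# Columnar layout: each column stored contiguously.
column_store = {
    "id":         [r[0] for r in rows],
    "first_name": [r[1] for r in rows],
    "last_name":  [r[2] for r in rows],
}

# Counting distinct last names reads 3 values from the column store;
# the row store would have scanned all 9.
print(len(set(column_store["last_name"])))  # 3
```

A query that needs only `last_name` never touches the `id` or `first_name` values at all, which is exactly why analytic aggregates are so much cheaper on columnar storage.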
# 3. Data Compression
Like columnar data storage, data compression helps to further reduce disk I/O and storage requirements. The goal of data compression is to reduce the number of bits needed to store data, either by removing statistical redundancies or by eliminating repeating, redundant pieces of data. Since the compressed data takes up less space, disk I/O becomes much more efficient. When a query is run against a Redshift data warehouse, the compressed data is read into memory; because it is a reduced version of the actual data, the system can allocate the minimum amount of resources to this step and focus them where they are needed most. During the execution of the query, the data is uncompressed in memory. Since the whole process has become more efficient, the system can concentrate more resources on the actual execution of the query, producing the resultset with greater efficiency and speed.
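A quick sketch shows why same-typed column data compresses so well. Here `zlib` stands in for Redshift's column encodings (which are different, per-column algorithms); the column of repeated status values is an invented example, chosen because low-cardinality columns are the best case for compression.

```python
# Repetitive column data shrinks dramatically under compression.
import zlib

# A "column" of 10,000 status values drawn from a tiny set.
column = ("ACTIVE," * 9000 + "CLOSED," * 1000).encode()

compressed = zlib.compress(column)
ratio = len(column) / len(compressed)
print(f"{len(column)} bytes -> {len(compressed)} bytes "
      f"(~{ratio:.0f}x smaller)")
```

Because every value in the column shares one data type and a small value domain, the compressor finds long runs of redundancy, and that is the disk I/O a columnar warehouse never has to perform.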
# 4. Query Optimizer
A query optimizer is a mechanism used in many database management systems. It analyzes a query and generates one or more query execution plans. Depending on resources, connections, possible bottlenecks, and a number of other factors, the query optimizer selects the most efficient query plan. This ensures the best possible query performance at the time of execution. The Redshift query optimization engine is MPP-aware and tuned to work best with columnar data storage. As with the other technologies and architectures that make up the five pillars of Redshift’s performance and speed, the query optimizer is calibrated to work in unison with its coexisting components.
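The core idea of cost-based optimization can be reduced to a few lines. This is a highly simplified, hypothetical sketch: the real Redshift optimizer weighs table statistics, data distribution, join order, and more, but at bottom it estimates a cost for each candidate plan and executes the cheapest. The plan names and cost figures below are invented for illustration.

```python
# A toy cost-based plan chooser: pick the candidate with the lowest
# estimated cost, as a query optimizer does among execution plans.
def choose_plan(plans):
    # plans: list of (description, estimated_cost) candidates
    # generated for one query.
    return min(plans, key=lambda plan: plan[1])

candidates = [
    ("hash join, scan orders first", 1200.0),
    ("merge join on sorted keys",     450.0),
    ("nested loop join",            98000.0),
]
print(choose_plan(candidates)[0])  # merge join on sorted keys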
# 5. Compiled Code
Generally, there are two different kinds of programming languages. Compiled languages and interpreted languages attempt to achieve the same goals, but usually with very different methods and performance results.

**Compiled:** When a developer writes code and runs it through software known as a "compiler," the result is an executable machine-language file. This file can be run independently, without the help of any supporting software programs other than the OS, and it can be installed and run on other machines with the same architecture.

**Interpreted:** Code written in an interpreted language requires a software program known as an "interpreter" to run. Each time the code is run, the assistance of the interpreter is required, which can be a burden on the system in terms of added overhead.

Both compiled and interpreted languages have their advantages. However, when it comes to speed, there is no doubt that compiled code is much faster, because there is no overhead from an additional program like an interpreter. Amazon Redshift uses compiled code, cutting out the need for an interpreter. The first time a query is run, Redshift compiles it and caches the compiled code, which is shared across all sessions of the same cluster. This reduces the execution time of queries, greatly adding to performance speed.
# Final Word
These five pillars of Amazon Redshift's exceptional query performance are making the IT world sit up and take notice. The speed at which Redshift runs queries against petabyte-scale data warehouses, compared to competing solutions, has many organizations impressed and considering a change. These five enablers not only do their individual part to speed up Redshift's query execution; they have also been engineered to fit together seamlessly, complementing one another. It is for these reasons that data warehouse managers are rethinking the technologies and solutions used in their organizations, with Amazon Redshift coming out the winner.