Scalability of Amazon Redshift Data Loading and Query Speeds

Our previous set of slides has gotten a fair amount of attention from people interested in big data. One thing we have seen is that many people are concerned about the time it took to load all of our data onto Redshift, specifically the "17 hours for 1.2TB".

Redshift Can Scale

For our testing, we ran a single node XL instance, a multi node XL instance, and an 8XL multi node instance (it is not possible to choose a single node 8XL instance) to compare load times for 1.2TB of data and query speeds on that data. Here is what we saw in our tests.

Loading 1.2TB of data:

A single node XL instance took 17 hours
A multi node XL instance of two nodes took 10 hours
A multi node 8XL instance of two nodes took 2 hours
Load speed scales almost in proportion to the number of nodes (two XL nodes loaded about 1.7 times as fast as one)
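
To make the load figures above easier to compare, here is a minimal Python sketch that converts them into throughput and speedup relative to the single node XL run. The hours are the ones reported above; the conversion 1 TB = 1000 GB and the names `load_hours` and `load_summary` are our own assumptions for illustration.

```python
# Load times reported above, in hours, for the same 1.2TB data set.
DATA_TB = 1.2
load_hours = {
    "1 x XL": 17,
    "2 x XL": 10,
    "2 x 8XL": 2,
}

baseline = load_hours["1 x XL"]
load_summary = {}
for cluster, hours in load_hours.items():
    # Assumption: decimal units, 1 TB = 1000 GB.
    throughput_gb_per_hour = DATA_TB * 1000 / hours
    # Speedup relative to the single node XL load.
    speedup = baseline / hours
    load_summary[cluster] = (round(throughput_gb_per_hour, 1), round(speedup, 2))
    print(f"{cluster:8s} {hours:3d} h  {throughput_gb_per_hour:6.1f} GB/h  "
          f"{speedup:4.2f}x vs single XL")
```

The two node 8XL figure compares a different node type, so its speedup reflects both the larger nodes and the extra node, not node count alone.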

Running identical queries:

A single node XL instance took 155 seconds
A multi node XL instance of two nodes took 55 seconds
A multi node 8XL instance of two nodes took 31 seconds
Queries run faster when there are more nodes, but performance does not rise in a linear fashion
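
The same arithmetic can be applied to the query timings above. This short Python sketch computes the speedup over the single node XL run and checks the two node XL result against ideal linear scaling (half the single node time). The seconds come from the list above; the variable names are hypothetical.

```python
# Query times reported above, in seconds, for identical queries.
query_seconds = {
    "1 x XL": 155,
    "2 x XL": 55,
    "2 x 8XL": 31,
}

baseline = query_seconds["1 x XL"]

# Speedup of each cluster relative to the single node XL run.
speedups = {cluster: round(baseline / secs, 2)
            for cluster, secs in query_seconds.items()}

# Ideal linear scaling from one XL node to two would halve the time.
ideal_two_node_seconds = baseline / 2
beats_linear = query_seconds["2 x XL"] < ideal_two_node_seconds

for cluster, secs in query_seconds.items():
    print(f"{cluster:8s} {secs:4d} s  {speedups[cluster]:4.2f}x vs single XL")
print(f"ideal 2 x XL time: {ideal_two_node_seconds} s, "
      f"measured faster than ideal: {beats_linear}")
```

These numbers describe one set of queries on one data set, so they are a sketch of the trend and not a general scaling law.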

These results are interesting because loading speed increases with the number of nodes: a load into a cluster runs on all nodes in parallel. Queries are also faster on multiple nodes than on a single node, which shows that parallel processing pays off in this range. In fact, the two node XL cluster took well under half the time of the single node (55 seconds against 155). From this result, we can see that Redshift is probably optimized for multiple node clusters.

Additional Thoughts

We are aware that 8XL instances cannot be used in a single node cluster. This is a restriction of AWS, and it is a point we considered. It means that we need to be running a 15 node XL cluster before we can consider launching 8XL instances. Fortunately, AWS provides a way to upgrade your XL instances to 8XL instances on the fly with just a few minutes of downtime.

Next Step

These results show how well Amazon Redshift scales for both data loading and querying. More experiments are needed to determine how it scales with even more data (more than a few dozen TB). Next, we are planning to test various types of queries, including ones that manipulate text columns, as well as different types of compression.
