As a data-driven organization, your choice of data warehouse is crucial. And more likely than not, you’re considering two popular options: Google BigQuery and Amazon Redshift.
You may already be locked into a cloud platform. In that case, this is not much of a decision for you. BigQuery only works on Google Cloud, and Redshift only works with AWS.
But if you’re looking to make a change in the data warehouse department, look no further than this guide.
BigQuery and Redshift are both world-class data warehouses, but they are very different.
We’ll look at the most important criteria for choosing your data warehouse, including:
- Cloud platform support
- Business use cases
Let’s dig in!
# TL;DR: Comparing BigQuery and Redshift
| Criteria | BigQuery | Redshift |
|---|---|---|
| Cloud platform support | Google Cloud only | AWS only |
| Scalability | Virtually infinite and automatic | Traditional nodes limited in scale. New RA3 nodes match BigQuery’s scalability. |
| Performance | Competitive, but with more automated optimization | Standard bearer with improving auto-optimization |
| Security and Encryption | A la carte security and encryption | A la carte security and encryption |
| Ecosystem and Third-Party Integrations | Fast-growing ecosystem with major third-party integrations | Largest ecosystem and most third-party integrations |
| Maintenance | Limited maintenance required | Manual maintenance required with improving auto-maintenance features |
| Pricing | Pay per query; discounts and pre-paid packages available | Pay per hour; discounts and pre-paid packages available |
| BI vs. Machine Learning | Expensive for BI; strong machine learning toolset | Cost effective for BI; strong machine learning toolset |
| End User Experience | Easier to manage for small developer teams | More customization and flexibility for larger teams |
| Resources Available | Better for limited technical resources, but can get expensive quickly | Cost savings for companies with technical resources |
# Cloud Platform Support
Unlike our previous comparison of Snowflake and Redshift, cloud platform options are very straightforward here. BigQuery only works with Google Cloud, while Redshift works on AWS.
Snowflake is the only major cloud warehouse that is cloud agnostic.
# Scalability
Traditionally, Amazon Redshift scaled manually through server clusters with coupled storage and compute. Then, in 2019, Redshift introduced the RA3, a new node type that decouples the two.
However, the majority of Redshift users still use the traditional node clusters. Provisioning new nodes can take anywhere from a few hours to a few days.
Google BigQuery is built on a different style of architecture. It’s more similar to Snowflake — compute and storage are decoupled and scaling is automatic. BigQuery users never have to worry about provisioning nodes or adjusting the size of their cluster.
But Redshift has other scalability options. To handle peaks in data volume or query processing, you can use Redshift Spectrum.
Spectrum offers short-term, instantaneous scalability by running queries on data stored in low-cost AWS S3 storage. This enables short bursts of analytical processing without adding nodes to your cluster.
# Performance
At the end of the day, analytical performance is your top priority. As Massively Parallel Processing (MPP) data warehouses, both Redshift and BigQuery are orders of magnitude faster than traditional row-based databases.
But between the two, which has more horsepower?
There are very few objective, apples-to-apples benchmarks comparing Redshift and BigQuery. The most reliable is this one, conducted in 2018.
According to this benchmark test, Redshift and BigQuery are virtually identical in performance once each has the right optimizations in place. That shouldn’t be a surprise. These wouldn’t be the two most popular data warehouses if they couldn’t compete on performance.
The real advantage in performance comes down to optimization. This is where BigQuery and Redshift take two different approaches.
BigQuery automates most performance optimization for you. The main thing to pay attention to is table design. Choose partition keys, clustering columns, and range partitioning so that each query scans only the data it needs.
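To make the table design advice concrete, here is a minimal sketch that builds a BigQuery `CREATE TABLE` statement with a partition key and clustering columns. The helper `build_table_ddl`, the table `analytics.events`, and its columns are hypothetical names made up for this illustration, and the sketch only covers daily partitioning on a timestamp column.

```python
from typing import Sequence, Tuple


def build_table_ddl(
    table: str,
    columns: Sequence[Tuple[str, str]],
    partition_column: str,
    cluster_columns: Sequence[str] = (),
) -> str:
    """Build a BigQuery CREATE TABLE statement with partitioning and clustering."""
    if len(cluster_columns) > 4:
        # BigQuery allows at most four clustering columns per table.
        raise ValueError("at most 4 clustering columns are allowed")

    column_defs = ",\n  ".join(f"{name} {sql_type}" for name, sql_type in columns)
    ddl = f"CREATE TABLE {table} (\n  {column_defs}\n)\n"
    # Partition by calendar day of a TIMESTAMP column, so a date filter
    # lets BigQuery skip partitions outside the requested range.
    ddl += f"PARTITION BY DATE({partition_column})"
    if cluster_columns:
        # Clustering co-locates rows with similar values in these columns.
        ddl += "\nCLUSTER BY " + ", ".join(cluster_columns)
    return ddl + ";"


# Hypothetical table and column names, for illustration only.
ddl = build_table_ddl(
    table="analytics.events",
    columns=[
        ("event_ts", "TIMESTAMP"),
        ("customer_id", "STRING"),
        ("amount", "NUMERIC"),
    ],
    partition_column="event_ts",
    cluster_columns=["customer_id"],
)
print(ddl)
```

A query that filters on `event_ts` and `customer_id` against a table like this should read far less data than a full scan, which matters for both speed and cost.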
Redshift requires more manual optimization. This is both a positive and negative.
On the positive side, a data warehouse architect can optimize Redshift to perfectly suit their needs. The negative is that you need developer resources to do this.
# Security and Encryption
Data warehouse security is a broad topic, so we won’t get into all of it here. The key thing to understand is that security is generally determined by the cloud platform, in this case Google Cloud and AWS. Both offer best-in-class security features for users of all sizes.
For data encryption, both BigQuery and Redshift rely on their cloud platform’s key management service (Cloud KMS on Google Cloud, AWS KMS on AWS). This is different from Snowflake, which handles encryption within its own service and offers varying levels of encryption based on your subscription.
# Ecosystem/3rd-party integrations
AWS still reigns supreme as the most comprehensive cloud ecosystem on the planet. The AWS Marketplace has almost 10,000 integrations from over 500 partner vendors. Google Cloud Marketplace has about 4,000 integrations.
However, virtually every major big data product integrates with both, from Kubernetes to MongoDB to Salesforce.
# Maintenance
The most common complaint about Amazon Redshift is the need for manual maintenance.
As you import data and run queries, Redshift gets bogged down with unused tables and rows. This buildup can eventually lead to performance issues.
To keep your Redshift cluster healthy, you traditionally had to schedule VACUUM SORT ONLY, VACUUM DELETE ONLY, and ANALYZE commands to run periodically. As of 2019, Redshift runs these maintenance commands automatically in most cases.
However, if your dataset is too large for the VACUUM sort operation, you will need to either resize your cluster or live with unsorted data, neither of which is ideal.
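If you still need to run maintenance by hand, a small helper can generate the statements for a given table. This is only a sketch: `maintenance_statements` and the table name `public.orders` are hypothetical, and actually executing the statements requires your own database connection. Redshift does not allow VACUUM inside a transaction block, so that connection would typically need autocommit enabled.

```python
import re
from typing import List

# Accept only plain "table" or "schema.table" names so the generated SQL
# cannot be altered by unexpected input.
_TABLE_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?$")


def maintenance_statements(table: str) -> List[str]:
    """Return the Redshift maintenance statements for one table, in run order."""
    if not _TABLE_NAME.match(table):
        raise ValueError(f"unexpected table name: {table!r}")
    return [
        f"VACUUM DELETE ONLY {table};",  # reclaim space from deleted rows
        f"VACUUM SORT ONLY {table};",    # re-sort rows by the table's sort key
        f"ANALYZE {table};",             # refresh query planner statistics
    ]


# Hypothetical table name, for illustration only.
for statement in maintenance_statements("public.orders"):
    print(statement)
```

Generating the statements separately from running them makes it easy to log or review what a scheduled job is about to do.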
BigQuery, on the other hand, automates the vast majority of warehouse maintenance. As you’ll see below, the most important thing you’ll need to manage is your computing costs.
# Business Considerations for Redshift and BigQuery
From a technical perspective, Redshift and BigQuery are very similar. The biggest differences lie in maintenance, customization, and third-party integrations.
But technical details aren’t the only factors that go into choosing a data warehouse. Below, we’ll look at several business considerations for each tool:
- Pricing
- Analytics vs. Machine learning
- End user experience
- Required developer resources
## Pricing
Redshift’s pricing model is straightforward and extremely customizable. Pricing is on-demand and based on usage per hour. Add-on services, such as Redshift Spectrum, are priced separately.
Redshift pricing is largely determined by the types of nodes you choose. Dense Compute DC2 nodes are the least expensive, while Redshift’s newest nodes — the decoupled RA3 — are the priciest.
With Redshift, you can choose the node type and the number of nodes to find the right balance of performance and price.
BigQuery’s pricing model is completely different — and slightly controversial.
Instead of paying for usage by the hour, you pay for each query, based on the amount of data it scans.
Depending on your business, the BigQuery pricing model is either a huge advantage or a potential disaster.
If you use your data warehouse for business analytics — running dozens, or even hundreds of queries per day — BigQuery costs can quickly get out of hand.
However, if you only run the occasional query — like during machine learning research — BigQuery’s pricing model would suit you well.
You can also pay upfront for BigQuery processing power via slots. In this pricing model, you pay for a fixed amount of processing power each month.
Pricing for Redshift and BigQuery also varies by region.
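A rough back-of-the-envelope calculation shows how the two pricing models diverge as query volume grows. Every number in this sketch is a made-up placeholder, not a real list price, and the function names are hypothetical. Plug in current prices for your region before drawing any conclusions.

```python
HOURS_PER_MONTH = 730  # average number of hours in a month


def hourly_monthly_cost(nodes: int, price_per_node_hour: float) -> float:
    """Monthly cost of a cluster that is billed per node-hour and always on."""
    return nodes * price_per_node_hour * HOURS_PER_MONTH


def per_query_monthly_cost(
    queries_per_day: int,
    tb_scanned_per_query: float,
    price_per_tb: float,
    days: int = 30,
) -> float:
    """Monthly cost when each query is billed by the data it scans."""
    return queries_per_day * days * tb_scanned_per_query * price_per_tb


def break_even_queries_per_day(
    hourly_cost: float,
    tb_scanned_per_query: float,
    price_per_tb: float,
    days: int = 30,
) -> float:
    """Daily query count at which per-query billing matches the hourly bill."""
    return hourly_cost / (days * tb_scanned_per_query * price_per_tb)


# Placeholder prices (assumptions for illustration, not vendor quotes).
cluster = hourly_monthly_cost(nodes=2, price_per_node_hour=0.25)
light = per_query_monthly_cost(10, tb_scanned_per_query=0.05, price_per_tb=5.0)
heavy = per_query_monthly_cost(200, tb_scanned_per_query=0.05, price_per_tb=5.0)
threshold = break_even_queries_per_day(cluster, 0.05, 5.0)

print(f"hourly cluster: {cluster:.2f} per month")
print(f"per-query, 10 queries/day: {light:.2f} per month")
print(f"per-query, 200 queries/day: {heavy:.2f} per month")
print(f"break-even: about {threshold:.0f} queries per day")
```

With these placeholder numbers, the occasional-query workload is far cheaper on per-query billing, while the dashboard-style workload costs several times more than the always-on cluster. The break-even point shifts with how much data each query scans, which is one more reason partitioning and clustering matter.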
## Business Intelligence vs. Machine Learning
Business Intelligence and Machine learning are two of the most common use cases for data warehouses. From a pricing perspective, Redshift has a clear upper hand in the BI department. Its usage-per-hour pricing model caters to query-heavy users.
For example, if you’re like 99Designs and need a real-time business analytics dashboard, then Redshift is the warehouse for you.
For the machine learning use case, BigQuery makes a much stronger case. Google Cloud’s ML Engine and Amazon SageMaker go head-to-head in machine learning and data science applications.
Here’s a great comparison of ML Engine and SageMaker from Ashish Awasthi on Quora:
“Both, AWS and Google-cloud, provide following machine learning services, for the use-case ‘training custom models with your own data’:
1. Jupyter notebook, with backend running on a cloud VM, that has pre-installed machine learning frameworks and cloud services clients.
2. A hosted model training and hyper-parameter optimization service, to which training jobs can be submitted remotely. The jobs can be monitored on a cloud provided UI.
3. A model repository and scalable model hosting service for inference.
4. Built-in algorithms (model-architectures, parameter-tuning and data-augmentation) to train your models, with just data.
There are some differences though.
- For hosted training (use-case 2) in the list above, ML engine supports only Tensorflow, scikit-learn, PyTorch and XGBoost frameworks. On SageMaker, you can use MXNet, Chainer and SparkML too.
- In ML engine, you can do data aggregations as part of data pre-processing within the model execution code, with ‘tf.data’ (which internally uses server-less Dataflow framework). In SageMaker, an external Spark job will be required to do that.
- You get new versions of Tensorflow on ML engine weeks before you get them on SageMaker.
- In case you plan to use automatic hyper parameter optimization, it works better on ML engine, in terms of results produced and time taken.
- If you train your model using built-in algos of SageMaker, you cannot deploy it outside SageMaker. Which is also the case for Google’s AutoML, though all models trained on ML-engine (including those using Google’s tensorflow-hub modules) can be deployed anywhere.
If you want to know what are the corresponding cloud services in AWS, Google Cloud and Azure, check this list by use-cases.”
It’s important to note that ML Engine and SageMaker are priced separately from BigQuery and Redshift.
## End User Experience
Depending on the size and structure of your organization, you may have different end users who have different needs.
For example, smaller organizations may have one full-stack developer responsible for all data architecture and business analytics. In cases like this, the end user may prefer the streamlined experience of BigQuery.
However, Redshift provides flexibility and customization for organizations with dedicated DevOps teams. You can easily integrate Redshift with your favorite BI tool to make analytics simple for business users, too.
## Available Developer Resources
Similarly, organizations with scarce developer resources may prefer BigQuery due to its automated maintenance and scalability. But don’t forget to balance usability with pricing, which can quickly get out of hand with BigQuery if you run a lot of queries.
While Redshift requires skilled engineering talent to build and maintain, much of this work can be automated by 3rd-party tools like FlyData for ETL. FlyData is the easiest way to replicate data to Redshift, so you can spend more time building your business.
# The Final Verdict: Redshift vs. BigQuery
Redshift and BigQuery are very different data warehouses, each with pros and cons.
Where Redshift offers customization and flexibility, BigQuery gives you automation and effortless scalability.
While you pay per hour on Redshift, you pay per query on BigQuery. One is not better than the other — it just depends on your use cases.
The bottom line is that both Redshift and BigQuery are world-class data warehouses. Redshift remains the standard-bearer in the space, but BigQuery brings unique advantages to its users.
No matter what data warehouse you choose, our hope is that this guide helped illuminate your way.