What is a Redshift cluster? How can I create one? The core component of AWS's cloud data warehouse Redshift is the Redshift cluster. If you're not familiar with MPP Databases or how Redshift compares to traditional data warehouses, we recommend you read our guides on those before continuing. In this post, we’ll show you how to create your own Redshift database in the AWS console, and when you should think about resizing the number of nodes in your cluster.
# How to Create a Redshift Cluster
## Step 1 - Intro to the AWS Redshift Console
Create an AWS account or sign in to your Amazon console. In the upper right-hand corner, select the region you want to create the cluster in. We have a whole guide on how Amazon’s regions affect Redshift pricing and how you can select the region that is best for you here.
In the “AWS Services” box, type “Redshift”, and click on it when it comes up.
This page will be your home base for managing your Redshift instances, so let's examine it for a minute:
- Left sidebar
  - Clusters - Existing clusters that you’ve already created
  - Snapshots - Backups of your Redshift cluster at a specific time
  - Security - Manage access to your Redshift cluster
  - Parameter Groups - Configure settings such as query timeout
  - Workload Management - Define queues for your queries so your most important queries are prioritized
  - Reserved Nodes - Save $$ by pre-purchasing a reserved node on a 1-year or 3-year contract.
  - Events - A log of important events related to your Redshift instances that AWS keeps track of, such as creating, updating, and deleting clusters.
  - Connect Client - Use this to connect to your Redshift cluster using third-party SQL client tools. You’ll need this if you want to use a BI tool like Looker or Tableau on top of your Redshift instance.
- Center section
  - Launch Cluster - This is where you can launch new clusters. Yay! We’ll be doing that in a minute.
  - Resources - All the same stuff from the sidebar on the left but with some helpful (?) numbers in parentheses.
  - Service Health - Is everything in AWS-land looking good?
- Right sidebar (mostly ignore)
  - Recommended resources such as AWS’s Getting Started Guide
  - Ancillary services to Redshift that AWS recommends via their marketplace
## Step 2 - Create a Security Group
Before we actually create our cluster, we’ll set up our Security Group so that we can control access to our Redshift cluster. Click on the link labeled “Security” in the sidebar on the left. Then click the big blue button that says “Create Cluster Security Group”.
For demo purposes, we will be naming the group “demo-redshift” and the description will simply be “demo”.
In the “Status” column, you will notice that your newly created security group is not yet authorized. This means that the security group doesn’t yet contain any rules about who can access the cluster. To solve this, click on the name of the Security Group, then click on “Add Connection Type”.
In the modal, select “CIDR/IP” and enter the IP addresses to authorize. In this demo, we simply want to give our own computer access to this cluster, so we will just be pasting in the IP address of our own computer.
Now that we have authorized our security group, we can proceed to create a cluster.
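If you would rather script this step than click through the console, the same two actions can be sketched in Python. This is a minimal sketch, not the method used in this post: it assumes you pass in a boto3 Redshift client (created with `boto3.client("redshift")`), and the helper names `to_cidr`, `create_demo_group`, and `authorize_ip` are our own. The address `203.0.113.7` is a documentation placeholder, not a real machine.

```python
import ipaddress


def to_cidr(ip_or_cidr: str) -> str:
    """Normalize user input to CIDR notation.

    A bare IPv4 address such as "203.0.113.7" becomes the single-host
    block "203.0.113.7/32". Malformed input raises ValueError.
    """
    return str(ipaddress.ip_network(ip_or_cidr, strict=True))


def create_demo_group(client, name: str = "demo-redshift", description: str = "demo") -> None:
    """Create the cluster security group (same name and description as the demo)."""
    client.create_cluster_security_group(
        ClusterSecurityGroupName=name,
        Description=description,
    )


def authorize_ip(client, group_name: str, ip_or_cidr: str) -> str:
    """Authorize one CIDR/IP range on the group and return the CIDR used."""
    cidr = to_cidr(ip_or_cidr)
    client.authorize_cluster_security_group_ingress(
        ClusterSecurityGroupName=group_name,
        CIDRIP=cidr,
    )
    return cidr


# Usage (assumes boto3 is installed and AWS credentials are configured):
#   import boto3
#   client = boto3.client("redshift")
#   create_demo_group(client)
#   authorize_ip(client, "demo-redshift", "203.0.113.7")
```

Validating the address locally first means a typo fails on your machine before any request is sent to AWS.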
## Step 3 - Create a Redshift Cluster
Go back to the “Clusters” section on the left side of your page and hit the button titled “Launch Cluster”. Fill out the form screenshotted below. There’s a nice overview of each input on the right-hand side:
After filling out the form, click “Continue”.
In the “Node Configuration” section, you will see a drop-down menu titled “Node Type”. The node types differ in storage type, memory, and CPU. For a brief overview of node types and their associated costs, click here. For our tutorial, we’ll be creating a single-node cluster with Redshift’s dc2.large node, which is eligible for Redshift’s Free Trial. Fill out the form like so:
The next page is where we can specify more advanced options. Let’s take it step by step.
- Cluster Parameter Group - We will be using “default.redshift-1.0” as our cluster parameter group. Cluster parameter groups set configurations such as query timeouts. For more information on managing parameter groups, check out this AWS guide.
- Encrypt Database - Unless you know what you’re doing, leave it set to “none” for now. You can read more about encrypting your Redshift database here.
- Choose a VPC - VPC stands for Virtual Private Cloud and works as your own virtual network within AWS. A Redshift database must be created within a VPC. Select your default VPC from the dropdown.
- Publicly Accessible - Yes. You want to be able to access your Redshift instance.
- Choose a Public IP Address - No. Not needed for most cases, and a bit of a hassle to set up.
- Choose a VPC Security Group - Remember the security group we made back in Step 2 so that we could access Redshift from our local computer? Select that security group (we named it demo-redshift).
- Create CloudWatch Alarm - Amazon CloudWatch is AWS’s monitoring service that you can use to trigger alerts based on your Redshift instance’s performance metrics. For now, we’ll leave it as is. You can always modify this later.
- IAM roles - You can give your Redshift instance access to other AWS services through IAM roles. For now, we’ll leave this blank.
Click “Continue”, and then click “Launch cluster”.
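For reference, the choices made in the form and advanced options above map roughly onto the parameters of boto3's `create_cluster` call. The sketch below only illustrates that mapping under our own assumptions: the cluster identifier, database name, user name, password, and security group ID are hypothetical placeholders (the original form values are not reproduced here), and the helper names are ours.

```python
def build_cluster_params(cluster_id, db_name, master_user, master_password,
                         vpc_security_group_ids):
    """Collect the tutorial's console choices as create_cluster keyword arguments."""
    return {
        "ClusterIdentifier": cluster_id,
        "DBName": db_name,
        "MasterUsername": master_user,
        "MasterUserPassword": master_password,
        "NodeType": "dc2.large",           # node type chosen in the tutorial
        "ClusterType": "single-node",      # one node only
        "ClusterParameterGroupName": "default.redshift-1.0",
        "Encrypted": False,                # "Encrypt Database" left at none
        "PubliclyAccessible": True,        # "Publicly Accessible" set to Yes
        "VpcSecurityGroupIds": list(vpc_security_group_ids),
    }


def launch_cluster(client, params):
    """Send the request and return the initial status reported by AWS."""
    response = client.create_cluster(**params)
    return response["Cluster"]["ClusterStatus"]


# Usage (assumes boto3 is installed and AWS credentials are configured;
# every value below is a placeholder, not a value from this post):
#   import boto3
#   client = boto3.client("redshift")
#   params = build_cluster_params(
#       "demo-cluster", "dev", "awsuser", "<your-password>",
#       ["sg-0123456789abcdef0"],
#   )
#   launch_cluster(client, params)
```

Keeping the parameters in a plain dictionary makes it easy to review them, or keep them in version control, before anything is launched.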
Congratulations! Your Redshift cluster is now being created. If you navigate back to your list of clusters by clicking “Clusters” in the left sidebar, you’ll see the column for “Cluster Status”. When this column is filled with a green “Available”, you’ll be ready to go.
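If you are scripting the launch, you can check the same status without refreshing the console. The following is a minimal polling sketch, assuming a boto3 Redshift client is passed in; the function name, the 30-second interval, and the poll limit are our own choices. boto3 also ships a built-in `cluster_available` waiter that you may prefer in practice.

```python
import time


def wait_until_available(client, cluster_id, poll_seconds=30, max_polls=60,
                         sleep=time.sleep):
    """Poll describe_clusters until the cluster reports "available".

    Returns True once the cluster is available, or False if it is still
    not available after max_polls checks.
    """
    for _ in range(max_polls):
        response = client.describe_clusters(ClusterIdentifier=cluster_id)
        status = response["Clusters"][0]["ClusterStatus"]
        if status == "available":
            return True
        sleep(poll_seconds)  # wait before asking again
    return False


# Usage (assumes boto3 is installed and AWS credentials are configured;
# "demo-cluster" is a placeholder identifier):
#   import boto3
#   client = boto3.client("redshift")
#   wait_until_available(client, "demo-cluster")
```

The `sleep` argument is injectable only so the loop can be exercised in tests without actually waiting.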
# Resizing Your Redshift Cluster
At various times, it may be beneficial to resize your Redshift cluster to either save money by scaling down or improve performance by scaling up. In this next section, we’ll show you why resizing your cluster to fit your needs and dataset is important, what you should look for so you know when you need to do so, how to properly resize your cluster within the AWS Console, and what is happening “under the hood” to your Redshift database.
Because Redshift is an MPP database, performance rapidly improves as we scale up the number of nodes in our cluster. Moving from 1 node to 2 nodes for a 1.2TB dataset had the following impact for load and query time:
As we can see from the test, adding a second node to your cluster can cut load time and query time by roughly 50%.
According to our friends over at TechTarget, CPU utilization is the most important performance metric to monitor for determining if you need to resize your cluster. If CPU utilization is consistently above 80%, consider adding some nodes to bring it down. Likewise, if your CPU utilization is consistently below 40%, you’ll probably be able to save some money without sacrificing performance by scaling down your nodes. To see a full breakdown of how node types & node quantity affect the cost of your Redshift instance, see our guide on Redshift’s pricing structure.
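The 80% and 40% rule of thumb is easy to turn into a small check. This sketch is only an illustration: it assumes you already have a list of CPU utilization percentages (for example, exported from CloudWatch), and it interprets "consistently" as "every sample in the window", which is our assumption rather than a definition from TechTarget or AWS.

```python
def recommend_resize(cpu_samples, high=80.0, low=40.0):
    """Suggest a resize direction from CPU utilization samples (percentages).

    "Consistently" is interpreted as: every sample in the window is beyond
    the threshold. That interpretation is an assumption of this sketch.
    """
    if not cpu_samples:
        raise ValueError("need at least one CPU utilization sample")
    if min(cpu_samples) > high:
        return "scale up"    # always busy: consider adding nodes
    if max(cpu_samples) < low:
        return "scale down"  # always idle: consider removing nodes
    return "keep"            # mixed load: no clear signal


print(recommend_resize([85.0, 91.5, 88.2]))  # scale up
print(recommend_resize([12.0, 35.1, 28.4]))  # scale down
print(recommend_resize([30.0, 85.0, 55.0]))  # keep
```

A stricter or looser reading of "consistently" (say, 95% of samples) would only change the two comparisons.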
Resizing your Redshift cluster is pretty simple and AWS has a great guide on it, so we won’t spend too long on it. From your list of clusters, click on the name of the cluster you want to edit. On the next page, underneath the name of your cluster in big letters, there will be a grey button that says “Cluster”. Click that button, and then click “Resize Cluster”. A modal window will pop open that will allow you to edit your node types and the number of nodes in your cluster. Click “Resize” and you’re on your way.
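The same resize can be requested from a script through boto3's `resize_cluster` call. This is a hedged sketch rather than the console's exact behavior: the helper names and the `demo-cluster` identifier are hypothetical, and we assume `Classic=True` (a classic resize) because that is the mode in which the cluster becomes read-only while the resize runs.

```python
def build_resize_params(cluster_id, node_type, number_of_nodes):
    """Build keyword arguments for a classic resize request."""
    if number_of_nodes < 1:
        raise ValueError("a cluster needs at least one node")
    params = {
        "ClusterIdentifier": cluster_id,
        "NodeType": node_type,
        "Classic": True,  # assumption: classic resize, not elastic resize
    }
    if number_of_nodes == 1:
        params["ClusterType"] = "single-node"
    else:
        params["ClusterType"] = "multi-node"
        params["NumberOfNodes"] = number_of_nodes
    return params


def resize(client, cluster_id, node_type, number_of_nodes):
    """Send the resize request and return the parameters that were used."""
    params = build_resize_params(cluster_id, node_type, number_of_nodes)
    client.resize_cluster(**params)
    return params


# Usage (assumes boto3 is installed and AWS credentials are configured):
#   import boto3
#   client = boto3.client("redshift")
#   resize(client, "demo-cluster", "dc2.large", 2)  # grow from 1 node to 2
```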
As soon as you click “Resize”, your cluster will switch to read-only mode until the resize is complete. Any running queries and open database connections will be terminated both when the resize operation begins and when it ends. It will also be impossible to write data to your Redshift cluster during the resize - an important thing to know if you’re using a real-time sync service like FlyData.
Once the resize operation is complete, to ensure your new cluster is performing optimally, you’ll want to follow our short guide on optimizing your Redshift instance.
# What is FlyData?
FlyData provides continuous, near real-time replication into Redshift from your transactional databases, such as MySQL, PostgreSQL, Amazon Aurora, and more. With an easy, one-time setup, our robust system ensures 100% accuracy with each load. Your data is always up to date. For questions about FlyData and a no-pressure, risk-free assessment on how we can possibly help make your journey on to Amazon Redshift smoother and easier, connect with us at firstname.lastname@example.org.