After using Integrate.io to load data into Amazon Redshift, you may want to extract data from your Redshift tables to Amazon S3. There are various reasons why you would want to do this, for example:

  • You want to load the data in your Redshift tables into another data store (e.g. MySQL)
  • To better manage space in your Redshift cluster, you want to unload some unused data to S3 (so that it can be loaded back into the cluster later if required)
  • You want to perform complex transformations on your data, and extracting it to S3 makes that easier
  • You want to copy data from one Redshift cluster to another

Whatever the reason, Redshift provides you with the UNLOAD SQL command to accomplish this.

How to use UNLOAD

Let's look at how to use the UNLOAD command with some examples. Before we start, let's create a sample table customers as defined below and insert a few records into it:

CREATE TABLE "customers" (
    "id" integer NOT NULL,
    "name" varchar(100),
    "email" varchar(100),
    "city" varchar(50) DEFAULT NULL,
    "state" char(2) DEFAULT NULL,
    "created_at" date DEFAULT NULL,
    PRIMARY KEY (id)
);
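To populate the table, you might insert a few sample rows like the following sketch; the values here are just made-up placeholders:

INSERT INTO customers (id, name, email, city, state, created_at) VALUES
    (1, 'Alice Smith', 'alice@example.com', 'Columbia', 'SC', '2014-01-15'),
    (2, 'Bob Jones', 'bob@example.com', 'Austin', 'TX', '2014-02-03'),
    (3, 'Carol Lee', 'carol@example.com', 'Charleston', 'SC', '2014-02-20');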

To unload all the records in this table to S3 at the location 's3://flydata-test-unload/unload-folder', we can run the following command:

UNLOAD ('select * from customers')
TO 's3://flydata-test-unload/unload-folder/customer_' credentials
'aws_access_key_id=your_access_key;aws_secret_access_key=your_secret_key';

This generates the following files:

s3://flydata-test-unload/unload-folder/customer_0000_part_00
s3://flydata-test-unload/unload-folder/customer_0001_part_00

Why did the command generate multiple files even though there were very few records in the table? The reason is that I was running UNLOAD on a single-node (dw1.xlarge) cluster that has two slices. By default, UNLOAD writes one or more files per slice. Also, as you can see, each file takes the prefix that was specified in the command (customer_).

Let's say you want to copy the data in these files back to the cluster (or to a different cluster). You can use the below COPY command to do that:

COPY customers
FROM 's3://flydata-test-unload/unload-folder/customer_' credentials
'aws_access_key_id=your_access_key;aws_secret_access_key=your_secret_key';

Things to be aware of when using UNLOAD

Below is a list of things you need to be aware of when using UNLOAD:

  • Pipe (|) is the default delimiter. You can override this using the DELIMITER AS 'delimiter_char' option
  • The S3 bucket specified in the command should be in the same region as your cluster. If they are in different regions, you will most likely see an error like the one below when trying to UNLOAD:
ERROR: S3ServiceException:The bucket you are attempting to access must be addressed
using the specified endpoint. Please send all future requests to this endpoint.
  • If there are quotes in the query, they need to be escaped. As an example, let's say we want to extract all the customers who reside in the state SC. The command should NOT be UNLOAD ('select * from customers where state = 'SC'')... It should be UNLOAD ('select * from customers where state = \'SC\'')...
  • You cannot use a LIMIT clause in the outer SELECT statement of your query. So, if you try to execute an UNLOAD command whose SELECT query looks like select * from customers limit 10, you will see an error like below:
Limit clause is not supported: select * from customers limit 10
  • To work around this limitation, you can use a nested LIMIT clause. The above query can be re-written as:
select * from customers where id in (select id from customers limit 10)
  • If the destination folder already contains previously unloaded files, UNLOAD will fail with the error: ERROR: Specified unload destination on S3 is not empty. You should either manually delete the files before unloading or specify the ALLOWOVERWRITE option
  • If the data you are unloading includes the delimiter you are using, you should use the ESCAPE option with the UNLOAD command. If you don't, subsequent COPY operations may fail or give incorrect results
  • To compress the unloaded data, you can use the GZIP option with the command (see the combined example after this list)
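To illustrate several of these options together, here is a sketch (reusing the hypothetical bucket and placeholder credentials from above) that unloads the table with a comma delimiter, escapes embedded delimiter characters, compresses the files with gzip, and overwrites any files already at the destination:

UNLOAD ('select * from customers')
TO 's3://flydata-test-unload/unload-folder/customer_' credentials
'aws_access_key_id=your_access_key;aws_secret_access_key=your_secret_key'
DELIMITER AS ','
ESCAPE
GZIP
ALLOWOVERWRITE;

When copying these files back, the COPY command would need matching DELIMITER ',', ESCAPE and GZIP options so the data is interpreted the same way it was written.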

Some More Tips

Below are a few common problems when using UNLOAD and how to deal with them.

Problem: You have limited data in your table and would like to unload it into a single S3 file rather than one file per slice.

Solution: UNLOAD provides the PARALLEL option for this. By default, PARALLEL is ON; you can turn it off with the PARALLEL OFF option. When you turn it off, Redshift writes to S3 serially and will not create multiple files according to the number of slices in the cluster. Be aware, though, that the maximum size of a data file is 6.2 GB, so if you try to UNLOAD data exceeding that limit, you will still see multiple files on S3. (A short sketch of this appears after the column-header example below.)

Problem: You would like to unload data with column headers.

Solution: There is no direct option for this. But you can tweak the query to output the column names along with the data. For the customers table, below will be the query:

UNLOAD ('
    SELECT id, name, email, city, state, created_at FROM (
        SELECT 1 as ordinal, \'id\' as id, \'name\' as name, \'email\' as email, \'city\' as city, \'state\' as state, \'created_at\' as created_at
        UNION all
        ( SELECT 2 as ordinal, CAST(id as varchar(255)) as id, name, email, city, state, CAST(created_at as varchar(255)) as created_at
        FROM customers )
    ) t ORDER BY ordinal
') TO 's3://flydata-test-unload/unload-folder/customer_' credentials
'aws_access_key_id=your_access_key;aws_secret_access_key=your_secret_key'
ESCAPE;

Please note that for any non-character column type you need to explicitly cast the column as char or varchar.
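As for the single-file tip above, here is a minimal sketch of UNLOAD with PARALLEL OFF, again using the hypothetical bucket and placeholder credentials:

UNLOAD ('select * from customers')
TO 's3://flydata-test-unload/unload-folder/customer_' credentials
'aws_access_key_id=your_access_key;aws_secret_access_key=your_secret_key'
PARALLEL OFF;

As long as the unloaded data stays under the 6.2 GB per-file limit, this writes a single output file instead of one file per slice.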

Conclusion

Though the UNLOAD command is easy to use, there are a few things you need to take care of, as explained in this article. We suggest that you first try the command on sample data and make sure that all your options are set right before extracting a large amount of data from Redshift. UNLOAD also supports a few other options that we have not covered here; you can find them in the AWS documentation.

How Integrate.io Can Help

Integrate.io provides continuous, near real-time replication from RDS, MySQL, and PostgreSQL databases to Amazon Redshift. The Integrate.io Sync tool is an intuitive, powerful, cost-effective way to integrate your transactional databases and analytical data stores in a single interface with no manual scripting.

You can start a 14-day Free Trial and begin syncing your data within minutes, and we are happy to answer any questions about Integrate.io and how we can help accelerate your use case and journey on Amazon Redshift.