Quick Review of Amazon Machine Learning Using Amazon Redshift as a Data Source

# Amazon Machine Learning Announced

Yesterday, AWS announced Amazon Machine Learning, which is set to vastly expand the number of companies that can perform machine learning on their data. The main benefit of Amazon Machine Learning is that we can use AWS’s internal algorithms at low cost. With this service, data scientists should be able to save time by not writing complex algorithms on their own. The service builds prediction models from your data, so as long as you have basic skills in statistical methods and can read charts, you can benefit from these algorithms. AWS says that Amazon Machine Learning is based on ML technology it has been using internally, so we expect it to scale well as the amount of data grows. Because it is built on this scalable platform, it looked like a very attractive offering, so we decided to test it with our sample data stored on Amazon Redshift.

# Using Amazon Machine Learning

Amazon Machine Learning currently supports S3 or Amazon Redshift as the data source (the website mentions MySQL on RDS, but it does not appear to be supported yet). To try it out, we took some aggregated usage data from our FlyData Sync service. We have written down the steps we took so you can follow along (using your own data) if you’d like. Let’s see what we can get back from Amazon Machine Learning.

# Getting started

Here is the main page of Amazon Machine Learning (Fig. 1). After clicking “Get started”, the next screen showed two setup options, so we selected the “Standard setup” and clicked the “Launch” button. Then, through the setup wizard (Fig. 2) we selected Redshift as the data source.

(Fig. 1)

(Fig. 2)

As you see, there are several details to fill in. Setting an IAM role was the first thing required for Amazon Machine Learning to connect to Redshift and to a staging S3 bucket, so we started there.

# Creating an IAM Role

Creating IAM roles can feel tricky at first, but fortunately AWS made this quite simple for Amazon Machine Learning. The simplest way is to use the IAM role template Amazon prepared, which you can reach in a few clicks. Here are the steps. First, go to Amazon IAM and select “Role” in the right menu, then click “Create New Role”. After entering a new role name, you will find the preset role “Amazon Machine Learning Role for Redshift Data Source”. Click “Select” and go through the rest of the wizard to create the role.
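
For readers curious about what such a role involves, here is a minimal sketch of the trust policy part, written as a Python dictionary. It assumes the service principal `machinelearning.amazonaws.com`. We did not inspect what the preset role actually generates, so treat this as an illustration rather than the exact document the wizard creates. The permissions for Redshift and the staging bucket are attached separately and are not shown.

```python
import json

# Trust policy sketch: lets the Amazon Machine Learning service assume the role.
# Assumption: "machinelearning.amazonaws.com" is the service principal.
# The preset role in the IAM wizard sets this up for you.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "machinelearning.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# Print the JSON form, which is what the IAM console displays.
print(json.dumps(trust_policy, indent=2))
```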

# Creating an S3 Bucket

After creating the IAM role, we had to create an S3 bucket. Amazon Machine Learning stages data from Redshift in this bucket before creating its datasource object. Simply create an S3 bucket, copy its URL, and enter that URL in the Amazon Machine Learning form.

You also need to prepare a SQL query, which is used to extract your Redshift data. Amazon Machine Learning only reads from a flat file stored in S3, so if you are analyzing data across multiple tables, you need to write a SQL query that properly joins them.

Once this preparation is done, click “Verify” to start checking the data from Redshift. The next screen shows the schema that the service detected automatically from your SQL and Redshift data. You can modify the data types as necessary. Next, you select the target of the analysis. For example, if you choose a numeric target, the analysis will be numerical regression.
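
To make the inputs concrete, the sketch below gathers the same details the form asks for (cluster, database, credentials, SQL query, staging bucket) into one dictionary. The key names follow our understanding of the `DataSpec` structure in the Amazon Machine Learning API for Redshift datasources, so please verify them against the API reference before use. The table names, columns, cluster identifier, and bucket are all hypothetical.

```python
# Hypothetical join that flattens two tables into one row per account.
SELECT_SQL = """
SELECT a.account_id, a.plan, u.rows_synced, u.monthly_usage
FROM accounts a
JOIN usage_summary u ON u.account_id = a.account_id
"""


def build_redshift_data_spec(cluster_id, database, username, password,
                             select_sql, staging_s3_uri):
    """Collect the Redshift datasource settings in one dictionary.

    Key names are an assumption based on the API's DataSpec structure.
    """
    return {
        "DatabaseInformation": {
            "ClusterIdentifier": cluster_id,
            "DatabaseName": database,
        },
        "DatabaseCredentials": {"Username": username, "Password": password},
        "SelectSqlQuery": select_sql.strip(),
        "S3StagingLocation": staging_s3_uri,
    }


spec = build_redshift_data_spec(
    cluster_id="example-cluster",          # hypothetical
    database="analytics",                  # hypothetical
    username="ml_reader",                  # hypothetical
    password="replace-me",                 # never hard-code real credentials
    select_sql=SELECT_SQL,
    staging_s3_uri="s3://example-ml-staging/redshift/",  # hypothetical
)
print(sorted(spec))
```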

On this target page, you choose the target value (Y value) for your data. Once you select the target, Amazon Machine Learning starts creating a datasource object and presents basic and some more advanced statistics on the data. This is one of the results we got.
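
The relationship between the target's data type and the kind of model can be sketched as a small lookup. Only the numeric case was part of our test. The binary and categorical rows reflect our understanding of the model types the service offers, and the helper name is our own.

```python
# Target data type -> model type, as we understand the service's behavior.
MODEL_TYPE_BY_TARGET = {
    "NUMERIC": "REGRESSION",      # the case we tried in this article
    "BINARY": "BINARY",           # two-class classification
    "CATEGORICAL": "MULTICLASS",  # multi-class classification
}


def model_type_for(target_data_type):
    """Return the model type implied by the target's data type."""
    key = target_data_type.upper()
    if key not in MODEL_TYPE_BY_TARGET:
        raise ValueError(f"unsupported target data type: {target_data_type}")
    return MODEL_TYPE_BY_TARGET[key]


print(model_type_for("numeric"))
```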

ML5

# Creating the Model

The next step is to create the model. To do so, select “Create (train) ML model” on the datasource details page. If you use the default settings for your prediction model, it uses 70% of the data for training and the remaining 30% to evaluate the model.
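
As an illustration of the 70/30 split, the sketch below divides a list of rows sequentially, so the first 70% go to training. We assume the default split is sequential rather than random. The two JSON strings show how the same ranges can be written in the service's `DataRearrangement` format, as we understand it.

```python
import json


def split_rows(rows, train_percent=70):
    """Split rows sequentially: the first train_percent% train, the rest evaluate."""
    cut = len(rows) * train_percent // 100
    return rows[:cut], rows[cut:]


# The same ranges expressed as DataRearrangement strings (assumed format).
train_rearrangement = json.dumps(
    {"splitting": {"percentBegin": 0, "percentEnd": 70}}
)
eval_rearrangement = json.dumps(
    {"splitting": {"percentBegin": 70, "percentEnd": 100}}
)

train, evaluation = split_rows(list(range(10)))
print(len(train), len(evaluation))  # prints "7 3"
```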

After creating the prediction model, you can check the residuals of your model.
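
If you want to reproduce this check on your own evaluation data, a residual is simply the actual value minus the predicted value, and the root mean square error (RMSE) summarizes the residuals in one number. The following is a minimal sketch with made-up values, not the service's implementation.

```python
import math


def residuals(actual, predicted):
    """Residual per row: actual minus predicted."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return [a - p for a, p in zip(actual, predicted)]


def rmse(actual, predicted):
    """Root mean square error over all rows."""
    errors = residuals(actual, predicted)
    if not errors:
        raise ValueError("need at least one row")
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Made-up values for illustration only.
print(residuals([3.0, 5.0], [2.0, 7.0]))  # prints "[1.0, -2.0]"
print(rmse([3.0, 5.0], [2.0, 7.0]))
```

A positive residual means the model underestimated the target, and a negative one means it overestimated.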

# Performing Predictions

The last step is to make predictions using the prediction model that was created. Go to data source details and click on the “Use the datasource to” dropdown and then select “Generate batch prediction”. In the next screen, you can select the prediction model to use, the data to analyze, and the S3 bucket for storing your result. Once you click on the “Finish” button on the "Step 4. Review" tab, Amazon Machine Learning will run the algorithm on your data and will save the results to the S3 bucket you specified.

The result is only available on S3, so you will need to download the data from there.
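
Once you have downloaded a result file, a few lines of Python are enough to read it. This sketch assumes the file is a gzip-compressed CSV with a header row and a `score` column holding the predicted value, which is how we understand the output for a regression model. Check the header of your own file first. The sample bytes are made up so the example runs without any download.

```python
import csv
import gzip
import io


def read_batch_scores(gz_bytes, column="score"):
    """Read one numeric column from a gzip-compressed CSV held in memory."""
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", newline="") as handle:
        reader = csv.DictReader(handle)
        return [float(row[column]) for row in reader]


# Made-up sample standing in for a downloaded result file.
sample = gzip.compress(b"score\n1.5\n2.25\n")
print(read_batch_scores(sample))  # prints "[1.5, 2.25]"
```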

# Pricing Breakdown

The pricing was not immediately apparent to us (we couldn’t tell up front how much it would cost based on our data size), so we hope the following breakdown of the charges we incurred helps you get an idea of the costs. For our test data of about 300MB, we paid $2.25 for 10 to 15 minutes of model creation and $0.10 in prediction fees.

# Cost structure

The cost structure for Amazon Machine Learning can be broken down into two parts.

  • Data Analysis and Model Building Fees
  • Prediction Fees

Your total cost will be the sum of the two costs.

# Data Analysis and Model Building Fees

These fees are for creating your model. The factors that affect the Data Analysis and Model Building Fees are:

  • The size of the input data;
  • The number of attributes within it; and,
  • The number and types of transformations applied.

Based on these factors, the amount of computing time you need will change. This computing time is billed at $0.42/hour.
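
As a quick worked example of this rate, the fee is the number of billed compute hours multiplied by $0.42. The function name is our own, and `Decimal` is used only to keep the cents exact.

```python
from decimal import Decimal, ROUND_HALF_UP

COMPUTE_RATE_PER_HOUR = Decimal("0.42")  # rate quoted above


def model_building_fee(compute_hours):
    """Fee in dollars for a given number of billed compute hours."""
    fee = Decimal(str(compute_hours)) * COMPUTE_RATE_PER_HOUR
    return fee.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


print(model_building_fee(1))  # prints "0.42"
print(model_building_fee(5))  # prints "2.10"
```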

# Prediction Fees

These fees are for applying your model to each record in your data. You can choose between Batch and Real-time, depending on your situation. For this article, we tried the Batch option. The costs are as follows.

  • Batch

    • $0.10 per 1,000 predictions, rounded up to the next 1,000
  • Real time

    • $0.0001 per prediction, rounded up to the nearest penny
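
The rounding rules in the list above are easy to get wrong, so here is a small sketch of both calculations. It covers only the per-prediction rates quoted in this article. The function names are our own, and any other charges the service may apply are not modeled.

```python
from decimal import Decimal, ROUND_CEILING


def batch_prediction_fee(predictions):
    """$0.10 per 1,000 predictions, rounded up to the next 1,000."""
    blocks = -(-predictions // 1000)  # ceiling division
    return Decimal("0.10") * blocks


def realtime_prediction_fee(predictions):
    """$0.0001 per prediction, rounded up to the nearest penny."""
    fee = Decimal("0.0001") * predictions
    return fee.quantize(Decimal("0.01"), rounding=ROUND_CEILING)


print(batch_prediction_fee(1))       # prints "0.10"
print(batch_prediction_fee(1001))    # prints "0.20"
print(realtime_prediction_fee(101))  # prints "0.02"
```

Under these rules, any batch of up to 1,000 predictions costs $0.10.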

# Our Actual Costs

For our test data we had about 300MB of data, which cost us a total of $2.25 for 10 to 15 minutes or so of model creation and $0.10 in prediction fees. One thing to note is that the charge is based on “Instance-hours”, which can be much more than the actual waiting time you experience. Our 10 to 15 minute wait came out to more than 5 “Instance-hours”. In our case, the actual charge was small, but it is something to keep in mind.
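
The gap between waiting time and billed time can be checked with simple arithmetic: dividing the $2.25 charge by the $0.42 hourly rate gives the billed hours, which we then compare with a 15 minute wait (the upper end of what we saw).

```python
from decimal import Decimal

charge = Decimal("2.25")       # what we paid for model creation
rate = Decimal("0.42")         # dollars per billed hour
wait_hours = Decimal("0.25")   # 15 minutes, the upper end of our wait

billed_hours = charge / rate
ratio = billed_hours / wait_hours

print(f"{billed_hours:.2f} billed hours")     # prints "5.36 billed hours"
print(f"{ratio:.1f}x the time we waited")     # prints "21.4x the time we waited"
```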

Here’s AWS’s official pricing page.

# Other Tips

Currently Amazon Machine Learning is available only in the US East (N. Virginia) region (us-east-1). If your Redshift cluster is in another region, for now you need to copy a snapshot to us-east-1 and launch another cluster from it there.

Another miscellaneous tip: their pricing $0.42/hour to create models could possibly have come from here.

# Summary

Overall, trying out Amazon Machine Learning was a bit more complicated than we were hoping, but we did see how it could open up machine learning to a much wider audience. For example, setting up the IAM role, creating an S3 bucket, and thinking about the SQL join query were all a bit cumbersome. But the ease with which we could get a prediction model (without thinking too much about the algorithm) was quite impressive.

Another huge benefit we see is that the service was built with really large data sets in mind. The fact that AWS based it on their internal ML algorithms, and that it already supports Amazon Redshift as a data source, suggests that Amazon Machine Learning anticipates data sets of considerable size. This scalability, at this price, is quite noteworthy. Having said that, we only performed a simple test run, so we’re sure there are many features that we missed. The real-time predictions feature, in particular, looks very interesting, and is something that we’d love to try out. We’ll update our blog as we learn more from using Amazon Machine Learning.

# References

  • Amazon Machine Learning: http://aws.amazon.com/machine-learning/
  • Amazon Machine Learning Pricing: http://aws.amazon.com/machine-learning/pricing/
  • Basic intro on ML: http://docs.aws.amazon.com/machine-learning/latest/mlconcepts/
  • Docs on Amazon Machine Learning: http://docs.aws.amazon.com/machine-learning/latest/dg/
