Batch Transform Using R with Amazon SageMaker

Read before running this notebook:

  • This sample notebook has been updated for SageMaker SDK v2.0.

  • If you are using SageMaker Notebook instances, select the R kernel for the notebook. If you are using SageMaker Studio notebooks, you will need to create a custom R kernel for your Studio domain. Follow the instructions in this blog post to create and attach a custom R kernel.

Summary:

This sample notebook demonstrates how to use batch transform to predict an abalone’s age, which is measured by the number of rings in its shell. The notebook uses the public abalone dataset, originally from the UCI Machine Learning Repository.

You can find more details about SageMaker’s Batch Transform here:

  • Batch Transform using a Transformer

We will use the reticulate library to interact with SageMaker:

  • `Reticulate library <https://rstudio.github.io/reticulate/>`__: provides an R interface to use the Amazon SageMaker Python SDK to make API calls to Amazon SageMaker. The reticulate package translates between R and Python objects, and Amazon SageMaker provides a serverless data science environment to train and deploy ML models at scale.

Table of Contents:

  • Reticulating the Amazon SageMaker Python SDK

  • Creating and Accessing the Data Storage

  • Downloading and Processing the Dataset

  • Preparing the Dataset for Model Training

  • Hyperparameter Tuning for the XGBoost Model

  • Batch Transform using SageMaker Transformer

  • Download the Batch Transform Output

Note: The first portion of this notebook, which focuses on data ingestion and preparing the data for model training, is adapted, with some modifications, from the data preparation section of the “Using R with Amazon SageMaker” notebook in the AWS SageMaker Examples GitHub repository.

Reticulating the Amazon SageMaker Python SDK

First, load the reticulate library and import the sagemaker Python module. Once the module is loaded, use the $ notation in R instead of the . notation in Python to use available classes.

[ ]:
# Turn warnings off globally
options(warn=-1)
[ ]:
# Install reticulate library and import sagemaker
library(reticulate)
sagemaker <- import('sagemaker')

Creating and Accessing the Data Storage

The Session class provides operations for working with boto3 resources with Amazon SageMaker, such as the default Amazon S3 bucket used throughout this notebook.

Let’s create an Amazon Simple Storage Service (Amazon S3) bucket for your data.

[ ]:
session <- sagemaker$Session()
bucket <- session$default_bucket()
prefix <- 'r-batch-transform'

Note - The default_bucket function creates (if it does not already exist) a unique Amazon S3 bucket with the following name:

sagemaker-<aws-region-name>-<aws account number>
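As an illustrative sketch (not part of the original notebook), the bucket name can be reproduced from the session region and the account number. This assumes boto3 is installed in the same Python environment as the SageMaker SDK and that AWS credentials are configured:

```r
library(reticulate)
boto3 <- import('boto3')

# Region comes from the SageMaker session; the account number from STS
region <- session$boto_region_name
account <- boto3$client('sts')$get_caller_identity()$Account

# Same pattern as the name default_bucket() produces
paste('sagemaker', region, account, sep = '-')
```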

Specify the IAM role’s ARN to allow Amazon SageMaker to access the Amazon S3 bucket. You can use the same IAM role used to create this Notebook:

[ ]:
role_arn <- sagemaker$get_execution_role()

Downloading and Processing the Dataset

The model uses the abalone dataset originally from the UCI Machine Learning Repository. First, download the data and start the exploratory data analysis. Use tidyverse packages to read, plot, and transform the data into ML format for Amazon SageMaker:

[ ]:
library(readr)
data_file <- 's3://sagemaker-sample-files/datasets/tabular/uci_abalone/abalone.csv'
abalone <- read_csv(file = sagemaker$s3$S3Downloader$read_file(data_file, sagemaker_session=session),
                    col_names = FALSE)
names(abalone) <- c('sex', 'length', 'diameter', 'height', 'whole_weight', 'shucked_weight', 'viscera_weight', 'shell_weight', 'rings')
head(abalone)

The output above shows that sex should be a factor but is currently stored as a character (F is female, M is male, and I is infant). Convert sex to a factor and view the statistical summary of the dataset:

[ ]:
abalone$sex <- as.factor(abalone$sex)
summary(abalone)

The summary above shows that the minimum value for height is 0.

Visually explore which abalones have height equal to 0 by plotting the relationship between rings and height for each value of sex:

[ ]:
library(ggplot2)
options(repr.plot.width = 5, repr.plot.height = 4)
ggplot(abalone, aes(x = height, y = rings, color = sex)) + geom_point() + geom_jitter()

The plot shows multiple outliers: two infant abalones with a height of 0 and a few female and male abalones with greater heights than the rest. Let’s filter out the two infant abalones with a height of 0.

[ ]:
library(dplyr)
abalone <- abalone %>%
  filter(height != 0)

Preparing the Dataset for Model Training

The model needs three datasets: one each for training, testing, and validation. First, convert sex into dummy variables and move the target, rings, to the first column. Amazon SageMaker’s built-in algorithms require the target to be in the first column of the dataset.

[ ]:
abalone <- abalone %>%
  mutate(female = as.integer(ifelse(sex == 'F', 1, 0)),
         male = as.integer(ifelse(sex == 'M', 1, 0)),
         infant = as.integer(ifelse(sex == 'I', 1, 0))) %>%
  select(-sex)
abalone <- abalone %>%
  select(rings:infant, length:shell_weight)
head(abalone)

Next, sample 70% of the data for training the ML algorithm. Split the remaining 30% into two halves, one for testing and one for validation:

[ ]:
abalone_train <- abalone %>%
  sample_frac(size = 0.7)
abalone <- anti_join(abalone, abalone_train)
abalone_test <- abalone %>%
  sample_frac(size = 0.5)
abalone_valid <- anti_join(abalone, abalone_test)

Upload the training and validation data to Amazon S3 so that you can train the model. First, write the training and validation datasets to the local filesystem in .csv format. Then, upload the two datasets to the Amazon S3 bucket into the data key:

[ ]:
write_csv(abalone_train, 'abalone_train.csv', col_names = FALSE)
write_csv(abalone_valid, 'abalone_valid.csv', col_names = FALSE)

# Remove target from test
write_csv(abalone_test[-1], 'abalone_test.csv', col_names = FALSE)
[ ]:
s3_train <- session$upload_data(path = 'abalone_train.csv',
                                bucket = bucket,
                                key_prefix = paste(prefix,'data', sep = '/'))
s3_valid <- session$upload_data(path = 'abalone_valid.csv',
                                bucket = bucket,
                                key_prefix = paste(prefix,'data', sep = '/'))

s3_test <- session$upload_data(path = 'abalone_test.csv',
                                bucket = bucket,
                                key_prefix = paste(prefix,'data', sep = '/'))

Finally, define the Amazon S3 input types for the Amazon SageMaker algorithm:

[ ]:
s3_train_input <- sagemaker$inputs$TrainingInput(s3_data = s3_train,
                                     content_type = 'csv')
s3_valid_input <- sagemaker$inputs$TrainingInput(s3_data = s3_valid,
                                     content_type = 'csv')

Hyperparameter Tuning for the XGBoost Model

Amazon SageMaker algorithms are packaged as Docker containers. To train an XGBoost model, retrieve the URI of the XGBoost training container in Amazon Elastic Container Registry (Amazon ECR) for the current AWS Region. We will use the latest version of the algorithm.

[ ]:
container <- sagemaker$image_uris$retrieve(framework='xgboost', region= session$boto_region_name, version='latest')
cat('XGBoost Container Image URL: ', container)

Define an Amazon SageMaker Estimator, which can train any supplied algorithm that has been containerized with Docker. When creating the Estimator, use the following arguments:

  • image_uri - The container image to use for training

  • role - The Amazon SageMaker service role

  • instance_count - The number of Amazon EC2 instances to use for training

  • instance_type - The type of Amazon EC2 instance to use for training

  • volume_size - The size in GB of the Amazon Elastic Block Store (Amazon EBS) volume to use for storing input data during training

  • max_run - The timeout in seconds for training

  • input_mode - The input mode that the algorithm supports

  • output_path - The Amazon S3 location for saving the training results (model artifacts and output files)

  • output_kms_key - The AWS Key Management Service (AWS KMS) key for encrypting the training output

  • base_job_name - The prefix for the name of the training job

  • sagemaker_session - The Session object that manages interactions with the Amazon SageMaker API

[ ]:
# Model artifacts and batch output
s3_output <- paste('s3:/', bucket, prefix,'output', sep = '/')
[ ]:
# Estimator
estimator <- sagemaker$estimator$Estimator(image_uri = container,
                                           role = role_arn,
                                           instance_count = 1L,
                                           instance_type = 'ml.m5.4xlarge',
                                           volume_size = 30L,
                                           max_run = 3600L,
                                           input_mode = 'File',
                                           output_path = s3_output,
                                           output_kms_key = NULL,
                                           base_job_name = NULL,
                                           sagemaker_session = NULL)

Note - The equivalent to None in Python is NULL in R.
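To see this mapping directly, reticulate can convert between the two representations (a quick illustration, not part of the original workflow):

```r
library(reticulate)

# R NULL converts to Python None, and back again
py_none <- r_to_py(NULL)
py_to_r(py_none)  # NULL
```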

Next, we specify the XGBoost hyperparameters for the estimator.

Once the Estimator and its hyperparameters are specified, you can train (or fit) the estimator.

[ ]:
# Set Hyperparameters
estimator$set_hyperparameters(eval_metric='rmse',
                              objective='reg:linear',
                              num_round=100L,
                              rate_drop=0.3,
                              tweedie_variance_power=1.4)
[ ]:
# Create a training job name
job_name <- paste('sagemaker-r-xgboost', format(Sys.time(), '%H-%M-%S'), sep = '-')

# Define the data channels for train and validation datasets
input_data <- list('train' = s3_train_input,
                   'validation' = s3_valid_input)

# train the estimator
estimator$fit(inputs = input_data, job_name = job_name)

Batch Transform using SageMaker Transformer

For more details on SageMaker Batch Transform, you can visit this example notebook on Amazon SageMaker Batch Transform.

In many situations, using a deployed model to make inferences is not the best option, especially when the goal is not online real-time inference but generating predictions from a trained model on a large dataset. In these situations, using Batch Transform can be more efficient and appropriate.

This section of the notebook explains how to set up the Batch Transform Job and generate predictions.

To do this, we need to identify the batch input data path in S3 and specify where generated predictions will be stored in S3.

[ ]:
# Define S3 path for Test data
s3_test_url <- paste('s3:/', bucket, prefix, 'data','abalone_test.csv', sep = '/')

Then we create a Transformer. Transformers take multiple parameters, including the following. For more details and the complete list, visit the documentation page.

  • model_name (str) – Name of the SageMaker model being used for the transform job.

  • instance_count (int) – Number of EC2 instances to use.

  • instance_type (str) – Type of EC2 instance to use, for example, ‘ml.c4.xlarge’.

  • output_path (str) – S3 location for saving the transform result. If not specified, results are stored to a default bucket.

  • base_transform_job_name (str) – Prefix for the transform job when the transform() method launches. If not specified, a default prefix will be generated based on the training image name that was used to train the model associated with the transform job.

  • sagemaker_session (sagemaker.session.Session) – Session object which manages interactions with Amazon SageMaker APIs and any other AWS services needed. If not specified, the estimator creates one using the default AWS configuration chain.

Once we create a Transformer we can transform the batch input.

[ ]:
# Define a transformer
transformer <- estimator$transformer(instance_count=1L,
                                     instance_type='ml.m4.xlarge',
                                     output_path = s3_output)
[ ]:
# Do the batch transform
transformer$transform(s3_test_url,
                     wait = TRUE)

Download the Batch Transform Output

[ ]:
# Download the output file from S3 to the local 'batch_output' folder using S3Downloader
sagemaker$s3$S3Downloader$download(paste(s3_output,"abalone_test.csv.out",sep = '/'),
                          "batch_output")
[ ]:
# Read the batch csv from sagemaker local files
library(readr)
predictions <- read_csv(file = 'batch_output/abalone_test.csv.out', col_names = 'predicted_rings')
head(predictions)

Column-bind the predicted rings to the test data:

[ ]:
# Concatenate predictions and test for comparison
abalone_predictions <- cbind(predicted_rings = predictions,
                             abalone_test)

# Convert predictions to integer
abalone_predictions$predicted_rings <- as.integer(abalone_predictions$predicted_rings)
head(abalone_predictions)
[ ]:
# Define a function to calculate RMSE
rmse <- function(m, o){
  sqrt(mean((m - o)^2))
}
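As a quick sanity check on synthetic values (not the abalone data), the function behaves as expected: identical vectors give an RMSE of 0, and a simple hand-computed case matches.

```r
# Root mean squared error between two numeric vectors
rmse <- function(m, o){
  sqrt(mean((m - o)^2))
}

# Identical vectors: RMSE is 0
stopifnot(rmse(c(1, 2, 3), c(1, 2, 3)) == 0)

# Hand-computed case: sqrt((3^2 + 4^2) / 2) = sqrt(12.5)
stopifnot(abs(rmse(c(0, 0), c(3, 4)) - sqrt(12.5)) < 1e-12)
```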
[ ]:
# Calculate RMSE
abalone_rmse <- rmse(abalone_predictions$rings, abalone_predictions$predicted_rings)
cat('RMSE for Batch Transform: ', round(abalone_rmse, digits = 2))