AutoGluon-Tabular in AWS Marketplace
This notebook’s CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.
AutoGluon automates machine learning tasks enabling you to easily achieve strong predictive performance in your applications. With just a few lines of code, you can train and deploy high-accuracy deep learning models on tabular, image, and text data. This notebook shows how to use AutoGluon-Tabular in AWS Marketplace.
Contents:
Step 1: Subscribe to AutoML algorithm from AWS Marketplace
Step 2: Set up environment
Step 3: Get the data
Step 4: Train a model
Step 5: Deploy the model and perform a real-time inference
Step 6: Use Batch Transform
Step 7: Clean-up
Step 1: Subscribe to AutoML algorithm from AWS Marketplace
Read the Highlights section and then product overview section of the listing.
View usage information and then additional resources.
Note the supported instance types and specify the same in the following cell.
Next, click on Continue to subscribe.
Review the End User License Agreement (EULA), support terms, and pricing information.
Next, click the “Accept Offer” button only if your organization agrees with the EULA, pricing, and support terms. Once you have accepted the offer, specify the compatible training and inference instance types you wish to use.
Notes: 1. If the Continue to configuration button is active, your account already has a subscription to this listing. 2. Once you click the Continue to configuration button and choose a region, a product ARN will appear. This is the algorithm ARN that you need to specify in your training job. However, for this notebook the algorithm ARN is already specified in the src/algorithm_arns.py file, so you do not need to specify it explicitly.
Step 2: Set up environment
[ ]:
# Import necessary libraries.
import os
import boto3
import sagemaker
from time import sleep
from collections import Counter
import numpy as np
import pandas as pd
from sagemaker import get_execution_role
from sagemaker.algorithm import AlgorithmEstimator
from sagemaker.predictor import Predictor
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import StringDeserializer
from sklearn.metrics import accuracy_score, classification_report
from IPython.display import display, HTML
from IPython.core.interactiveshell import InteractiveShell
# Print settings
InteractiveShell.ast_node_interactivity = "all"
pd.set_option("display.max_columns", 500)
pd.set_option("display.max_rows", 10)
# Account/s3 setup
session = sagemaker.Session()
bucket = session.default_bucket()
prefix = "sagemaker/autogluon-tabular"
region = session.boto_region_name
role = get_execution_role()
[ ]:
compatible_training_instance_type = "ml.m5.4xlarge"
compatible_inference_instance_type = "ml.m5.4xlarge"
[ ]:
# Look up the algorithm ARN for AutoGluon-Tabular from AWS Marketplace. For this notebook, the
# per-region algorithm ARNs are defined in the src/algorithm_arns.py file, so you do not need to specify one explicitly.
from src.algorithm_arns import AlgorithmArnProvider
algorithm_arn = AlgorithmArnProvider.get_algorithm_arn(region)
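For reference, src/algorithm_arns.py simply maps your region to the Marketplace algorithm ARN for that region. The sketch below shows the kind of mapping such a provider contains; the ARN shown is a hypothetical placeholder, and the real per-region ARNs come from the listing's Continue to configuration page.
# Hypothetical sketch of a region-to-ARN provider; the real file ships with this notebook
# and contains the actual per-region ARNs for the AutoGluon-Tabular listing.
class AlgorithmArnProvider:
    _ARNS = {
        # Placeholder value for illustration only.
        "us-west-2": "arn:aws:sagemaker:us-west-2:123456789012:algorithm/example-autogluon-tabular",
    }

    @staticmethod
    def get_algorithm_arn(region):
        return AlgorithmArnProvider._ARNS[region]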
Step 3: Get the data
This example uses the UCI Adult Census Income dataset [1], which contains demographic and employment attributes and a binary label indicating whether a person's income exceeds $50K per year.
[1] Dua, D. and Graff, C. (2019). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.
[ ]:
adult_columns = [
"age",
"workclass",
"fnlwgt",
"education",
"education-num",
"marital-status",
"occupation",
"relationship",
"ethnic-group",
"sex",
"capital-gain",
"capital-loss",
"hours-per-week",
"native-country",
"class",
]
# Download the data
s3 = boto3.client("s3")
s3.download_file(
f"sagemaker-example-files-prod-{region}", "datasets/tabular/uci_adult/adult.data", "adult.data"
)
s3.download_file(
f"sagemaker-example-files-prod-{region}", "datasets/tabular/uci_adult/adult.test", "adult.test"
)
# Load the pre-split train/test data
train = pd.read_csv(
"adult.data", names=adult_columns, sep=r"\s*,\s*", engine="python", na_values="?"
)
test = pd.read_csv(
"adult.test", names=adult_columns, sep=r"\s*,\s*", engine="python", na_values="?", skiprows=1
)
# Split test X/y
label = "class"
y_test = test[label]
X_test = test.drop(columns=[label])
Check the data
[ ]:
train.head(3)
train.shape
test.head(3)
test.shape
X_test.head(3)
X_test.shape
Upload the data to S3
[ ]:
train_file = "train.csv"
train.to_csv(train_file, index=False)
train_s3_path = session.upload_data(train_file, key_prefix="{}/data".format(prefix))
test_file = "test.csv"
test.to_csv(test_file, index=False)
test_s3_path = session.upload_data(test_file, key_prefix="{}/data".format(prefix))
X_test_file = "X_test.csv"
X_test.to_csv(X_test_file, index=False)
X_test_s3_path = session.upload_data(X_test_file, key_prefix="{}/data".format(prefix))
Step 4: Train a model
Next, let us train a model.
Note: Depending on how many underlying models are trained, volume_size may need to be increased so that they all fit on disk.
[ ]:
# Define required label and optional additional parameters
init_args = {"label": "class"}
# Define additional parameters
fit_args = {
# Adding 'best_quality' to presets list will result in better performance (but longer runtime)
"presets": ["optimize_for_deployment"],
}
# Pass fit_args to SageMaker estimator hyperparameters
hyperparameters = {"init_args": init_args, "fit_args": fit_args, "feature_importance": True}
tags = [{"Key": "AlgorithmName", "Value": "AutoGluon-Tabular"}]
[ ]:
algo = AlgorithmEstimator(
algorithm_arn=algorithm_arn,
role=role,
instance_count=1,
instance_type=compatible_training_instance_type,
sagemaker_session=session,
base_job_name="autogluon",
hyperparameters=hyperparameters,
volume_size=100,
)
inputs = {"training": train_s3_path}
algo.fit(inputs)
Step 5: Deploy the model and perform a real-time inference
Deploy a remote endpoint
[ ]:
%%time
predictor = algo.deploy(
initial_instance_count=1,
instance_type=compatible_inference_instance_type,
serializer=CSVSerializer(),
deserializer=StringDeserializer(),
)
Predict on unlabeled test data
[ ]:
results = predictor.predict(X_test.to_csv(index=False)).splitlines()
# Check output
y_results = np.array([i.split(",")[0] for i in results])
print(Counter(y_results))
Predict on data that includes label column
Prediction performance metrics will be printed to endpoint logs.
[ ]:
results = predictor.predict(test.to_csv(index=False)).splitlines()
# Check output
y_results = np.array([i.split(",")[0] for i in results])
print(Counter(y_results))
Check that the classification performance metrics match the evaluation printed to the endpoint logs, as expected
[ ]:
y_results = np.array([i.split(",")[0] for i in results])
print("accuracy: {}".format(accuracy_score(y_true=y_test, y_pred=y_results)))
print(classification_report(y_true=y_test, y_pred=y_results, digits=6))
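To compare against the evaluation that the endpoint writes to its logs without opening the CloudWatch console, you can fetch the log events programmatically. This is a minimal sketch, assuming the default /aws/sagemaker/Endpoints/&lt;endpoint-name&gt; log group and that your role has CloudWatch Logs read permissions; adjust the event limit as needed.
[ ]:
# Read the most recent log stream for the endpoint and print its latest events.
logs_client = boto3.client("logs")
log_group = f"/aws/sagemaker/Endpoints/{predictor.endpoint_name}"
streams = logs_client.describe_log_streams(
    logGroupName=log_group, orderBy="LastEventTime", descending=True, limit=1
)
for stream in streams["logStreams"]:
    events = logs_client.get_log_events(
        logGroupName=log_group, logStreamName=stream["logStreamName"], limit=50
    )
    for event in events["events"]:
        print(event["message"])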
Step 6: Use Batch Transform
By including the label column in the test data, you can also evaluate prediction performance (in this case, passing test_s3_path instead of X_test_s3_path).
[ ]:
output_path = f"s3://{bucket}/{prefix}/output/"
transformer = algo.transformer(
instance_count=1,
instance_type=compatible_inference_instance_type,
strategy="MultiRecord",
max_payload=6,
max_concurrent_transforms=1,
output_path=output_path,
)
transformer.transform(test_s3_path, content_type="text/csv", split_type="Line")
transformer.wait()
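Once the transform job finishes, the predictions are written under output_path. The sketch below downloads and previews them; it assumes the default batch transform naming convention, in which each input object produces an output object with a .out suffix (here test.csv.out), and that the output can be read as headerless CSV.
[ ]:
# Download the batch transform output and preview the predictions.
s3_client = boto3.client("s3")
s3_client.download_file(bucket, f"{prefix}/output/test.csv.out", "test.csv.out")
batch_predictions = pd.read_csv("test.csv.out", header=None)
batch_predictions.head()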
Step 7: Clean-up
Once you have finished performing predictions, you can delete the endpoint to avoid incurring further charges.
[ ]:
predictor.delete_model()
predictor.delete_endpoint()
Finally, if you created the AWS Marketplace subscription just for this experiment and would like to unsubscribe from the product, follow the steps below. Before you cancel the subscription, ensure that you do not have any deployable model created from the model package or using the algorithm. Note: you can check this by looking at the container associated with each model.
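As a quick check, you can list the models in your account and inspect the container definition of each one, which shows the image or model package the model is backed by. This is a minimal sketch using the SageMaker boto3 client.
[ ]:
# List all models in this account/region and print the container details for each one.
sm_client = boto3.client("sagemaker")
for page in sm_client.get_paginator("list_models").paginate():
    for summary in page["Models"]:
        model = sm_client.describe_model(ModelName=summary["ModelName"])
        container = model.get("PrimaryContainer", {})
        print(
            summary["ModelName"],
            container.get("Image", ""),
            container.get("ModelPackageName", ""),
        )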
Steps to unsubscribe from the product on AWS Marketplace: 1. Navigate to the Machine Learning tab on the Your Software subscriptions page. 2. Locate the listing you want to cancel the subscription for, and then click Cancel Subscription.
This notebook was tested in multiple regions. The test results are as follows, except for us-west-2, which is shown at the top of the notebook.