AutoGluon Tabular with SageMaker




AutoGluon automates machine learning tasks enabling you to easily achieve strong predictive performance in your applications. With just a few lines of code, you can train and deploy high-accuracy deep learning models on tabular, image, and text data. This notebook shows how to use AutoGluon-Tabular with Amazon SageMaker by creating custom containers.
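
For orientation, here is a minimal sketch of AutoGluon-Tabular used directly, outside SageMaker (assuming the autogluon package is installed and your CSV files carry a label column named y):

from autogluon.tabular import TabularDataset, TabularPredictor

# Load the training table and fit an ensemble of models on the "y" label column
train_data = TabularDataset("train.csv")
predictor = TabularPredictor(label="y").fit(train_data)

# Predict on held-out data
predictions = predictor.predict(TabularDataset("test.csv"))

The rest of this notebook wraps this same workflow in custom SageMaker training and inference containers.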

Prerequisites

If using a SageMaker hosted notebook, select kernel conda_mxnet_p36.

[ ]:
import subprocess

# Make sure docker compose is set up properly for local mode
subprocess.run("./setup.sh", shell=True)
[ ]:
# For Studio
subprocess.run("apt-get update -y", shell=True)
subprocess.run("apt install unzip", shell=True)
subprocess.run("pip install ipywidgets", shell=True)
[ ]:
import os
import sys
import boto3
import sagemaker
from time import sleep
from collections import Counter
import numpy as np
import pandas as pd
from sagemaker import get_execution_role, local, Model, utils, s3
from sagemaker.estimator import Estimator
from sagemaker.predictor import Predictor
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import StringDeserializer
from sklearn.metrics import accuracy_score, classification_report
from IPython.display import display, HTML
from IPython.core.interactiveshell import InteractiveShell

# Print settings
InteractiveShell.ast_node_interactivity = "all"
pd.set_option("display.max_columns", 500)
pd.set_option("display.max_rows", 10)

# Account/s3 setup
session = sagemaker.Session()
local_session = local.LocalSession()
bucket = session.default_bucket()
prefix = "sagemaker/autogluon-tabular"
region = session.boto_region_name
role = get_execution_role()
client = session.boto_session.client(
    "sts", region_name=region, endpoint_url=utils.sts_regional_endpoint(region)
)
account = client.get_caller_identity()["Account"]

registry_uri_training = sagemaker.image_uris.retrieve(
    "mxnet",
    region,
    version="1.7.0",
    py_version="py3",
    instance_type="ml.m5.2xlarge",
    image_scope="training",
)
registry_uri_inference = sagemaker.image_uris.retrieve(
    "mxnet",
    region,
    version="1.7.0",
    py_version="py3",
    instance_type="ml.m5.2xlarge",
    image_scope="inference",
)
# Derive this account's ECR registry prefix, e.g. <account>.dkr.ecr.<region>.amazonaws.com
ecr_uri_prefix = account + "." + ".".join(registry_uri_training.split("/")[0].split(".")[1:])

Build docker images

Build the training and inference images and push them to ECR

[ ]:
training_algorithm_name = "autogluon-sagemaker-training"
inference_algorithm_name = "autogluon-sagemaker-inference"

First, you may want to remove existing Docker images to make room to build the AutoGluon containers.

[ ]:
subprocess.run("docker system prune -af", shell=True)
[ ]:
subprocess.run(
    f"/bin/bash ./container-training/build_push_training.sh {account} {region} {training_algorithm_name} {ecr_uri_prefix} {registry_uri_training.split('/')[0].split('.')[0]} {registry_uri_training}",
    shell=True,
)
subprocess.run("docker system prune -af", shell=True)
[ ]:
subprocess.run(
    f"/bin/bash ./container-inference/build_push_inference.sh {account} {region} {inference_algorithm_name} {ecr_uri_prefix} {registry_uri_training.split('/')[0].split('.')[0]} {registry_uri_inference}",
    shell=True,
)
subprocess.run("docker system prune -af", shell=True)

Alternative way of building Docker images using sm-docker

The Amazon SageMaker Studio Image Build convenience package lets data scientists and developers build custom container images from their Studio notebooks via a CLI. Newly built Docker images are tagged and pushed to Amazon ECR.

To use the CLI, ensure that the Amazon SageMaker execution role used by your Studio notebook environment (or another AWS Identity and Access Management (IAM) role, if you prefer) has the required permissions to interact with the resources used by the CLI, including CodeBuild and Amazon ECR. The role must also have a trust policy with CodeBuild (see the sketch below).

You also need to make sure the appropriate permissions are included in your role to run the build in CodeBuild, create a repository in Amazon ECR, and push images to that repository.

See also: https://aws.amazon.com/blogs/machine-learning/using-the-amazon-sagemaker-studio-image-build-cli-to-build-container-images-from-your-studio-notebooks/
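
As a reference point, here is a sketch of adding the CodeBuild trust relationship with boto3. The role name is hypothetical, and note that update_assume_role_policy replaces the existing trust policy wholesale, so this example keeps the sagemaker.amazonaws.com principal alongside CodeBuild; merge in any other statements your role already has before applying.

import json
import boto3

# Trust policy that lets both SageMaker and CodeBuild assume the role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": ["sagemaker.amazonaws.com", "codebuild.amazonaws.com"]},
            "Action": "sts:AssumeRole",
        }
    ],
}

iam = boto3.client("iam")
iam.update_assume_role_policy(
    RoleName="MySageMakerExecutionRole",  # hypothetical name; use your own role
    PolicyDocument=json.dumps(trust_policy),
)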

[ ]:
# subprocess.run("pip install sagemaker-studio-image-build", shell=True)
[ ]:
"""
training_repo_name = training_algorithm_name + ':latest'
training_repo_name

!sm-docker build . --repository {training_repo_name} \
--file ./container-training/Dockerfile.training --build-arg REGISTRY_URI={registry_uri_training}

inference_repo_name = inference_algorithm_name + ':latest'
inference_repo_name

!sm-docker build . --repository {inference_repo_name} \
--file ./container-inference/Dockerfile.inference --build-arg REGISTRY_URI={registry_uri_inference}
"""

Get the data

In this example we’ll use the direct-marketing dataset to build a binary classification model that predicts whether customers will accept or decline a marketing offer.
First we’ll download the data and split it into train and test sets. AutoGluon does not require a separate validation set (it uses bagged k-fold cross-validation).

[ ]:
# Download the data
subprocess.run("mkdir bank-additional", shell=True)
s3 = boto3.client("s3")
s3.download_file(
    f"sagemaker-example-files-prod-{region}",
    "datasets/tabular/uci_bank_marketing/bank-additional-full.csv",
    "bank-additional/bank-additional-full.csv",
)

local_data_path = "./bank-additional/bank-additional-full.csv"
data = pd.read_csv(local_data_path)

# Split train/test data
train = data.sample(frac=0.7, random_state=42)
test = data.drop(train.index)

# Split test X/y
label = "y"
y_test = test[label]
X_test = test.drop(columns=[label])

Check the data

[ ]:
train.head(3)
train.shape

test.head(3)
test.shape

X_test.head(3)
X_test.shape

Upload the data to S3

[ ]:
train_file = "train.csv"
train.to_csv(train_file, index=False)
train_s3_path = session.upload_data(train_file, key_prefix="{}/data".format(prefix))

test_file = "test.csv"
test.to_csv(test_file, index=False)
test_s3_path = session.upload_data(test_file, key_prefix="{}/data".format(prefix))

X_test_file = "X_test.csv"
X_test.to_csv(X_test_file, index=False)
X_test_s3_path = session.upload_data(X_test_file, key_prefix="{}/data".format(prefix))

Hyperparameter Selection

The only required setting for training is the target label, init_args['label'].

Additional optional hyperparameters can be passed to the autogluon.tabular.TabularPredictor.fit function via fit_args.

Below is a more in-depth example of AutoGluon-Tabular hyperparameters, drawn from the tutorial Predicting Columns in a Table - In Depth. Please see the fit parameters documentation for further information. Note that for hyperparameter ranges to work in SageMaker, values passed to fit_args['hyperparameters'] must be represented as strings.

nn_options = {
    'num_epochs': "10",
    'learning_rate': "ag.space.Real(1e-4, 1e-2, default=5e-4, log=True)",
    'activation': "ag.space.Categorical('relu', 'softrelu', 'tanh')",
    'layers': "ag.space.Categorical([100],[1000],[200,100],[300,200,100])",
    'dropout_prob': "ag.space.Real(0.0, 0.5, default=0.1)"
}

gbm_options = {
    'num_boost_round': "100",
    'num_leaves': "ag.space.Int(lower=26, upper=66, default=36)"
}

model_hps = {'NN': nn_options, 'GBM': gbm_options}

init_args = {
  'eval_metric': 'roc_auc',
  'label': 'y'
}

fit_args = {
  'presets': ['best_quality', 'optimize_for_deployment'],
  'time_limits': 60*10,
  'hyperparameters': model_hps,
  'hyperparameter_tune': True,
  'search_strategy': 'skopt'
}


hyperparameters = {
  'init_args': init_args,
  'fit_args': fit_args,
  'feature_importance': True
}

Note: Your hyperparameter choices may affect the size of the model package, which could result in additional time taken to upload your model and complete training. Including 'optimize_for_deployment' in the list of fit_args['presets'] is recommended to greatly reduce upload times.

[ ]:
# Define required label and optional additional parameters
init_args = {"label": "y"}

# Define additional parameters
fit_args = {
    # Adding 'best_quality' to presets list will result in better performance (but longer runtime)
    "presets": ["optimize_for_deployment"],
}

# Pass fit_args to SageMaker estimator hyperparameters
hyperparameters = {"init_args": init_args, "fit_args": fit_args, "feature_importance": True}

tags = [{"Key": "AlgorithmName", "Value": "AutoGluon-Tabular"}]

Train

For local training, set instance_type to local.
For non-local training, the recommended instance type is ml.m5.2xlarge.

Note: Depending on how many underlying models are trained, volume_size may need to be increased so that they all fit on disk.

[ ]:
%%time

instance_type = "ml.m5.2xlarge"
# instance_type = 'local'

ecr_image = f"{ecr_uri_prefix}/{training_algorithm_name}:latest"

estimator = Estimator(
    image_uri=ecr_image,
    role=role,
    instance_count=1,
    instance_type=instance_type,
    hyperparameters=hyperparameters,
    volume_size=100,
    tags=tags,
)

# Set inputs. Test data is optional, but requires a label column.
inputs = {"training": train_s3_path, "testing": test_s3_path}

estimator.fit(inputs)

Review the performance of the trained model

[ ]:
from utils.ag_utils import launch_viewer

launch_viewer(is_debug=False)

Create Model

[ ]:
# Create predictor object
class AutoGluonTabularPredictor(Predictor):
    def __init__(self, *args, **kwargs):
        super().__init__(
            *args, serializer=CSVSerializer(), deserializer=StringDeserializer(), **kwargs
        )
[ ]:
ecr_image = f"{ecr_uri_prefix}/{inference_algorithm_name}:latest"

if instance_type == "local":
    model = estimator.create_model(image_uri=ecr_image, role=role)
else:
    # model_uri = os.path.join(estimator.output_path, estimator._current_job_name, "output", "model.tar.gz")
    model_uri = estimator.model_data
    model = Model(
        ecr_image,
        model_data=model_uri,
        role=role,
        sagemaker_session=session,
        predictor_cls=AutoGluonTabularPredictor,
    )

Batch Transform

For local mode, either s3://<bucket>/<prefix>/output/ or file:///<absolute_local_path> can be used as the output path.

By including the label column in the test data, you can also evaluate prediction performance (in this case, pass test_s3_path instead of X_test_s3_path). A sketch for reading back and scoring the transform output follows the cell below.

[ ]:
output_path = f"s3://{bucket}/{prefix}/output/"
# output_path = f'file://{os.getcwd()}'

transformer = model.transformer(
    instance_count=1,
    instance_type=instance_type,
    strategy="MultiRecord",
    max_payload=6,
    max_concurrent_transforms=1,
    output_path=output_path,
)

transformer.transform(test_s3_path, content_type="text/csv", split_type="Line")
transformer.wait()
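
The transform output lands under output_path with .out appended to the input file name (SageMaker's default naming). As a sketch, assuming the S3 output path above (not local file:// output) and that the batch container emits the same <label>,<probability> CSV lines the endpoint returns later in this notebook, you could read the output back and score it:

# Read the batch transform output from S3 and compute accuracy against the labels
output_key = f"{prefix}/output/{test_file}.out"
obj = session.boto_session.resource("s3").Object(bucket, output_key)
lines = obj.get()["Body"].read().decode("utf-8").splitlines()

y_batch = np.array(["yes" if float(line.split(",")[1]) > 0.5 else "no" for line in lines])
print("batch accuracy:", accuracy_score(y_true=y_test, y_pred=y_batch))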

Endpoint

Deploy remote or local endpoint

[ ]:
instance_type = "ml.m5.2xlarge"
# instance_type = 'local'

predictor = model.deploy(initial_instance_count=1, instance_type=instance_type)

Attach to endpoint (or reattach if kernel was restarted)

[ ]:
# Select standard or local session based on instance_type
if instance_type == "local":
    sess = local_session
else:
    sess = session

# Attach to endpoint
predictor = AutoGluonTabularPredictor(predictor.endpoint_name, sagemaker_session=sess)

Predict on unlabeled test data

[ ]:
results = predictor.predict(X_test.to_csv(index=False)).splitlines()

# Check output
threshold = 0.5
y_results = np.array(["yes" if float(i.split(",")[1]) > threshold else "no" for i in results])

print(Counter(y_results))

Predict on data that includes label column

Prediction performance metrics will be printed to the endpoint logs; a sketch for retrieving them follows the cell below.

[ ]:
results = predictor.predict(test.to_csv(index=False)).splitlines()

# Check output
threshold = 0.5
y_results = np.array(["yes" if float(i.split(",")[1]) > threshold else "no" for i in results])

print(Counter(y_results))
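
For remote endpoints, the logs live in CloudWatch under SageMaker's standard endpoint log group (in local mode they print directly to the notebook). A sketch for pulling the most recent events, where the evaluation metrics appear:

# Fetch recent log events for this endpoint from CloudWatch Logs
logs_client = boto3.client("logs", region_name=region)
log_group = f"/aws/sagemaker/Endpoints/{predictor.endpoint_name}"
stream = logs_client.describe_log_streams(
    logGroupName=log_group, orderBy="LastEventTime", descending=True
)["logStreams"][0]["logStreamName"]
for event in logs_client.get_log_events(logGroupName=log_group, logStreamName=stream)["events"][-20:]:
    print(event["message"])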

Check that the classification performance metrics match the evaluation printed to the endpoint logs

[ ]:
threshold = 0.5
y_results = np.array(["yes" if float(i.split(",")[1]) > threshold else "no" for i in results])

print("accuracy: {}".format(accuracy_score(y_true=y_test, y_pred=y_results)))
print(classification_report(y_true=y_test, y_pred=y_results, digits=6))

Clean up endpoint

[ ]:
predictor.delete_endpoint()

Explainability with Amazon SageMaker Clarify

Growing business needs and legislative regulations require explanations of why a model made a certain decision. SHAP (SHapley Additive exPlanations) is an approach to explaining the output of machine learning models. SHAP values represent a feature’s contribution to a change in the model output. SageMaker Clarify uses SHAP to explain the contribution that each input feature makes to the final decision.
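
For reference, the Shapley value that SHAP approximates assigns feature $i$ its average marginal contribution to the model output $f$ over all subsets $S$ of the remaining features $F \setminus \{i\}$ (a standard formulation; Clarify estimates it with the Kernel SHAP approximation rather than exact enumeration):

$$\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,\bigl(|F| - |S| - 1\bigr)!}{|F|!}\,\bigl[f(S \cup \{i\}) - f(S)\bigr]$$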

[ ]:
seed = 0
num_rows = 500

# Write a csv file used by SageMaker Clarify
test_explainability_file = "test_explainability.csv"
train.head(num_rows).to_csv(test_explainability_file, index=False, header=False)
test_explainability_s3_path = session.upload_data(
    test_explainability_file, key_prefix="{}/data".format(prefix)
)
[ ]:
from sagemaker import clarify

model_name = estimator.latest_training_job.job_name
container_def = model.prepare_container_def()
session.create_model(model_name, role, container_def)

clarify_processor = clarify.SageMakerClarifyProcessor(
    role=role, instance_count=1, instance_type="ml.c4.xlarge", sagemaker_session=session
)
model_config = clarify.ModelConfig(
    model_name=model_name, instance_type="ml.c5.xlarge", instance_count=1, accept_type="text/csv"
)
[ ]:
shap_config = clarify.SHAPConfig(
    baseline=X_test.sample(15, random_state=seed).values.tolist(),
    num_samples=100,
    agg_method="mean_abs",
)

explainability_output_path = "s3://{}/{}/{}/clarify-explainability".format(
    bucket, prefix, model_name
)
explainability_data_config = clarify.DataConfig(
    s3_data_input_path=test_explainability_s3_path,
    s3_output_path=explainability_output_path,
    label="y",
    headers=train.columns.to_list(),
    dataset_type="text/csv",
)

predictions_config = clarify.ModelPredictedLabelConfig(probability_threshold=0.5)

clarify_processor.run_explainability(
    data_config=explainability_data_config,
    model_config=model_config,
    explainability_config=shap_config,
)

You can view the explainability report in Studio under the Experiments tab. If you’re not a Studio user, you can download the report from the S3 output path as shown below.

[ ]:
subprocess.run(f"aws s3 cp {explainability_output_path} . --recursive", shell=True)

Global explanation methods help you understand the model and its feature contributions in aggregate over multiple datapoints. Here we show a bar plot of the mean absolute SHAP value for each feature.

[ ]:
subprocess.run(f"{sys.executable} -m pip install shap", shell=True)
[ ]:
shap_values_ = pd.read_csv("explanations_shap/out.csv")
shap_values_.abs().mean().to_dict()
[ ]:
num_features = len(train.head(num_rows).drop(["y"], axis=1).columns)
[ ]:
import shap

shap_values = [shap_values_.to_numpy()[:, :num_features], shap_values_.to_numpy()[:, num_features:]]
shap.summary_plot(
    shap_values,
    plot_type="bar",
    feature_names=train.head(num_rows).drop(["y"], axis=1).columns.tolist(),
)

The detailed summary plot below provides more context than the bar chart above. It shows which features are most important and, in addition, their range of effects over the dataset. The color lets us see how changes in a feature’s value affect the prediction: red indicates a higher value of the feature and blue a lower one (normalized over the features).

[ ]:
shap.summary_plot(
    shap_values_[shap_values_.columns[20:]].to_numpy(), train.head(num_rows).drop(["y"], axis=1)
)
