AutoGluon Tabular with SageMaker
AutoGluon automates machine learning tasks, enabling you to achieve strong predictive performance in your applications. With just a few lines of code, you can train and deploy high-accuracy deep learning models on tabular, image, and text data. This notebook shows how to use AutoGluon-Tabular with Amazon SageMaker by creating custom containers.
Prerequisites
If using a SageMaker hosted notebook, select the conda_mxnet_p36 kernel.
[ ]:
import subprocess
# Make sure docker compose is set up properly for local mode
subprocess.run("./setup.sh", shell=True)
[ ]:
# For Studio
subprocess.run("apt-get update -y", shell=True)
subprocess.run("apt-get install -y unzip", shell=True)
subprocess.run("pip install ipywidgets", shell=True)
[ ]:
import os
import sys
import boto3
import sagemaker
from time import sleep
from collections import Counter
import numpy as np
import pandas as pd
from sagemaker import get_execution_role, local, Model, utils, s3
from sagemaker.estimator import Estimator
from sagemaker.predictor import Predictor
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import StringDeserializer
from sklearn.metrics import accuracy_score, classification_report
from IPython.display import display, HTML
from IPython.core.interactiveshell import InteractiveShell
# Print settings
InteractiveShell.ast_node_interactivity = "all"
pd.set_option("display.max_columns", 500)
pd.set_option("display.max_rows", 10)
# Account/s3 setup
session = sagemaker.Session()
local_session = local.LocalSession()
bucket = session.default_bucket()
prefix = "sagemaker/autogluon-tabular"
region = session.boto_region_name
role = get_execution_role()
client = session.boto_session.client(
"sts", region_name=region, endpoint_url=utils.sts_regional_endpoint(region)
)
account = client.get_caller_identity()["Account"]
registry_uri_training = sagemaker.image_uris.retrieve(
"mxnet",
region,
version="1.7.0",
py_version="py3",
instance_type="ml.m5.2xlarge",
image_scope="training",
)
registry_uri_inference = sagemaker.image_uris.retrieve(
"mxnet",
region,
version="1.7.0",
py_version="py3",
instance_type="ml.m5.2xlarge",
image_scope="inference",
)
ecr_uri_prefix = account + "." + ".".join(registry_uri_training.split("/")[0].split(".")[1:])
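The last line of the setup cell builds the registry prefix for your own account: it takes the host part of the AWS-managed MXNet image URI and replaces the leading account ID. A minimal sketch of that string manipulation follows. The account ID and image URI are made-up placeholders, and `build_ecr_prefix` is a hypothetical helper name, not part of the notebook.

```python
def build_ecr_prefix(account_id, image_uri):
    """Return '<account_id>.dkr.ecr.<region>.<domain>' derived from image_uri."""
    # Everything before the first "/" is the registry host.
    registry_host = image_uri.split("/")[0]
    # Drop the first dot-separated field (the owning account) and keep the rest.
    host_suffix = ".".join(registry_host.split(".")[1:])
    return f"{account_id}.{host_suffix}"


# Placeholder values for illustration only.
example_uri = "000000000000.dkr.ecr.us-west-2.amazonaws.com/mxnet-training:1.7.0-cpu-py3"
example_prefix = build_ecr_prefix("111111111111", example_uri)
print(example_prefix)
```

Deriving the suffix from the retrieved URI, rather than hard-coding it, keeps the region and domain consistent with whatever `image_uris.retrieve` returned.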
Build docker images
Build the training/inference image and push to ECR
[ ]:
training_algorithm_name = "autogluon-sagemaker-training"
inference_algorithm_name = "autogluon-sagemaker-inference"
First, you may want to remove existing Docker images to make room for building the AutoGluon containers.
[ ]:
subprocess.run("docker system prune -af", shell=True)
[ ]:
subprocess.run(
f"/bin/bash ./container-training/build_push_training.sh {account} {region} {training_algorithm_name} {ecr_uri_prefix} {registry_uri_training.split('/')[0].split('.')[0]} {registry_uri_training}",
shell=True,
)
subprocess.run("docker system prune -af", shell=True)
[ ]:
subprocess.run(
f"/bin/bash ./container-inference/build_push_inference.sh {account} {region} {inference_algorithm_name} {ecr_uri_prefix} {registry_uri_training.split('/')[0].split('.')[0]} {registry_uri_inference}",
shell=True,
)
subprocess.run("docker system prune -af", shell=True)
Alternative way of building docker images using sm-docker
The Amazon SageMaker Studio Image Build convenience package lets data scientists and developers build custom container images from Studio notebooks through a CLI. Newly built Docker images are tagged and pushed to Amazon ECR.
To use the CLI, the Amazon SageMaker execution role used by your Studio notebook environment (or another AWS Identity and Access Management (IAM) role, if you prefer) needs permissions to interact with the resources used by the CLI, including access to CodeBuild and Amazon ECR. The role must also have a trust policy that allows CodeBuild to assume it.
You also need to make sure the appropriate permissions are included in your role to run the build in CodeBuild, create a repository in Amazon ECR, and push images to that repository.
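As a rough illustration of the trust relationship described above, the sketch below builds an IAM trust policy statement that lets CodeBuild assume a role. It follows the standard IAM trust policy layout. Whether it is sufficient for your account is an assumption to verify against your own IAM setup, and the permissions policies for CodeBuild and Amazon ECR are not shown.

```python
import json

# Trust policy allowing the CodeBuild service to assume the role.
# On an existing SageMaker execution role, this statement would be added
# alongside the statements the role already has, not replace them.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": ["codebuild.amazonaws.com"]},
            "Action": "sts:AssumeRole",
        }
    ],
}

trust_policy_json = json.dumps(trust_policy, indent=2)
print(trust_policy_json)
```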
[ ]:
# subprocess.run("pip install sagemaker-studio-image-build", shell=True)
[ ]:
"""
training_repo_name = training_algorithm_name + ':latest'
training_repo_name
!sm-docker build . --repository {training_repo_name} \
--file ./container-training/Dockerfile.training --build-arg REGISTRY_URI={registry_uri_training}
inference_repo_name = inference_algorithm_name + ':latest'
inference_repo_name
!sm-docker build . --repository {inference_repo_name} \
--file ./container-inference/Dockerfile.inference --build-arg REGISTRY_URI={registry_uri_inference}
"""
Get the data
[ ]:
# Download the data
os.makedirs("bank-additional", exist_ok=True)
# Named s3_client so it does not shadow the `s3` module imported from sagemaker
s3_client = boto3.client("s3")
s3_client.download_file(
    f"sagemaker-example-files-prod-{region}",
    "datasets/tabular/uci_bank_marketing/bank-additional-full.csv",
    "bank-additional/bank-additional-full.csv",
)
local_data_path = "./bank-additional/bank-additional-full.csv"
data = pd.read_csv(local_data_path)
# Split train/test data
train = data.sample(frac=0.7, random_state=42)
test = data.drop(train.index)
# Split test X/y
label = "y"
y_test = test[label]
X_test = test.drop(columns=[label])
Check the data
[ ]:
train.head(3)
train.shape
test.head(3)
test.shape
X_test.head(3)
X_test.shape
Upload the data to s3
[ ]:
train_file = "train.csv"
train.to_csv(train_file, index=False)
train_s3_path = session.upload_data(train_file, key_prefix="{}/data".format(prefix))
test_file = "test.csv"
test.to_csv(test_file, index=False)
test_s3_path = session.upload_data(test_file, key_prefix="{}/data".format(prefix))
X_test_file = "X_test.csv"
X_test.to_csv(X_test_file, index=False)
X_test_s3_path = session.upload_data(X_test_file, key_prefix="{}/data".format(prefix))
Hyperparameter Selection
The only required setting for training is the target label, init_args['label'].
Additional optional hyperparameters can be passed to the autogluon.tabular.TabularPredictor.fit function via fit_args.
The following is a more in-depth example of AutoGluon-Tabular hyperparameters, taken from the example Predicting Columns in a Table - In Depth. See fit parameters for further information. Note that for hyperparameter ranges to work in SageMaker, values passed to fit_args['hyperparameters'] must be represented as strings.
nn_options = {
    'num_epochs': "10",
    'learning_rate': "ag.space.Real(1e-4, 1e-2, default=5e-4, log=True)",
    'activation': "ag.space.Categorical('relu', 'softrelu', 'tanh')",
    'layers': "ag.space.Categorical([100],[1000],[200,100],[300,200,100])",
    'dropout_prob': "ag.space.Real(0.0, 0.5, default=0.1)"
}
gbm_options = {
    'num_boost_round': "100",
    'num_leaves': "ag.space.Int(lower=26, upper=66, default=36)"
}
model_hps = {'NN': nn_options, 'GBM': gbm_options}
init_args = {
    'eval_metric': 'roc_auc',
    'label': 'y'
}
fit_args = {
    'presets': ['best_quality', 'optimize_for_deployment'],
    'time_limits': 60*10,
    'hyperparameters': model_hps,
    'hyperparameter_tune': True,
    'search_strategy': 'skopt'
}
hyperparameters = {
    'init_args': init_args,
    'fit_args': fit_args,
    'feature_importance': True
}
Note: Your hyperparameter choices may affect the size of the model package, which could result in additional time taken to upload your model and complete training. Including 'optimize_for_deployment' in the list of fit_args['presets'] is recommended to greatly reduce upload times.
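To illustrate why search spaces are written as strings: SageMaker delivers hyperparameters to the training container as string values, so nested settings have to survive a round trip through text. The sketch below mimics that round trip with `json`. The helper names are hypothetical, and it does not show how the AutoGluon container in this example turns strings such as "ag.space.Int(...)" back into search-space objects.

```python
import json


def encode_hyperparameters(hyperparameters):
    """Serialize each top-level value to a JSON string."""
    return {name: json.dumps(value) for name, value in hyperparameters.items()}


def decode_hyperparameters(encoded):
    """Parse each JSON string back into a Python object."""
    return {name: json.loads(text) for name, text in encoded.items()}


example = {
    "init_args": {"label": "y"},
    "fit_args": {
        "presets": ["optimize_for_deployment"],
        # The search space stays a plain string through the round trip.
        "hyperparameters": {
            "GBM": {"num_leaves": "ag.space.Int(lower=26, upper=66, default=36)"}
        },
    },
    "feature_importance": True,
}

encoded = encode_hyperparameters(example)
decoded = decode_hyperparameters(encoded)
print(encoded["fit_args"])
```

Because the search space is already a string, it passes through serialization unchanged, which a live Python object would not.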
[ ]:
# Define required label and optional additional parameters
init_args = {"label": "y"}
# Define additional parameters
fit_args = {
# Adding 'best_quality' to presets list will result in better performance (but longer runtime)
"presets": ["optimize_for_deployment"],
}
# Pass fit_args to SageMaker estimator hyperparameters
hyperparameters = {"init_args": init_args, "fit_args": fit_args, "feature_importance": True}
tags = [{"Key": "AlgorithmName", "Value": "AutoGluon-Tabular"}]
Train
For local training, set instance_type to local. For remote training, this notebook uses ml.m5.2xlarge.
Note: Depending on how many underlying models are trained, volume_size may need to be increased so that they all fit on disk.
[ ]:
%%time
instance_type = "ml.m5.2xlarge"
# instance_type = 'local'
ecr_image = f"{ecr_uri_prefix}/{training_algorithm_name}:latest"
estimator = Estimator(
image_uri=ecr_image,
role=role,
instance_count=1,
instance_type=instance_type,
hyperparameters=hyperparameters,
volume_size=100,
tags=tags,
)
# Set inputs. Test data is optional, but requires a label column.
inputs = {"training": train_s3_path, "testing": test_s3_path}
estimator.fit(inputs)
Review the performance of the trained model
[ ]:
from utils.ag_utils import launch_viewer
launch_viewer(is_debug=False)
Create Model
[ ]:
# Create predictor object
class AutoGluonTabularPredictor(Predictor):
    def __init__(self, *args, **kwargs):
        super().__init__(
            *args, serializer=CSVSerializer(), deserializer=StringDeserializer(), **kwargs
        )
[ ]:
ecr_image = f"{ecr_uri_prefix}/{inference_algorithm_name}:latest"
if instance_type == "local":
    model = estimator.create_model(image_uri=ecr_image, role=role)
else:
    # model_uri = os.path.join(estimator.output_path, estimator._current_job_name, "output", "model.tar.gz")
    model_uri = estimator.model_data
    model = Model(
        ecr_image,
        model_data=model_uri,
        role=role,
        sagemaker_session=session,
        predictor_cls=AutoGluonTabularPredictor,
    )
Batch Transform
For local mode, either s3://<bucket>/<prefix>/output/ or file:///<absolute_local_path> can be used as outputs.
By including the label column in the test data (that is, passing test_s3_path instead of X_test_s3_path), you can also evaluate prediction performance.
[ ]:
output_path = f"s3://{bucket}/{prefix}/output/"
# output_path = f'file://{os.getcwd()}'
transformer = model.transformer(
instance_count=1,
instance_type=instance_type,
strategy="MultiRecord",
max_payload=6,
max_concurrent_transforms=1,
output_path=output_path,
)
transformer.transform(test_s3_path, content_type="text/csv", split_type="Line")
transformer.wait()
Endpoint
Deploy remote or local endpoint
[ ]:
instance_type = "ml.m5.2xlarge"
# instance_type = 'local'
predictor = model.deploy(initial_instance_count=1, instance_type=instance_type)
Attach to endpoint (or reattach if kernel was restarted)
[ ]:
# Select standard or local session based on instance_type
if instance_type == "local":
    sess = local_session
else:
    sess = session
# Attach to endpoint
predictor = AutoGluonTabularPredictor(predictor.endpoint_name, sagemaker_session=sess)
Predict on unlabeled test data
[ ]:
results = predictor.predict(X_test.to_csv(index=False)).splitlines()
# Check output
threshold = 0.5
y_results = np.array(["yes" if float(i.split(",")[1]) > threshold else "no" for i in results])
print(Counter(y_results))
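The parsing step above treats each response line as comma-separated, with a probability in the second field that is compared against the threshold. The sketch below isolates that logic using made-up response lines. The exact output format of the inference container is an assumption here, and `labels_from_lines` is a hypothetical helper.

```python
from collections import Counter


def labels_from_lines(lines, threshold=0.5, positive="yes", negative="no"):
    """Map each 'field,probability' line to a label by thresholding the second field."""
    labels = []
    for line in lines:
        probability = float(line.split(",")[1])
        # Strictly greater than the threshold counts as positive.
        labels.append(positive if probability > threshold else negative)
    return labels


# Made-up response lines for illustration only.
sample_lines = ["no,0.08", "yes,0.91", "no,0.50", "yes,0.73"]
sample_labels = labels_from_lines(sample_lines)
print(Counter(sample_labels))
```

A probability exactly equal to the threshold is labeled negative, which matches the strict `>` comparison used in the cell above.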
Predict on data that includes label column
Prediction performance metrics will be printed to endpoint logs.
[ ]:
results = predictor.predict(test.to_csv(index=False)).splitlines()
# Check output
threshold = 0.5
y_results = np.array(["yes" if float(i.split(",")[1]) > threshold else "no" for i in results])
print(Counter(y_results))
Check that classification performance metrics match evaluation printed to endpoint logs as expected
[ ]:
threshold = 0.5
y_results = np.array(["yes" if float(i.split(",")[1]) > threshold else "no" for i in results])
print("accuracy: {}".format(accuracy_score(y_true=y_test, y_pred=y_results)))
print(classification_report(y_true=y_test, y_pred=y_results, digits=6))
Clean up endpoint
[ ]:
predictor.delete_endpoint()
Explainability with Amazon SageMaker Clarify
There are growing business needs and legislative regulations that require explanations of why a model made a certain decision. SHAP (SHapley Additive exPlanations) is an approach to explain the output of machine learning models. SHAP values represent a feature’s contribution to a change in the model output. SageMaker Clarify uses SHAP to explain the contribution that each input feature makes to the final decision.
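To make "a feature's contribution to a change in the model output" concrete, the sketch below computes exact Shapley values for a toy two-feature function by averaging each feature's marginal contribution over all feature orderings, relative to a baseline input. This brute-force version is for illustration only and is not what SageMaker Clarify runs; the toy model and the helper name `shapley_values` are assumptions.

```python
from itertools import permutations


def shapley_values(model, instance, baseline):
    """Exact Shapley values of `instance` relative to `baseline` for a callable model."""
    num_features = len(instance)
    totals = [0.0] * num_features
    orderings = list(permutations(range(num_features)))
    for order in orderings:
        current = list(baseline)
        previous_output = model(current)
        for feature in order:
            # Switch this feature from its baseline value to the instance value.
            current[feature] = instance[feature]
            output = model(current)
            totals[feature] += output - previous_output
            previous_output = output
    return [total / len(orderings) for total in totals]


def toy_model(x):
    # Linear terms plus an interaction term between the two features.
    return 2.0 * x[0] + 3.0 * x[1] + x[0] * x[1]


instance = [1.0, 1.0]
baseline = [0.0, 0.0]
values = shapley_values(toy_model, instance, baseline)
print(values)
```

The values sum to the difference between the model output for the instance and for the baseline, and the interaction term is split evenly between the two features.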
[ ]:
seed = 0
num_rows = 500
# Write a csv file used by SageMaker Clarify
test_explainavility_file = "test_explainavility.csv"
train.head(num_rows).to_csv(test_explainavility_file, index=False, header=False)
test_explainavility_s3_path = session.upload_data(
test_explainavility_file, key_prefix="{}/data".format(prefix)
)
[ ]:
from sagemaker import clarify
model_name = estimator.latest_training_job.job_name
container_def = model.prepare_container_def()
session.create_model(model_name, role, container_def)
clarify_processor = clarify.SageMakerClarifyProcessor(
role=role, instance_count=1, instance_type="ml.c4.xlarge", sagemaker_session=session
)
model_config = clarify.ModelConfig(
model_name=model_name, instance_type="ml.c5.xlarge", instance_count=1, accept_type="text/csv"
)
[ ]:
shap_config = clarify.SHAPConfig(
baseline=X_test.sample(15, random_state=seed).values.tolist(),
num_samples=100,
agg_method="mean_abs",
)
explainability_output_path = "s3://{}/{}/{}/clarify-explainability".format(
bucket, prefix, model_name
)
explainability_data_config = clarify.DataConfig(
s3_data_input_path=test_explainavility_s3_path,
s3_output_path=explainability_output_path,
label="y",
headers=train.columns.to_list(),
dataset_type="text/csv",
)
predictions_config = clarify.ModelPredictedLabelConfig(probability_threshold=0.5)
clarify_processor.run_explainability(
data_config=explainability_data_config,
model_config=model_config,
explainability_config=shap_config,
)
You can view the explainability report in Studio under the experiments tab. If you are not a Studio user, you can download the report from the S3 output path, as in the following cell.
[ ]:
subprocess.run(f"aws s3 cp {explainability_output_path} . --recursive", shell=True)
Global explanatory methods allow understanding the model and its feature contributions in aggregate over multiple datapoints. Here we show an aggregate bar plot that plots the mean absolute SHAP value for each feature.
[ ]:
subprocess.run(f"{sys.executable} -m pip install shap", shell=True)
[ ]:
shap_values_ = pd.read_csv("explanations_shap/out.csv")
shap_values_.abs().mean().to_dict()
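The aggregation in the cell above, the mean absolute SHAP value per column, can be sketched without pandas as follows. The rows and feature names are made-up illustration data, and `mean_abs_by_feature` is a hypothetical helper.

```python
def mean_abs_by_feature(rows, feature_names):
    """Average the absolute value of each column across all rows."""
    totals = [0.0] * len(feature_names)
    for row in rows:
        for index, value in enumerate(row):
            totals[index] += abs(value)
    return {name: total / len(rows) for name, total in zip(feature_names, totals)}


# Two made-up rows of per-feature SHAP values.
toy_rows = [[0.5, -1.0], [-1.5, 3.0]]
toy_importance = mean_abs_by_feature(toy_rows, ["feature_a", "feature_b"])
print(toy_importance)
```

Taking the absolute value before averaging keeps positive and negative contributions from cancelling, so the result reflects overall influence rather than direction.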
[ ]:
num_features = len(train.head(num_rows).drop(["y"], axis=1).columns)
[ ]:
import shap
shap_values = [shap_values_.to_numpy()[:, :num_features], shap_values_.to_numpy()[:, num_features:]]
shap.summary_plot(
shap_values,
plot_type="bar",
feature_names=train.head(num_rows).drop(["y"], axis=1).columns.tolist(),
)
The detailed summary plot below adds context to the bar chart above. It shows which features are most important and also the range of their effects over the dataset. The color shows how changes in a feature's value affect the prediction: red indicates a higher feature value and blue a lower one (normalized over the features).
[ ]:
shap.summary_plot(
    # Use num_features instead of the hard-coded 20 so the slice tracks the dataset
    shap_values_[shap_values_.columns[num_features:]].to_numpy(),
    train.head(num_rows).drop(["y"], axis=1),
)