Train a SKLearn Model using Script Mode

The aim of this notebook is to demonstrate how to train and deploy a scikit-learn model in Amazon SageMaker. The method used is called Script Mode, in which we write a script to train our model and submit it to the SageMaker Python SDK. For more information, feel free to read Using Scikit-learn with the SageMaker Python SDK.

Runtime

This notebook takes approximately 15 minutes to run.

Contents

  1. Download data

  2. Prepare data

  3. Train model

  4. Deploy and test endpoint

  5. Cleanup

Download data

Download the Iris Data Set, which is the data used to trained the model in this demo.

[2]:
import boto3
import pandas as pd
import numpy as np

s3 = boto3.client("s3")
s3.download_file(f"sagemaker-sample-files", "datasets/tabular/iris/iris.data", "iris.data")

df = pd.read_csv(
    "iris.data", header=None, names=["sepal_len", "sepal_wid", "petal_len", "petal_wid", "class"]
)
df.head()
[2]:
sepal_len sepal_wid petal_len petal_wid class
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa

Prepare data

Next, we prepare the data for training by first converting the labels from string to integers. Then we split the data into a train dataset (80% of the data) and test dataset (the remaining 20% of the data) before saving them into CSV files. Then, these files are uploaded to S3 where the SageMaker SDK can access and use them to train the model.

[3]:
# Convert the three classes from strings to integers in {0,1,2}
df["class_cat"] = df["class"].astype("category").cat.codes
categories_map = dict(enumerate(df["class"].astype("category").cat.categories))
print(categories_map)
df.head()
{0: 'Iris-setosa', 1: 'Iris-versicolor', 2: 'Iris-virginica'}
[3]:
sepal_len sepal_wid petal_len petal_wid class class_cat
0 5.1 3.5 1.4 0.2 Iris-setosa 0
1 4.9 3.0 1.4 0.2 Iris-setosa 0
2 4.7 3.2 1.3 0.2 Iris-setosa 0
3 4.6 3.1 1.5 0.2 Iris-setosa 0
4 5.0 3.6 1.4 0.2 Iris-setosa 0
[4]:
# Split the data into 80-20 train-test split
num_samples = df.shape[0]
split = round(num_samples * 0.8)
train = df.iloc[:split, :]
test = df.iloc[split:, :]
print("{} train, {} test".format(split, num_samples - split))
120 train, 30 test
[5]:
# Write train and test CSV files
train.to_csv("train.csv", index=False)
test.to_csv("test.csv", index=False)
[6]:
# Create a sagemaker session to upload data to S3
import sagemaker

sagemaker_session = sagemaker.Session()

# Upload data to default S3 bucket
prefix = "DEMO-sklearn-iris"
training_input_path = sagemaker_session.upload_data("train.csv", key_prefix=prefix + "/training")

Train model

The model is trained using the SageMaker SDK’s Estimator class. Firstly, get the execution role for training. This role allows us to access the S3 bucket in the last step, where the train and test data set is located.

[7]:
# Use the current execution role for training. It needs access to S3
role = sagemaker.get_execution_role()
print(role)
arn:aws:iam::000000000000:role/ProdBuildSystemStack-ReleaseBuildRoleFB326D49-QK8LUA2UI1IC

Then, it is time to define the SageMaker SDK Estimator class. We use an Estimator class specifically desgined to train scikit-learn models called SKLearn. In this estimator, we define the following parameters: 1. The script that we want to use to train the model (i.e. entry_point). This is the heart of the Script Mode method. Additionally, set the script_mode parameter to True. 1. The role which allows us access to the S3 bucket containing the train and test data set (i.e. role) 1. How many instances we want to use in training (i.e. instance_count) and what type of instance we want to use in training (i.e. instance_type) 1. Which version of scikit-learn to use (i.e. framework_version) 1. Training hyperparameters (i.e. hyperparameters)

After setting these parameters, the fit function is invoked to train the model.

[8]:
# Docs: https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/sagemaker.sklearn.html

from sagemaker.sklearn import SKLearn

sk_estimator = SKLearn(
    entry_point="train.py",
    role=role,
    instance_count=1,
    instance_type="ml.c5.xlarge",
    py_version="py3",
    framework_version="1.0-1",
    script_mode=True,
    hyperparameters={"estimators": 20},
)

# Train the estimator
sk_estimator.fit({"train": training_input_path})
2022-04-18 00:12:36 Starting - Starting the training job...
2022-04-18 00:13:05 Starting - Preparing the instances for trainingProfilerReport-1650240755: InProgress
......
2022-04-18 00:14:06 Downloading - Downloading input data...
2022-04-18 00:14:34 Training - Downloading the training image.....2022-04-18 00:15:09,496 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2022-04-18 00:15:09,499 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2022-04-18 00:15:09,510 sagemaker_sklearn_container.training INFO     Invoking user training script.
2022-04-18 00:15:09,813 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2022-04-18 00:15:09,826 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2022-04-18 00:15:09,838 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2022-04-18 00:15:09,851 sagemaker-training-toolkit INFO     Invoking user script
Training Env:
{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "train": "/opt/ml/input/data/train"
    },
    "current_host": "algo-1",
    "framework_module": "sagemaker_sklearn_container.training:main",
    "hosts": [
        "algo-1"
    ],
    "hyperparameters": {
        "estimators": 20
    },
    "input_config_dir": "/opt/ml/input/config",
    "input_data_config": {
        "train": {
            "TrainingInputMode": "File",
            "S3DistributionType": "FullyReplicated",
            "RecordWrapperType": "None"
        }
    },
    "input_dir": "/opt/ml/input",
    "is_master": true,
    "job_name": "sagemaker-scikit-learn-2022-04-18-00-12-35-728",
    "log_level": 20,
    "master_hostname": "algo-1",
    "model_dir": "/opt/ml/model",
    "module_dir": "s3://sagemaker-us-west-2-000000000000/sagemaker-scikit-learn-2022-04-18-00-12-35-728/source/sourcedir.tar.gz",
    "module_name": "train",
    "network_interface_name": "eth0",
    "num_cpus": 4,
    "num_gpus": 0,
    "output_data_dir": "/opt/ml/output/data",
    "output_dir": "/opt/ml/output",
    "output_intermediate_dir": "/opt/ml/output/intermediate",
    "resource_config": {
        "current_host": "algo-1",
        "current_instance_type": "ml.c5.xlarge",
        "current_group_name": "homogeneousCluster",
        "hosts": [
            "algo-1"
        ],
        "instance_groups": [
            {
                "instance_group_name": "homogeneousCluster",
                "instance_type": "ml.c5.xlarge",
                "hosts": [
                    "algo-1"
                ]
            }
        ],
        "network_interface_name": "eth0"
    },
    "user_entry_point": "train.py"
}
Environment variables:
SM_HOSTS=["algo-1"]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={"estimators":20}
SM_USER_ENTRY_POINT=train.py
SM_FRAMEWORK_PARAMS={}
SM_RESOURCE_CONFIG={"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.c5.xlarge","hosts":["algo-1"],"instance_groups":[{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.c5.xlarge"}],"network_interface_name":"eth0"}
SM_INPUT_DATA_CONFIG={"train":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=["train"]
SM_CURRENT_HOST=algo-1
SM_MODULE_NAME=train
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=sagemaker_sklearn_container.training:main
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=4
SM_NUM_GPUS=0
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://sagemaker-us-west-2-000000000000/sagemaker-scikit-learn-2022-04-18-00-12-35-728/source/sourcedir.tar.gz
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{"train":"/opt/ml/input/data/train"},"current_host":"algo-1","framework_module":"sagemaker_sklearn_container.training:main","hosts":["algo-1"],"hyperparameters":{"estimators":20},"input_config_dir":"/opt/ml/input/config","input_data_config":{"train":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}},"input_dir":"/opt/ml/input","is_master":true,"job_name":"sagemaker-scikit-learn-2022-04-18-00-12-35-728","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-us-west-2-000000000000/sagemaker-scikit-learn-2022-04-18-00-12-35-728/source/sourcedir.tar.gz","module_name":"train","network_interface_name":"eth0","num_cpus":4,"num_gpus":0,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.c5.xlarge","hosts":["algo-1"],"instance_groups":[{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.c5.xlarge"}],"network_interface_name":"eth0"},"user_entry_point":"train.py"}
SM_USER_ARGS=["--estimators","20"]
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
SM_CHANNEL_TRAIN=/opt/ml/input/data/train
SM_HP_ESTIMATORS=20
PYTHONPATH=/opt/ml/code:/miniconda3/bin:/miniconda3/lib/python37.zip:/miniconda3/lib/python3.7:/miniconda3/lib/python3.7/lib-dynload:/miniconda3/lib/python3.7/site-packages
Invoking script with the following command:
/miniconda3/bin/python train.py --estimators 20
2022-04-18 00:15:11,397 sagemaker-containers INFO     Reporting training SUCCESS

2022-04-18 00:15:34 Uploading - Uploading generated training model
2022-04-18 00:15:34 Completed - Training job completed
Training seconds: 82
Billable seconds: 82

Deploy and test endpoint

After training the model, it is time to deploy it as an endpoint. To do so, we invoke the deploy function within the scikit-learn estimator. As shown in the code below, one can define the number of instances (i.e. initial_instance_count) and instance type (i.e. instance_type) used to deploy the model.

[9]:
import time

sk_endpoint_name = "sklearn-rf-model" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
sk_predictor = sk_estimator.deploy(
    initial_instance_count=1, instance_type="ml.m5.large", endpoint_name=sk_endpoint_name
)
------!

After the endpoint has been completely deployed, it can be invoked using the SageMaker Runtime Client (which is the method used in the code cell below) or Scikit Learn Predictor. If you plan to use the latter method, make sure to use a Serializer to serialize your data properly.

[10]:
import json

client = sagemaker_session.sagemaker_runtime_client

request_body = {"Input": [[9.0, 3571, 1976, 0.525]]}
data = json.loads(json.dumps(request_body))
payload = json.dumps(data)

response = client.invoke_endpoint(
    EndpointName=sk_endpoint_name, ContentType="application/json", Body=payload
)

result = json.loads(response["Body"].read().decode())["Output"]
print("Predicted class category {} ({})".format(result, categories_map[result]))
Predicted class category 1 (Iris-versicolor)

Cleanup

If the model and endpoint are no longer in use, they should be deleted to save costs and free up resources.

[11]:
sk_predictor.delete_model()
sk_predictor.delete_endpoint()