Train a SKLearn Model using Script Mode
The aim of this notebook is to demonstrate how to train and deploy a scikit-learn model in Amazon SageMaker. The method used is called Script Mode, in which we write a script to train our model and submit it to the SageMaker Python SDK. For more information, feel free to read Using Scikit-learn with the SageMaker Python SDK.
Runtime
This notebook takes approximately 15 minutes to run.
Download data
Download the Iris Data Set, which is the data used to train the model in this demo.
[ ]:
!pip install -U sagemaker
[2]:
import boto3
import pandas as pd
import numpy as np
s3 = boto3.client("s3")
s3.download_file("sagemaker-sample-files", "datasets/tabular/iris/iris.data", "iris.data")
df = pd.read_csv(
    "iris.data", header=None, names=["sepal_len", "sepal_wid", "petal_len", "petal_wid", "class"]
)
df.head()
[2]:
 | sepal_len | sepal_wid | petal_len | petal_wid | class |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
Prepare data
Next, we prepare the data for training by first converting the class labels from strings to integers. We then split the data into a training set (80% of the data) and a test set (the remaining 20%) and save each as a CSV file. Finally, the training file is uploaded to S3, where the SageMaker SDK can access it to train the model.
[3]:
# Convert the three classes from strings to integers in {0,1,2}
df["class_cat"] = df["class"].astype("category").cat.codes
categories_map = dict(enumerate(df["class"].astype("category").cat.categories))
print(categories_map)
df.head()
{0: 'Iris-setosa', 1: 'Iris-versicolor', 2: 'Iris-virginica'}
[3]:
 | sepal_len | sepal_wid | petal_len | petal_wid | class | class_cat |
---|---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa | 0 |
1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa | 0 |
2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa | 0 |
3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa | 0 |
4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa | 0 |
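The label encoding above relies on pandas assigning category codes in sorted order of the category values, not in order of first appearance. The following is a small, self-contained sketch of that behaviour and of mapping codes back to labels; the `labels` series is made-up demo data, not the Iris file.

```python
import pandas as pd

# Hypothetical demo labels, deliberately not in alphabetical order.
labels = pd.Series(["Iris-setosa", "Iris-virginica", "Iris-versicolor", "Iris-setosa"])

# Same technique as the notebook: cast to category, then take the integer codes.
as_category = labels.astype("category")
codes = as_category.cat.codes

# Build the code -> label lookup from the (sorted) category list.
categories_map = dict(enumerate(as_category.cat.categories))

# Decode the integer codes back into the original strings.
decoded = codes.map(categories_map)

print(categories_map)
print(list(codes))
print(list(decoded))
```

The same `categories_map` lookup is what later turns a predicted integer back into a class name.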
[4]:
# Split the data into 80-20 train-test split
num_samples = df.shape[0]
split = round(num_samples * 0.8)
train = df.iloc[:split, :]
test = df.iloc[split:, :]
print("{} train, {} test".format(split, num_samples - split))
120 train, 30 test
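Note that the split above slices rows in file order. If the file is grouped by class (the first rows shown earlier are all Iris-setosa, and the standard Iris file is ordered this way), the held-out 20% would contain only one class. As a hedged alternative that is not part of the original notebook, the rows could be shuffled before slicing. The helper name `shuffled_split`, the seed, and the demo frame below are assumptions made for illustration.

```python
import pandas as pd


def shuffled_split(df, train_frac=0.8, seed=42):
    """Shuffle the rows reproducibly, then slice into train and test parts."""
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    split = round(len(shuffled) * train_frac)
    return shuffled.iloc[:split, :], shuffled.iloc[split:, :]


# Hypothetical demo frame that, like the Iris file, is ordered by class.
demo = pd.DataFrame({"feature": range(10), "class_cat": [0] * 5 + [1] * 5})
train_demo, test_demo = shuffled_split(demo)
print("{} train, {} test".format(len(train_demo), len(test_demo)))
```

With the notebook's data this would be called as `train, test = shuffled_split(df)`; the 120/30 row counts stay the same, only the row membership changes.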
[5]:
# Write train and test CSV files
train.to_csv("train.csv", index=False)
test.to_csv("test.csv", index=False)
[6]:
# Create a sagemaker session to upload data to S3
import sagemaker
sagemaker_session = sagemaker.Session()
# Upload data to default S3 bucket
prefix = "DEMO-sklearn-iris"
training_input_path = sagemaker_session.upload_data("train.csv", key_prefix=prefix + "/training")
Train model
The model is trained using the SageMaker SDK’s Estimator class. First, get the execution role for training. This role grants access to the S3 bucket from the previous step, where the training data is located.
[7]:
# Use the current execution role for training. It needs access to S3
role = sagemaker.get_execution_role()
print(role)
arn:aws:iam::000000000000:role/ProdBuildSystemStack-ReleaseBuildRoleFB326D49-QK8LUA2UI1IC
Then, it is time to define the SageMaker SDK Estimator class. We use SKLearn, an Estimator class specifically designed to train scikit-learn models. In this estimator, we define the following parameters:
1. The script that we want to use to train the model (i.e. entry_point). This is the heart of the Script Mode method. Additionally, set the script_mode parameter to True.
2. The role which allows us access to the S3 bucket containing the training data (i.e. role).
3. How many instances we want to use in training (i.e. instance_count) and what type of instance we want to use in training (i.e. instance_type).
4. Which version of scikit-learn to use (i.e. framework_version).
5. Training hyperparameters (i.e. hyperparameters).
After setting these parameters, the fit function is invoked to train the model.
[8]:
# Docs: https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/sagemaker.sklearn.html
from sagemaker.sklearn import SKLearn
sk_estimator = SKLearn(
    entry_point="train.py",
    role=role,
    instance_count=1,
    instance_type="ml.c5.xlarge",
    py_version="py3",
    framework_version="1.2-1",
    script_mode=True,
    hyperparameters={"estimators": 20},
)
# Train the estimator
sk_estimator.fit({"train": training_input_path})
2022-04-18 00:12:36 Starting - Starting the training job...
2022-04-18 00:13:05 Starting - Preparing the instances for training
ProfilerReport-1650240755: InProgress
......
2022-04-18 00:14:06 Downloading - Downloading input data...
2022-04-18 00:14:34 Training - Downloading the training image.....
2022-04-18 00:15:09,496 sagemaker-containers INFO Imported framework sagemaker_sklearn_container.training
2022-04-18 00:15:09,499 sagemaker-training-toolkit INFO No GPUs detected (normal if no gpus installed)
2022-04-18 00:15:09,510 sagemaker_sklearn_container.training INFO Invoking user training script.
2022-04-18 00:15:09,813 sagemaker-training-toolkit INFO No GPUs detected (normal if no gpus installed)
2022-04-18 00:15:09,826 sagemaker-training-toolkit INFO No GPUs detected (normal if no gpus installed)
2022-04-18 00:15:09,838 sagemaker-training-toolkit INFO No GPUs detected (normal if no gpus installed)
2022-04-18 00:15:09,851 sagemaker-training-toolkit INFO Invoking user script
Training Env:
{
"additional_framework_parameters": {},
"channel_input_dirs": {
"train": "/opt/ml/input/data/train"
},
"current_host": "algo-1",
"framework_module": "sagemaker_sklearn_container.training:main",
"hosts": [
"algo-1"
],
"hyperparameters": {
"estimators": 20
},
"input_config_dir": "/opt/ml/input/config",
"input_data_config": {
"train": {
"TrainingInputMode": "File",
"S3DistributionType": "FullyReplicated",
"RecordWrapperType": "None"
}
},
"input_dir": "/opt/ml/input",
"is_master": true,
"job_name": "sagemaker-scikit-learn-2022-04-18-00-12-35-728",
"log_level": 20,
"master_hostname": "algo-1",
"model_dir": "/opt/ml/model",
"module_dir": "s3://sagemaker-us-west-2-000000000000/sagemaker-scikit-learn-2022-04-18-00-12-35-728/source/sourcedir.tar.gz",
"module_name": "train",
"network_interface_name": "eth0",
"num_cpus": 4,
"num_gpus": 0,
"output_data_dir": "/opt/ml/output/data",
"output_dir": "/opt/ml/output",
"output_intermediate_dir": "/opt/ml/output/intermediate",
"resource_config": {
"current_host": "algo-1",
"current_instance_type": "ml.c5.xlarge",
"current_group_name": "homogeneousCluster",
"hosts": [
"algo-1"
],
"instance_groups": [
{
"instance_group_name": "homogeneousCluster",
"instance_type": "ml.c5.xlarge",
"hosts": [
"algo-1"
]
}
],
"network_interface_name": "eth0"
},
"user_entry_point": "train.py"
}
Environment variables:
SM_HOSTS=["algo-1"]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={"estimators":20}
SM_USER_ENTRY_POINT=train.py
SM_FRAMEWORK_PARAMS={}
SM_RESOURCE_CONFIG={"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.c5.xlarge","hosts":["algo-1"],"instance_groups":[{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.c5.xlarge"}],"network_interface_name":"eth0"}
SM_INPUT_DATA_CONFIG={"train":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=["train"]
SM_CURRENT_HOST=algo-1
SM_MODULE_NAME=train
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=sagemaker_sklearn_container.training:main
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=4
SM_NUM_GPUS=0
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://sagemaker-us-west-2-000000000000/sagemaker-scikit-learn-2022-04-18-00-12-35-728/source/sourcedir.tar.gz
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{"train":"/opt/ml/input/data/train"},"current_host":"algo-1","framework_module":"sagemaker_sklearn_container.training:main","hosts":["algo-1"],"hyperparameters":{"estimators":20},"input_config_dir":"/opt/ml/input/config","input_data_config":{"train":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}},"input_dir":"/opt/ml/input","is_master":true,"job_name":"sagemaker-scikit-learn-2022-04-18-00-12-35-728","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-us-west-2-000000000000/sagemaker-scikit-learn-2022-04-18-00-12-35-728/source/sourcedir.tar.gz","module_name":"train","network_interface_name":"eth0","num_cpus":4,"num_gpus":0,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.c5.xlarge","hosts":["algo-1"],"instance_groups":[{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.c5.xlarge"}],"network_interface_name":"eth0"},"user_entry_point":"train.py"}
SM_USER_ARGS=["--estimators","20"]
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
SM_CHANNEL_TRAIN=/opt/ml/input/data/train
SM_HP_ESTIMATORS=20
PYTHONPATH=/opt/ml/code:/miniconda3/bin:/miniconda3/lib/python37.zip:/miniconda3/lib/python3.7:/miniconda3/lib/python3.7/lib-dynload:/miniconda3/lib/python3.7/site-packages
Invoking script with the following command:
/miniconda3/bin/python train.py --estimators 20
2022-04-18 00:15:11,397 sagemaker-containers INFO Reporting training SUCCESS
2022-04-18 00:15:34 Uploading - Uploading generated training model
2022-04-18 00:15:34 Completed - Training job completed
Training seconds: 82
Billable seconds: 82
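The log above shows the container running train.py with --estimators 20 and exposing SM_MODEL_DIR and SM_CHANNEL_TRAIN as environment variables. The script itself is not reproduced in this notebook, so the following is only a sketch of what such an entry point might look like. The choice of RandomForestClassifier (guessed from the estimators hyperparameter), the model.joblib file name, the fallback defaults, and the function names are all assumptions.

```python
import argparse
import os


def parse_args(argv=None):
    """Read hyperparameters from the command line and paths from SM_* variables."""
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as CLI flags, e.g. "--estimators 20".
    parser.add_argument("--estimators", type=int, default=15)
    # SageMaker sets these variables in the container; the fallbacks are assumptions.
    parser.add_argument("--model-dir", type=str, default=os.environ.get("SM_MODEL_DIR", "model"))
    parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN", "."))
    return parser.parse_args(argv)


def train(args):
    """Fit a classifier on train.csv and save it where SageMaker collects artifacts."""
    # Third-party imports are local so that parse_args can be tried without them.
    import joblib
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier  # assumed model type

    df = pd.read_csv(os.path.join(args.train, "train.csv"))
    features = df[["sepal_len", "sepal_wid", "petal_len", "petal_wid"]].to_numpy()
    target = df["class_cat"].to_numpy()

    model = RandomForestClassifier(n_estimators=args.estimators)
    model.fit(features, target)

    os.makedirs(args.model_dir, exist_ok=True)
    joblib.dump(model, os.path.join(args.model_dir, "model.joblib"))  # hypothetical name


# A real train.py would end with:
#     if __name__ == "__main__":
#         train(parse_args())
# Here only the argument parsing is exercised, mirroring the logged command line.
args = parse_args(["--estimators", "20"])
print(args.estimators, args.model_dir, args.train)
```

Keeping the paths as arguments with environment-variable defaults lets the same script run locally, outside a training job, by passing --train and --model-dir explicitly.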
Deploy and test endpoint
After training the model, it is time to deploy it as an endpoint. To do so, we invoke the deploy function within the scikit-learn estimator. As shown in the code below, one can define the number of instances (i.e. initial_instance_count) and instance type (i.e. instance_type) used to deploy the model.
[9]:
import time
# Append a UTC timestamp so that each run gets a unique endpoint name
sk_endpoint_name = "sklearn-rf-model-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
sk_predictor = sk_estimator.deploy(
    initial_instance_count=1, instance_type="ml.m5.large", endpoint_name=sk_endpoint_name
)
------!
After the endpoint has been completely deployed, it can be invoked either with the SageMaker Runtime client (the method used in the code cell below) or with the scikit-learn Predictor returned by deploy. If you plan to use the latter method, make sure to use a Serializer to serialize your data properly.
[10]:
import json
client = sagemaker_session.sagemaker_runtime_client
request_body = {"Input": [[9.0, 3571, 1976, 0.525]]}
# Serialize the request once; the endpoint expects a JSON string
payload = json.dumps(request_body)
response = client.invoke_endpoint(
    EndpointName=sk_endpoint_name, ContentType="application/json", Body=payload
)
result = json.loads(response["Body"].read().decode())["Output"]
print("Predicted class category {} ({})".format(result, categories_map[result]))
Predicted class category 1 (Iris-versicolor)
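The request and response above use custom JSON shapes ({"Input": ...} in, {"Output": ...} out), which suggests that the entry point script defines its own inference handlers. The SageMaker scikit-learn container looks for functions named model_fn, input_fn, predict_fn, and output_fn in that script. The bodies below are only a sketch of how they might be written for this payload; the model.joblib file name and the stub model used to exercise them are assumptions.

```python
import json
import os


def model_fn(model_dir):
    """Load the trained model from the directory SageMaker unpacks it into."""
    import joblib  # local import so the JSON handlers can be tried without it

    return joblib.load(os.path.join(model_dir, "model.joblib"))  # hypothetical name


def input_fn(request_body, request_content_type):
    """Turn the raw request into the list of feature rows under the "Input" key."""
    if request_content_type != "application/json":
        raise ValueError("Unsupported content type: {}".format(request_content_type))
    return json.loads(request_body)["Input"]


def predict_fn(input_data, model):
    """Run the model on the parsed feature rows."""
    return model.predict(input_data)


def output_fn(prediction, accept):
    """Return the first predicted class as JSON under the "Output" key."""
    return json.dumps({"Output": int(prediction[0])})


class _StubModel:
    """Stand-in for a fitted classifier; always predicts class 1."""

    def predict(self, rows):
        return [1 for _ in rows]


# Exercise the handlers locally, without an endpoint.
body = json.dumps({"Input": [[5.1, 3.5, 1.4, 0.2]]})
rows = input_fn(body, "application/json")
response = output_fn(predict_fn(rows, _StubModel()), "application/json")
print(response)
```

Running the handlers against a stub like this is a cheap way to check the JSON contract before paying for a deployment.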
Cleanup
If the model and endpoint are no longer in use, they should be deleted to save costs and free up resources.
[11]:
sk_predictor.delete_model()
sk_predictor.delete_endpoint()