Develop, Train, Register and Batch Transform Scikit-Learn Random Forest
This notebook’s CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.
Doc https://sagemaker.readthedocs.io/en/stable/using_sklearn.html
SDK https://sagemaker.readthedocs.io/en/stable/sagemaker.sklearn.html
boto3 https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#client
In this notebook we show how to use Amazon SageMaker to train a Scikit-learn Random Forest model, register it in the SageMaker Model Registry, and run a Batch Transform job. More information on Scikit-Learn is available at https://scikit-learn.org/stable/index.html. We use the California Housing dataset, which is also available in Scikit-Learn: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html. The California Housing dataset was originally published in:
Pace, R. Kelley, and Ronald Barry. “Sparse spatial auto regressions.” Statistics & Probability Letters 33.3 (1997): 291-297.
Link to the paper: https://doi.org/10.1016/S0167-7152(96)00140-X
[ ]:
!pip install -U sagemaker
[ ]:
import sys
# install the scikit-learn release that matches the SageMaker SKLearn framework version "1.2-1"
!{sys.executable} -m pip install sagemaker scikit-learn==1.2.1 --upgrade
[ ]:
import datetime
import time
import tarfile
import boto3
import pandas as pd
import numpy as np
from sagemaker import get_execution_role
import sagemaker
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing
s3 = boto3.client("s3")
sm_boto3 = boto3.client("sagemaker")
sess = sagemaker.Session()
region = sess.boto_session.region_name
bucket = sess.default_bucket() # this could also be a hard-coded bucket name
print("Using bucket " + bucket)
Prepare data
We use the California housing dataset.
More info on the dataset:
This dataset was obtained from the StatLib repository: http://lib.stat.cmu.edu/datasets/
The target variable is the median house value for California districts.
This dataset was derived from the 1990 U.S. census, using one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people).
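For reference, the same dataset also ships with Scikit-Learn (hence the fetch_california_housing import above). Note that the Scikit-Learn loader returns derived per-household features (AveRooms, AveBedrms, AveOccup) rather than the raw block-group totals used in this notebook, and it downloads the data over the internet on first use, so below we fetch the raw file from a public SageMaker S3 bucket instead. An optional sketch:
[ ]:
# Optional: load the Scikit-Learn variant of the dataset for comparison.
# Its features are per-household averages, not the raw block-group totals used below.
california_sklearn = fetch_california_housing(as_frame=True)
california_sklearn.frame.head()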
[ ]:
s3 = boto3.client("s3")
s3.download_file(
f"sagemaker-example-files-prod-{region}",
"datasets/tabular/california_housing/cal_housing.tgz",
"cal_housing.tgz",
)
[ ]:
!tar -zxf cal_housing.tgz
[ ]:
columns = [
"longitude",
"latitude",
"housingMedianAge",
"totalRooms",
"totalBedrooms",
"population",
"households",
"medianIncome",
"medianHouseValue",
]
california_housing_df = pd.read_csv(
"CaliforniaHousing/cal_housing.data", names=columns, header=None
)
[ ]:
california_housing_df.head()
[ ]:
x_train, x_test = train_test_split(california_housing_df, test_size=0.25)
x_eval = x_test[
[
"longitude",
"latitude",
"housingMedianAge",
"totalRooms",
"totalBedrooms",
"population",
"households",
"medianIncome",
]
]
Let’s inspect the training dataset
[ ]:
x_train.head()
[ ]:
x_train.shape
Save training, testing, and evaluation data as CSV files and upload them to S3
[ ]:
x_train.to_csv("california_housing_train.csv")
x_test.to_csv("california_housing_test.csv")
x_eval.to_csv("california_housing_eval.csv", header=False, index=False)
Upload the training, testing, and evaluation data to S3 so that the SageMaker Training Job, and afterward the Batch Transform Job, can read them from there.
[ ]:
trainpath = sess.upload_data(
path="california_housing_train.csv", bucket=bucket, key_prefix="sagemaker/sklearn-train"
)
testpath = sess.upload_data(
path="california_housing_test.csv", bucket=bucket, key_prefix="sagemaker/sklearn-train"
)
print(trainpath)
print(testpath)
[ ]:
sess.upload_data(
path="california_housing_eval.csv", bucket=bucket, key_prefix="sagemaker/sklearn-eval"
)
eval_s3_prefix = f"s3://{bucket}/sagemaker/sklearn-eval/"
eval_s3_prefix
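Batch Transform will send each line of the evaluation file to the model as one CSV record, so the file must contain only the eight feature columns, with no header, index, or target column. As a quick local sanity check of the file we just uploaded:
[ ]:
# sanity check: the evaluation file should contain exactly 8 unnamed feature columns
pd.read_csv("california_housing_eval.csv", header=None).head()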
Writing a Script Mode script
The script below contains both training and inference functionality and can run both on SageMaker Training hardware and locally (desktop, SageMaker notebook, on premise, etc.); an optional local smoke test is sketched right after the script. Detailed guidance is available at https://sagemaker.readthedocs.io/en/stable/using_sklearn.html#preparing-the-scikit-learn-training-script
[ ]:
%%writefile script.py
import argparse
import joblib
import os
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
# inference functions ---------------
def model_fn(model_dir):
clf = joblib.load(os.path.join(model_dir, "model.joblib"))
return clf
if __name__ == "__main__":
print("extracting arguments")
parser = argparse.ArgumentParser()
# hyperparameters sent by the client are passed as command-line arguments to the script.
# to simplify the demo we don't use all sklearn RandomForest hyperparameters
parser.add_argument("--n-estimators", type=int, default=10)
parser.add_argument("--min-samples-leaf", type=int, default=3)
# Data, model, and output directories
parser.add_argument("--model-dir", type=str, default=os.environ.get("SM_MODEL_DIR"))
parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
parser.add_argument("--test", type=str, default=os.environ.get("SM_CHANNEL_TEST"))
parser.add_argument("--train-file", type=str, default="california_housing_train.csv")
parser.add_argument("--test-file", type=str, default="california_housing_test.csv")
parser.add_argument(
"--features", type=str
) # in this script we ask user to explicitly name features
parser.add_argument(
"--target", type=str
) # in this script we ask user to explicitly name the target
args, _ = parser.parse_known_args()
print("reading data")
train_df = pd.read_csv(os.path.join(args.train, args.train_file))
test_df = pd.read_csv(os.path.join(args.test, args.test_file))
print("building training and testing datasets")
X_train = train_df[args.features.split()]
X_test = test_df[args.features.split()]
y_train = train_df[args.target]
y_test = test_df[args.target]
# train
print("training model")
model = RandomForestRegressor(
n_estimators=args.n_estimators, min_samples_leaf=args.min_samples_leaf, n_jobs=-1
)
model.fit(X_train, y_train)
# print abs error
print("validating model")
abs_err = np.abs(model.predict(X_test) - y_test)
# print couple perf metrics
for q in [10, 50, 90]:
print("AE-at-" + str(q) + "th-percentile: " + str(np.percentile(a=abs_err, q=q)))
# persist model
path = os.path.join(args.model_dir, "model.joblib")
joblib.dump(model, path)
print("model persisted at " + path)
print(args.min_samples_leaf)
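Because the script reads its inputs from ordinary command-line arguments (the SageMaker environment variables are only used as defaults), you can smoke-test it outside of a Training Job. The cell below is an optional local sketch; it assumes the CSV files created earlier are still in the working directory and writes the model to a local model/ folder.
[ ]:
# Optional local smoke test of script.py using the CSVs created above
!mkdir -p model && {sys.executable} script.py --model-dir ./model --train . --test . --n-estimators 10 --min-samples-leaf 3 --features "longitude latitude housingMedianAge totalRooms totalBedrooms population households medianIncome" --target medianHouseValue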
Launching a SageMaker training job with the Python SDK
We will train two models: the first with 100 trees (n-estimators=100) and the second with 300 trees. The exact number of trees has no special meaning here; we simply want two distinct models so that we can register each of them as a version in SageMaker Model Registry.
Launch the 1st training job
Once we’ve defined our estimator, we specify the hyperparameters for the training job. The first model is trained with n-estimators set to 100.
[ ]:
# We use the Estimator from the SageMaker Python SDK
from sagemaker.sklearn.estimator import SKLearn
FRAMEWORK_VERSION = "1.2-1"
training_job_1_name = "sklearn-california-housing-1"
sklearn_estimator_1 = SKLearn(
entry_point="script.py",
role=get_execution_role(),
instance_count=1,
instance_type="ml.c5.xlarge",
framework_version=FRAMEWORK_VERSION,
base_job_name=training_job_1_name,
metric_definitions=[{"Name": "median-AE", "Regex": "AE-at-50th-percentile: ([0-9.]+).*$"}],
hyperparameters={
"n-estimators": 100,
"min-samples-leaf": 3,
"features": "longitude latitude housingMedianAge totalRooms totalBedrooms population households medianIncome",
"target": "medianHouseValue",
},
)
[ ]:
sklearn_estimator_1.fit({"train": trainpath, "test": testpath})
Create a Model Package Group for the trained model to be registered
Create a new Model Package Group or use an existing one to register the model
[ ]:
client = boto3.client("sagemaker")
model_package_group_name = "sklearn-california-housing-" + str(round(time.time()))
model_package_group_input_dict = {
"ModelPackageGroupName": model_package_group_name,
"ModelPackageGroupDescription": "My sample sklearn model package group",
}
create_model_package_group_response = client.create_model_package_group(
    **model_package_group_input_dict
)
model_package_group_arn = create_model_package_group_response["ModelPackageGroupArn"]
print(f"ModelPackageGroup Arn : {model_package_group_arn}")
Register the model of the 1st training job in the Model Registry
Once the model is registered, you will see it in the Model Registry tab of the SageMaker Studio UI. In this notebook the model is registered with approval_status set to "Approved". By default, a model is registered with approval_status set to "PendingManualApproval"; users can then navigate to the Model Registry and manually approve the model based on whatever criteria they have set for model evaluation, or do the same via the API (an optional sketch follows the registration cell below).
[ ]:
inference_instance_type = "ml.m5.xlarge"
model_package_1 = sklearn_estimator_1.register(
model_package_group_name=model_package_group_arn,
inference_instances=[inference_instance_type],
transform_instances=[inference_instance_type],
content_types=["text/csv"],
response_types=["text/csv"],
approval_status="Approved",
)
model_package_arn_1 = model_package_1.model_package_arn
print("Model Package ARN : ", model_package_arn_1)
Create a transform job with the default configurations from the model of the 1st training job
[ ]:
sklearn_1_transformer = model_package_1.transformer(
instance_count=1, instance_type=inference_instance_type
)
[ ]:
sklearn_1_transformer.transform(eval_s3_prefix, split_type="Line", content_type="text/csv")
Let’s inspect the output of the Batch Transform job in S3. It should contain the predicted median house value for each block group in the evaluation set; a sketch after the cells below lines the predictions up against the held-out targets.
[ ]:
sklearn_1_transformer.output_path
[ ]:
output_file_name = "california_housing_eval.csv.out"
[ ]:
!aws s3 cp {sklearn_1_transformer.output_path}/{output_file_name} .
[ ]:
pd.read_csv(output_file_name, sep=",", header=None)
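Since the evaluation file was derived from x_test with its row order preserved, and the transform job processes the single input file line by line, you can compare the predictions against the held-out targets for a quick accuracy check. This is an optional sketch; it assumes the output records come back in the same order as the input records.
[ ]:
# Optional: median absolute error of the Batch Transform predictions
# (assumes output order matches the input order of the single evaluation file)
predictions = pd.read_csv(output_file_name, header=None).values.flatten()
print("Median absolute error:", np.median(np.abs(predictions - x_test["medianHouseValue"].values)))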
Launch the 2nd training job
This time we will train with 300 trees (n-estimators=300).
[ ]:
training_job_2_name = "sklearn-california-housing-2"
sklearn_estimator_2 = SKLearn(
entry_point="script.py",
role=get_execution_role(),
instance_count=1,
instance_type="ml.c5.xlarge",
framework_version=FRAMEWORK_VERSION,
base_job_name=training_job_2_name,
metric_definitions=[{"Name": "median-AE", "Regex": "AE-at-50th-percentile: ([0-9.]+).*$"}],
hyperparameters={
"n-estimators": 300,
"min-samples-leaf": 3,
"features": "longitude latitude housingMedianAge totalRooms totalBedrooms population households medianIncome",
"target": "medianHouseValue",
},
)
[ ]:
sklearn_estimator_2.fit({"train": trainpath, "test": testpath})
Register the model of the 2nd training job in the Model Registry
[ ]:
inference_instance_type = "ml.c5.xlarge"
model_package_2 = sklearn_estimator_2.register(
model_package_group_name=model_package_group_arn,
inference_instances=[inference_instance_type],
transform_instances=[inference_instance_type],
content_types=["text/csv"],
response_types=["text/csv"],
approval_status="Approved",
)
model_package_arn_2 = model_package_2.model_package_arn
print("Model Package ARN : ", model_package_arn_2)
View Model Groups and Versions
You can view the details of a specific model version by using either the AWS SDK for Python (Boto3) or Amazon SageMaker Studio. To view the details of a model version with Boto3, call the list_model_packages method to list the model versions in a model group.
[ ]:
list_model_packages_response = client.list_model_packages(ModelPackageGroupName=model_package_group_arn)
list_model_packages_response
Let’s fetch the latest model version from the Model Package Group
[ ]:
latest_model_version_arn = list_model_packages_response["ModelPackageSummaryList"][0][
"ModelPackageArn"
]
print(latest_model_version_arn)
View the latest Model Version details
Call describe_model_package to see the details of the model version, passing in the ARN of a model version that you got in the output of the call to list_model_packages.
[ ]:
client.describe_model_package(ModelPackageName=latest_model_version_arn)
Create a transform job with the default configurations from the model of the 2nd training job
[ ]:
sklearn_2_transformer = model_package_2.transformer(
instance_count=1, instance_type=inference_instance_type
)
[ ]:
sklearn_2_transformer.transform(eval_s3_prefix, split_type="Line", content_type="text/csv")
Let’s inspect the output locations of both Batch Transform jobs in S3. Each job writes to its own S3 location, so the two output paths differ; the sketch after the cells below compares the two sets of predictions side by side.
[ ]:
sklearn_1_transformer.output_path
[ ]:
sklearn_2_transformer.output_path
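To compare the two models’ predictions side by side, you can download the second job’s output as well. This is an optional sketch; it assumes both transform jobs have completed and reuses the output file name from above.
[ ]:
# Optional: download the 2nd job's output and compare predictions from both models
!aws s3 cp {sklearn_2_transformer.output_path}/{output_file_name} california_housing_eval_2.csv.out
preds_1 = pd.read_csv(output_file_name, header=None)
preds_2 = pd.read_csv("california_housing_eval_2.csv.out", header=None)
pd.concat([preds_1, preds_2], axis=1, keys=["model_1", "model_2"]).head()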
Conclusion
In this notebook you successfully downloaded the California housing dataset and trained a model using the SageMaker Python SDK. Then you created a ModelPackageGroup, registered the model version in SageMaker Model Registry, and triggered a SageMaker Batch Transform Job to process the evaluation dataset from S3.
You trained another model, this time with 300 trees, registered this model version in SageMaker Model Registry, viewed the model versions, and again triggered a SageMaker Batch Transform Job to process the evaluation dataset from S3.
As next steps, you can try registering your own model in SageMaker Model Registry, and run a SageMaker Batch Transform Job on data you have on S3.
Notebook CI Test Results
This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.