Develop, Train, Register and Batch Transform Scikit-Learn Random Forest

In this notebook we show how to use Amazon SageMaker to train a Scikit-Learn Random Forest model, register it in the Model Registry, and run a Batch Transform Job. More info on Scikit-Learn is available here: https://scikit-learn.org/stable/index.html. We use the California Housing dataset, also available in Scikit-Learn: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html. The California Housing dataset was originally published in:

Pace, R. Kelley, and Ronald Barry. “Sparse spatial autoregressions.” Statistics & Probability Letters 33.3 (1997): 291-297.

Link to the paper: https://doi.org/10.1016/S0167-7152(96)00140-X

[ ]:
import sys

# install the SageMaker SDK and a local scikit-learn roughly matching the 1.0-1 training container used later
!{sys.executable} -m pip install --upgrade sagemaker "scikit-learn>=1.0"
[ ]:
import time

import boto3
import numpy as np
import pandas as pd
import sagemaker
from sagemaker import get_execution_role
from sklearn.model_selection import train_test_split

sess = sagemaker.Session()

region = sess.boto_session.region_name

bucket = sess.default_bucket()  # this could also be a hard-coded bucket name

print("Using bucket " + bucket)

Prepare data

We use the California housing dataset.

More info on the dataset:

This dataset was obtained from the StatLib repository. http://lib.stat.cmu.edu/datasets/

The target variable is the median house value for California districts.

This dataset was derived from the 1990 U.S. census, using one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people).
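
The same data also ships with Scikit-Learn via fetch_california_housing (linked above); note that copy rescales the target to units of $100,000. Below is a quick peek for reference only, since we work with the raw StatLib copy in the rest of the notebook:

[ ]:
# Optional: the Scikit-Learn copy of the dataset (not used in the rest of the notebook);
# its MedHouseVal target is expressed in units of $100,000
from sklearn.datasets import fetch_california_housing

fetch_california_housing(as_frame=True).frame.head()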

[ ]:
!aws s3 cp s3://sagemaker-sample-files/datasets/tabular/california_housing/cal_housing.tgz .
[ ]:
!tar -zxf cal_housing.tgz
[ ]:
columns = [
    "longitude",
    "latitude",
    "housingMedianAge",
    "totalRooms",
    "totalBedrooms",
    "population",
    "households",
    "medianIncome",
    "medianHouseValue",
]
california_housing_df = pd.read_csv(
    "CaliforniaHousing/cal_housing.data", names=columns, header=None
)
[ ]:
california_housing_df.head()
[ ]:
x_train, x_test = train_test_split(california_housing_df, test_size=0.25, random_state=42)

# evaluation data for Batch Transform: features only, no target column
x_eval = x_test[
    [
        "longitude",
        "latitude",
        "housingMedianAge",
        "totalRooms",
        "totalBedrooms",
        "population",
        "households",
        "medianIncome",
    ]
]

Let’s inspect the training dataset

[ ]:
x_train.head()
[ ]:
x_train.shape

Save the training, test, and evaluation datasets as CSV files and upload them to S3

[ ]:
x_train.to_csv("california_housing_train.csv")
x_test.to_csv("california_housing_test.csv")
# Batch Transform expects raw CSV rows: no header, no index
x_eval.to_csv("california_housing_eval.csv", header=False, index=False)

Upload the training and test data to S3 for the SageMaker Training Job; the evaluation data is uploaded separately for the Batch Transform Job later on.

[ ]:
trainpath = sess.upload_data(
    path="california_housing_train.csv", bucket=bucket, key_prefix="sagemaker/sklearn-train"
)

testpath = sess.upload_data(
    path="california_housing_test.csv", bucket=bucket, key_prefix="sagemaker/sklearn-train"
)

print(trainpath)
print(testpath)
[ ]:
sess.upload_data(
    path="california_housing_eval.csv", bucket=bucket, key_prefix="sagemaker/sklearn-eval"
)

eval_s3_prefix = f"s3://{bucket}/sagemaker/sklearn-eval/"
eval_s3_prefix

Writing a Script Mode script

The script below contains both training and inference functionality and can run both on SageMaker training instances and locally (desktop, SageMaker notebook, on premises, etc.); we smoke-test it locally right after writing it. Detailed guidance here https://sagemaker.readthedocs.io/en/stable/using_sklearn.html#preparing-the-scikit-learn-training-script

[ ]:
%%writefile script.py

import argparse
import joblib
import os

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


# inference functions ---------------
def model_fn(model_dir):
    clf = joblib.load(os.path.join(model_dir, "model.joblib"))
    return clf


if __name__ == "__main__":

    print("extracting arguments")
    parser = argparse.ArgumentParser()

    # hyperparameters sent by the client are passed as command-line arguments to the script.
    # to simplify the demo we don't use all sklearn RandomForest hyperparameters
    parser.add_argument("--n-estimators", type=int, default=10)
    parser.add_argument("--min-samples-leaf", type=int, default=3)

    # Data, model, and output directories
    parser.add_argument("--model-dir", type=str, default=os.environ.get("SM_MODEL_DIR"))
    parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--test", type=str, default=os.environ.get("SM_CHANNEL_TEST"))
    parser.add_argument("--train-file", type=str, default="california_housing_train.csv")
    parser.add_argument("--test-file", type=str, default="california_housing_test.csv")
    parser.add_argument(
        "--features", type=str
    )  # in this script we ask user to explicitly name features
    parser.add_argument(
        "--target", type=str
    )  # in this script we ask user to explicitly name the target

    args, _ = parser.parse_known_args()

    print("reading data")
    train_df = pd.read_csv(os.path.join(args.train, args.train_file))
    test_df = pd.read_csv(os.path.join(args.test, args.test_file))

    print("building training and testing datasets")
    X_train = train_df[args.features.split()]
    X_test = test_df[args.features.split()]
    y_train = train_df[args.target]
    y_test = test_df[args.target]

    # train
    print("training model")
    model = RandomForestRegressor(
        n_estimators=args.n_estimators, min_samples_leaf=args.min_samples_leaf, n_jobs=-1
    )

    model.fit(X_train, y_train)

    # print abs error
    print("validating model")
    abs_err = np.abs(model.predict(X_test) - y_test)

    # print couple perf metrics
    for q in [10, 50, 90]:
        print("AE-at-" + str(q) + "th-percentile: " + str(np.percentile(a=abs_err, q=q)))

    # persist model
    path = os.path.join(args.model_dir, "model.joblib")
    joblib.dump(model, path)
    print("model persisted at " + path)
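
Because the script reads everything it needs from command-line arguments, we can smoke-test it locally before launching a SageMaker job. A minimal sketch, assuming the CSV files written earlier are still in the working directory (the model artifact lands in the current directory):

[ ]:
# Local smoke test of script.py using the CSVs saved above; writes model.joblib to "."
!{sys.executable} script.py --n-estimators 10 \
    --model-dir . --train . --test . \
    --features "longitude latitude housingMedianAge totalRooms totalBedrooms population households medianIncome" \
    --target medianHouseValue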

Launching a SageMaker training job with the Python SDK

We will train two models: the first with 100 trees (n-estimators=100), and the second with 300 trees. The specific values are not important; we simply want two distinct models so that we can register each of them as its own version in SageMaker Model Registry.

Launch the 1st training job

We define the estimator and pass it the hyperparameters for the job. This first model trains with n-estimators set to 100.

[ ]:
# We use the Estimator from the SageMaker Python SDK
from sagemaker.sklearn.estimator import SKLearn

FRAMEWORK_VERSION = "1.0-1"
training_job_1_name = "sklearn-california-housing-1"

sklearn_estimator_1 = SKLearn(
    entry_point="script.py",
    role=get_execution_role(),
    instance_count=1,
    instance_type="ml.c5.xlarge",
    framework_version=FRAMEWORK_VERSION,
    base_job_name=training_job_1_name,
    metric_definitions=[{"Name": "median-AE", "Regex": "AE-at-50th-percentile: ([0-9.]+).*$"}],
    hyperparameters={
        "n-estimators": 100,
        "min-samples-leaf": 3,
        "features": "longitude latitude housingMedianAge totalRooms totalBedrooms population households medianIncome",
        "target": "medianHouseValue",
    },
)
[ ]:
sklearn_estimator_1.fit({"train": trainpath, "test": testpath})
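
Optionally, once the job completes you can pull the median-AE metric captured by the metric_definitions regex through the SDK (CloudWatch-backed, so values can take a few minutes to appear):

[ ]:
# Optional: inspect the median-AE metric emitted by the training job
sklearn_estimator_1.training_job_analytics.dataframe()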

Create a Model Package Group for the trained model to be registered

Create a new Model Package Group or use an existing one to register the model

[ ]:
client = boto3.client("sagemaker")

model_package_group_name = "sklearn-california-housing-" + str(round(time.time()))
model_package_group_input_dict = {
    "ModelPackageGroupName": model_package_group_name,
    "ModelPackageGroupDescription": "My sample sklearn model package group",
}

create_model_package_group_response = client.create_model_package_group(
    **model_package_group_input_dict
)
model_package_group_arn = create_model_package_group_response["ModelPackageGroupArn"]
print(f"ModelPackageGroup Arn : {model_package_group_arn}")

Register the model of the 1st training job in the Model Registry

By default, a model is registered with the approval_status set to PendingManualApproval; users can then navigate to the Model Registry to manually approve the model based on any criteria set for model evaluation, or do this via API. Here we register the model with the approval_status set to “Approved”. Once registered, you will see it in the Model Registry tab of the SageMaker Studio UI.
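
For reference, approving a version that is still in PendingManualApproval can be done through the API as in the sketch below; pending_model_package_arn is a hypothetical placeholder, so the call is left commented out:

[ ]:
# Sketch only: approve a pending model version via the API.
# `pending_model_package_arn` is a hypothetical placeholder.
# client.update_model_package(
#     ModelPackageArn=pending_model_package_arn,
#     ModelApprovalStatus="Approved",
# )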

[ ]:
inference_instance_type = "ml.m5.xlarge"
model_package_1 = sklearn_estimator_1.register(
    model_package_group_name=model_package_group_arn,
    inference_instances=[inference_instance_type],
    transform_instances=[inference_instance_type],
    content_types=["text/csv"],
    response_types=["text/csv"],
    approval_status="Approved",
)

model_package_arn_1 = model_package_1.model_package_arn
print("Model Package ARN : ", model_package_arn_1)

Create a transform job with the default configurations from the model of the 1st training job

[ ]:
sklearn_1_transformer = model_package_1.transformer(
    instance_count=1, instance_type=inference_instance_type
)
[ ]:
sklearn_1_transformer.transform(eval_s3_prefix, split_type="Line", content_type="text/csv")

Let’s inspect the output of the Batch Transform job in S3. Each line holds the predicted median house value for the corresponding block group in the evaluation file.

[ ]:
sklearn_1_transformer.output_path
[ ]:
output_file_name = "california_housing_eval.csv.out"
[ ]:
!aws s3 cp {sklearn_1_transformer.output_path}/{output_file_name} .
[ ]:
pd.read_csv(output_file_name, sep=",", header=None)
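
Because the evaluation file preserves the row order of x_test, we can sketch a quick sanity check of the predictions against the held-out targets (assuming the files written above are unchanged):

[ ]:
# Quick sanity check: median absolute error of the batch predictions
# against the held-out targets (row order matches x_test)
predictions = pd.read_csv(output_file_name, header=None).values.ravel()
print("median AE:", np.median(np.abs(predictions - x_test["medianHouseValue"].values)))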

Launch the 2nd training job

This time we will train with 300 trees (n-estimators=300).

[ ]:
training_job_2_name = "sklearn-california-housing-2"

sklearn_estimator_2 = SKLearn(
    entry_point="script.py",
    role=get_execution_role(),
    instance_count=1,
    instance_type="ml.c5.xlarge",
    framework_version=FRAMEWORK_VERSION,
    base_job_name=training_job_2_name,
    metric_definitions=[{"Name": "median-AE", "Regex": "AE-at-50th-percentile: ([0-9.]+).*$"}],
    hyperparameters={
        "n-estimators": 300,
        "min-samples-leaf": 3,
        "features": "longitude latitude housingMedianAge totalRooms totalBedrooms population households medianIncome",
        "target": "medianHouseValue",
    },
)
[ ]:
sklearn_estimator_2.fit({"train": trainpath, "test": testpath})

Register the model of the 2nd training job in the Model Registry

[ ]:
inference_instance_type = "ml.c5.xlarge"
model_package_2 = sklearn_estimator_2.register(
    model_package_group_name=model_package_group_arn,
    inference_instances=[inference_instance_type],
    transform_instances=[inference_instance_type],
    content_types=["text/csv"],
    response_types=["text/csv"],
    approval_status="Approved",
)

model_package_arn_2 = model_package_2.model_package_arn
print("Model Package ARN : ", model_package_arn_2)

View Model Groups and Versions

You can view the details of a specific model version by using either the AWS SDK for Python (Boto3) or Amazon SageMaker Studio. To view the details of a model version with Boto3, call the list_model_packages method to list the model versions in a model group.

[ ]:
list_model_packages_response = client.list_model_packages(
    ModelPackageGroupName=model_package_group_arn, SortBy="CreationTime", SortOrder="Descending"
)
list_model_packages_response

Let’s fetch the latest model version from the Model Package Group (the response above is sorted by creation time, newest first)

[ ]:
latest_model_version_arn = list_model_packages_response["ModelPackageSummaryList"][0][
    "ModelPackageArn"
]
print(latest_model_version_arn)

View the latest Model Version details

Call describe_model_package to see the details of a model version, passing in the ARN of a model version from the list_model_packages output.

[ ]:
client.describe_model_package(ModelPackageName=latest_model_version_arn)

Create a transform job with the default configurations from the model of the 2nd training job

[ ]:
sklearn_2_transformer = model_package_2.transformer(
    instance_count=1, instance_type=inference_instance_type
)
[ ]:
sklearn_2_transformer.transform(eval_s3_prefix, split_type="Line", content_type="text/csv")

Let’s inspect the output locations of both Batch Transform jobs in S3. Each job writes to its own S3 output location.

[ ]:
sklearn_1_transformer.output_path
[ ]:
sklearn_2_transformer.output_path
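
As a quick comparison, reusing the sanity check from earlier (same row-order assumption), we can download the second job’s output and compute its median absolute error:

[ ]:
# Compare the 300-tree model against the held-out targets
!aws s3 cp {sklearn_2_transformer.output_path}/{output_file_name} california_housing_eval_2.csv.out
predictions_2 = pd.read_csv("california_housing_eval_2.csv.out", header=None).values.ravel()
print("model 2 median AE:", np.median(np.abs(predictions_2 - x_test["medianHouseValue"].values)))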

Conclusion

In this notebook you downloaded the California housing dataset and trained a model using the SageMaker Python SDK. You then created a Model Package Group, registered the Model Version in SageMaker Model Registry, and triggered a SageMaker Batch Transform Job to process the evaluation dataset from S3.

You trained a second model, this time with 300 trees, registered that Model Version in SageMaker Model Registry, viewed the model versions, and again triggered a SageMaker Batch Transform Job to process the evaluation dataset from S3.

As next steps, you can try registering your own model in SageMaker Model Registry, and run a SageMaker Batch Transform Job on data you have on S3.
