Multi-model SageMaker Pipeline with Hyperparameter Tuning and Experiments


This notebook’s CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.



Before proceeding, please see the context of this notebook in README.md. This notebook has been tested in a SageMaker notebook instance using a kernel with at least Python 3.7 installed, e.g. conda_mxnet_latest_p37 or conda_python3. Make sure you have created a SageMaker project named restate outside of this notebook. We recommend creating the project from the SageMaker-provided "MLOps template for model building, training, and deployment". Note that this notebook will not create the SageMaker project for you.

Prerequisites

We create an S3 bucket with encryption enabled for additional security.

[ ]:
import boto3

AWS_ACCOUNT = boto3.client("sts").get_caller_identity()["Account"]
AWS_REGION = boto3.Session().region_name
BUCKET_NAME = "sagemaker-restate-{AWS_ACCOUNT}".format(AWS_ACCOUNT=AWS_ACCOUNT)

s3_client = boto3.client("s3")
# us-east-1 rejects an explicit LocationConstraint; omit the configuration there.
if AWS_REGION == "us-east-1":
    s3_client.create_bucket(Bucket=BUCKET_NAME)
else:
    s3_client.create_bucket(
        Bucket=BUCKET_NAME,
        CreateBucketConfiguration={"LocationConstraint": AWS_REGION},
    )
s3_client.put_bucket_encryption(
    Bucket=BUCKET_NAME,
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"},
            },
        ]
    },
)

We create IAM role AWSGlueServiceRole-restate.

[ ]:
import json

iam_client = boto3.client("iam")

glue_assume_role_policy_document = json.dumps(
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "glue.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }
)

response = iam_client.create_role(
    RoleName="AWSGlueServiceRole-restate", AssumeRolePolicyDocument=glue_assume_role_policy_document
)

iam_client.attach_role_policy(
    RoleName=response["Role"]["RoleName"], PolicyArn="arn:aws:iam::aws:policy/AmazonS3FullAccess"
)

iam_client.attach_role_policy(
    RoleName=response["Role"]["RoleName"],
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
)

We create IAM role AmazonSageMakerServiceCatalogProductsUseRole-restate.

[ ]:
sagemaker_assume_role_policy_document = json.dumps(
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "sagemaker.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }
)

response = iam_client.create_role(
    RoleName="AmazonSageMakerServiceCatalogProductsUseRole-restate",
    AssumeRolePolicyDocument=sagemaker_assume_role_policy_document,
)

iam_client.attach_role_policy(
    RoleName=response["Role"]["RoleName"],
    PolicyArn="arn:aws:iam::aws:policy/AmazonAthenaFullAccess",
)

iam_client.attach_role_policy(
    RoleName=response["Role"]["RoleName"],
    PolicyArn="arn:aws:iam::aws:policy/AmazonSageMakerFullAccess",
)

Prepare Athena table

At this point, it is assumed that S3 bucket sagemaker-restate-<AWS ACCOUNT ID> and the necessary IAM roles are created. For the complete list of prerequisites, please see README.md.

We move the raw data to S3 bucket sagemaker-restate-<AWS ACCOUNT ID>.

[ ]:
%%sh

AWS_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
BUCKET_NAME="sagemaker-restate-${AWS_ACCOUNT}"

aws s3 cp s3://sagemaker-sample-files/datasets/tabular/california_housing/cal_housing.tgz .
tar -zxof cal_housing.tgz
aws s3 cp CaliforniaHousing/cal_housing.data s3://${BUCKET_NAME}/raw/california/



The step below creates a Glue database and table containing the raw data by running a Glue crawler. It is recommended to configure Glue encryption for additional security.

[ ]:
import boto3

AWS_ACCOUNT = boto3.client("sts").get_caller_identity()["Account"]
BUCKET_NAME = "sagemaker-restate-{AWS_ACCOUNT}".format(AWS_ACCOUNT=AWS_ACCOUNT)
DATABASE_NAME = "restate"
TABLE_NAME = "california"

glue_client = boto3.client("glue")

try:
    response = glue_client.create_database(DatabaseInput={"Name": DATABASE_NAME})
    print("Successfully created database")
except Exception as e:
    print("Error in creating database: {ERROR}".format(ERROR=e))
[ ]:
# This assumes the Glue service role name is AWSGlueServiceRole-restate
try:
    response = glue_client.create_crawler(
        Name="{DATABASE_NAME}-{TABLE_NAME}".format(
            DATABASE_NAME=DATABASE_NAME, TABLE_NAME=TABLE_NAME
        ),
        Role="AWSGlueServiceRole-restate",
        DatabaseName=DATABASE_NAME,
        Targets={
            "S3Targets": [
                {
                    "Path": "s3://{BUCKET_NAME}/raw/california/".format(BUCKET_NAME=BUCKET_NAME),
                }
            ]
        },
    )
    print("Successfully created crawler")
except Exception as e:
    print("Error in creating crawler: {ERROR}".format(ERROR=e))
[ ]:
try:
    response = glue_client.start_crawler(
        Name="{DATABASE_NAME}-{TABLE_NAME}".format(
            DATABASE_NAME=DATABASE_NAME, TABLE_NAME=TABLE_NAME
        )
    )
    print("Successfully started crawler")
except Exception as e:
    print("Error in starting crawler: {ERROR}".format(ERROR=e))

Once the crawler run is complete, the table california in the database restate should be visible in the Glue catalog. We rename the Glue table columns for readability.

[ ]:
import time

while True:
    crawler = glue_client.get_crawler(
        Name="{DATABASE_NAME}-{TABLE_NAME}".format(
            DATABASE_NAME=DATABASE_NAME, TABLE_NAME=TABLE_NAME
        )
    )
    if crawler["Crawler"]["State"] == "READY":
        break
    print("Waiting for the crawler run to be completed..")
    time.sleep(60)

response = glue_client.get_table(DatabaseName=DATABASE_NAME, Name=TABLE_NAME)
glue_table = response["Table"]
glue_table["StorageDescriptor"]["Columns"][0]["Name"] = "longitude"
glue_table["StorageDescriptor"]["Columns"][1]["Name"] = "latitude"
glue_table["StorageDescriptor"]["Columns"][2]["Name"] = "housingMedianAge"
glue_table["StorageDescriptor"]["Columns"][3]["Name"] = "totalRooms"
glue_table["StorageDescriptor"]["Columns"][4]["Name"] = "totalBedrooms"
glue_table["StorageDescriptor"]["Columns"][5]["Name"] = "population"
glue_table["StorageDescriptor"]["Columns"][6]["Name"] = "households"
glue_table["StorageDescriptor"]["Columns"][7]["Name"] = "medianIncome"
glue_table["StorageDescriptor"]["Columns"][8]["Name"] = "medianHouseValue"
glue_client.update_table(
    DatabaseName=DATABASE_NAME,
    TableInput={"Name": TABLE_NAME, "StorageDescriptor": glue_table["StorageDescriptor"]},
)
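The repetitive renaming above can be condensed into a loop. A minimal sketch, assuming the same column order; the helper name is ours, not a Glue API:

```python
import copy

CALIFORNIA_COLUMNS = [
    "longitude", "latitude", "housingMedianAge", "totalRooms",
    "totalBedrooms", "population", "households", "medianIncome",
    "medianHouseValue",
]


def rename_columns(storage_descriptor, new_names):
    """Return a copy of a Glue StorageDescriptor with its columns renamed in order."""
    sd = copy.deepcopy(storage_descriptor)
    for column, name in zip(sd["Columns"], new_names):
        column["Name"] = name
    return sd
```

The renamed descriptor can then be passed to `glue_client.update_table` exactly as in the cell above.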

Table california in database restate should be visible in Athena. We keep only the rows where housingmedianage > 10.

Make sure Athena query result location and encryption settings are updated accordingly before proceeding to the next step.

[ ]:
query = "CREATE TABLE restate.california_10 AS SELECT * FROM restate.california where housingmedianage > 10;"
output = "s3://{BUCKET_NAME}/athena".format(BUCKET_NAME=BUCKET_NAME)

athena_client = boto3.client("athena")

try:
    response = athena_client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": DATABASE_NAME},
        ResultConfiguration={
            "OutputLocation": output,
        },
    )
except Exception as e:
    print("Error running the query: {ERROR}".format(ERROR=e))
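Note that `start_query_execution` returns immediately; the CTAS table only exists once the query succeeds. A hedged polling sketch (the helper name is ours), which also works with any stand-in client during testing:

```python
import time


def wait_for_query(client, query_execution_id, poll_seconds=5):
    """Poll Athena until the query reaches a terminal state; return that state.

    `client` is any object exposing get_query_execution(QueryExecutionId=...),
    such as boto3.client("athena").
    """
    while True:
        result = client.get_query_execution(QueryExecutionId=query_execution_id)
        state = result["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state
        time.sleep(poll_seconds)


# state = wait_for_query(athena_client, response["QueryExecutionId"])
```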

Prepare Decision Tree custom Docker image

We build a Docker image containing a custom algorithm based on the scikit-learn Decision Tree Regressor. Note that the Docker image has been modified to support hyperparameter tuning and validation data.

[ ]:
! sudo yum install docker -y
[ ]:
%%sh

# The name of our algorithm
ALGORITHM_NAME=restate-decision-trees

cd container

chmod +x decision_trees/train
chmod +x decision_trees/serve

AWS_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
AWS_REGION=$(aws configure get region)

IMAGE_FULLNAME="${AWS_ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/${ALGORITHM_NAME}:latest"

# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --repository-names "${ALGORITHM_NAME}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${ALGORITHM_NAME}" > /dev/null
fi

# Get the login command from ECR and execute it directly
aws ecr get-login-password --region ${AWS_REGION}|docker login --username AWS --password-stdin ${IMAGE_FULLNAME}

# Build the docker image locally with the image name and then push it to ECR
# with the full name.

docker build  -t ${ALGORITHM_NAME} .
docker tag ${ALGORITHM_NAME} ${IMAGE_FULLNAME}
docker push ${IMAGE_FULLNAME}

Once the Docker image is pushed to the ECR repository, we make it accessible from SageMaker.

[ ]:
%%sh

# The name of our algorithm
SM_IMAGE_NAME=restate-dtree
AWS_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)

# This assumes the role name is AmazonSageMakerServiceCatalogProductsUseRole-restate
ROLE_ARN="arn:aws:iam::${AWS_ACCOUNT}:role/AmazonSageMakerServiceCatalogProductsUseRole-restate"

aws sagemaker create-image \
    --image-name ${SM_IMAGE_NAME} \
    --role-arn ${ROLE_ARN}

[ ]:
%%sh
AWS_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
ALGORITHM_NAME=restate-decision-trees
AWS_REGION=$(aws configure get region)
SM_IMAGE_NAME=restate-dtree
SM_BASE_IMAGE="${AWS_ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/${ALGORITHM_NAME}:latest"

aws sagemaker create-image-version \
    --image-name ${SM_IMAGE_NAME} \
    --base-image ${SM_BASE_IMAGE}
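The `SM_BASE_IMAGE` value above follows ECR's standard URI format for private repositories. As a small sketch (the helper is ours), the same URI can be composed in Python:

```python
def ecr_image_uri(account, region, repository, tag="latest"):
    """Compose a fully qualified ECR image URI for a private repository."""
    return "{account}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}".format(
        account=account, region=region, repository=repository, tag=tag
    )
```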

Start the SageMaker pipeline

Manually update restate-athena-california.flow with the queryString and s3OutputLocation of your choice. This must be done outside of this Jupyter notebook. Once done, proceed to create your pipeline.

[ ]:
! pip install sagemaker-pipeline/

Verify that you can successfully run get-pipeline-definition.

[ ]:
! get-pipeline-definition --help

At this point, it is assumed that you have already created a SageMaker project named restate and a pipeline named sagemaker-restate.

[ ]:
%%sh

# This assumes the SageMaker pipeline role name is AmazonSageMakerServiceCatalogProductsUseRole-restate

AWS_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
AWS_REGION=$(aws configure get region)
SAGEMAKER_PROJECT_NAME=restate
SAGEMAKER_PROJECT_ID=$(aws sagemaker describe-project --project-name ${SAGEMAKER_PROJECT_NAME} --query 'ProjectId' | tr -d '"')
echo ${SAGEMAKER_PROJECT_ID}
SAGEMAKER_PROJECT_ARN="arn:aws:sagemaker:${AWS_REGION}:${AWS_ACCOUNT}:project/${SAGEMAKER_PROJECT_NAME}"
SAGEMAKER_PIPELINE_ROLE_ARN="arn:aws:iam::${AWS_ACCOUNT}:role/AmazonSageMakerServiceCatalogProductsUseRole-restate"
SAGEMAKER_PIPELINE_NAME="sagemaker-${SAGEMAKER_PROJECT_NAME}"
ARTIFACT_BUCKET="sagemaker-project-${SAGEMAKER_PROJECT_ID}"
SAGEMAKER_PROJECT_NAME_ID="${SAGEMAKER_PROJECT_NAME}-${SAGEMAKER_PROJECT_ID}"

run-pipeline --module-name pipelines.restate.pipeline \
  --role-arn $SAGEMAKER_PIPELINE_ROLE_ARN \
  --tags "[{\"Key\":\"sagemaker:project-name\", \"Value\":\"${SAGEMAKER_PROJECT_NAME}\"}, {\"Key\":\"sagemaker:project-id\", \"Value\":\"${SAGEMAKER_PROJECT_ID}\"}]" \
  --kwargs "{\"region\":\"${AWS_REGION}\",\"sagemaker_project_arn\":\"${SAGEMAKER_PROJECT_ARN}\",\"role\":\"${SAGEMAKER_PIPELINE_ROLE_ARN}\",\"default_bucket\":\"${ARTIFACT_BUCKET}\",\"pipeline_name\":\"${SAGEMAKER_PROJECT_NAME_ID}\",\"model_package_group_name\":\"${SAGEMAKER_PROJECT_NAME_ID}\",\"base_job_prefix\":\"${SAGEMAKER_PROJECT_NAME_ID}\"}"
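The escaped `--kwargs` string above is easy to get wrong by hand. As a hedged alternative, the JSON can be generated with `json.dumps` and pasted into the command (the values below are placeholders, not real project IDs):

```python
import json

# Placeholder values standing in for the shell variables above.
project_name_id = "restate-p-abcde12345"
kwargs = {
    "region": "us-west-2",
    "sagemaker_project_arn": "arn:aws:sagemaker:us-west-2:123456789012:project/restate",
    "role": "arn:aws:iam::123456789012:role/AmazonSageMakerServiceCatalogProductsUseRole-restate",
    "default_bucket": "sagemaker-project-p-abcde12345",
    "pipeline_name": project_name_id,
    "model_package_group_name": project_name_id,
    "base_job_prefix": project_name_id,
}
print(json.dumps(kwargs))  # pass this string to run-pipeline --kwargs
```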

If you inspect the pipeline, you will see that the XGBoost model performs better than the decision tree model. Therefore, the XGBoost model is registered in the model registry.

You can experiment with the data, e.g. keep only rows where housingmedianage > 50, by changing the Athena query in restate-athena-california.flow, and then check whether XGBoost is still the winning model.

Deploy the winning model

Make sure to update MODEL_VERSION to your desired version. We assume model version 1 is approved.

[ ]:
from sagemaker import get_execution_role, session
import boto3

role = get_execution_role()
sm_client = boto3.client("sagemaker")

MODEL_VERSION = "1"
SAGEMAKER_PROJECT_NAME = "restate"
SAGEMAKER_PROJECT_ID = sm_client.describe_project(ProjectName=SAGEMAKER_PROJECT_NAME)["ProjectId"]
AWS_ACCOUNT = boto3.client("sts").get_caller_identity()["Account"]
AWS_REGION = boto3.Session().region_name
MODEL_PACKAGE_ARN = "arn:aws:sagemaker:{AWS_REGION}:{AWS_ACCOUNT}:model-package/{SAGEMAKER_PROJECT_NAME}-{SAGEMAKER_PROJECT_ID}/{MODEL_VERSION}".format(
    AWS_REGION=AWS_REGION,
    AWS_ACCOUNT=AWS_ACCOUNT,
    SAGEMAKER_PROJECT_NAME=SAGEMAKER_PROJECT_NAME,
    SAGEMAKER_PROJECT_ID=SAGEMAKER_PROJECT_ID,
    MODEL_VERSION=MODEL_VERSION,
)


model_package_update_response = sm_client.update_model_package(
    ModelPackageArn=MODEL_PACKAGE_ARN, ModelApprovalStatus="Approved"
)
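The model package ARN built above follows a fixed format (`model-package/<group>/<version>`). As a sketch (the helper is ours), composing it with a function makes approving other versions less error-prone:

```python
def model_package_arn(region, account, group_name, version):
    """Compose a SageMaker model package ARN for a registry group and version."""
    return "arn:aws:sagemaker:{region}:{account}:model-package/{group}/{version}".format(
        region=region, account=account, group=group_name, version=version
    )
```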

At this point, you can deploy the approved model version by going through the steps below, or using MLOps template for model deployment.

[ ]:
from time import gmtime, strftime

model_name = "restate-modelregistry-model-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print("Model name : {}".format(model_name))
container_list = [{"ModelPackageName": MODEL_PACKAGE_ARN}]

create_model_response = sm_client.create_model(
    ModelName=model_name, ExecutionRoleArn=role, Containers=container_list
)
print("Model arn : {}".format(create_model_response["ModelArn"]))
[ ]:
endpoint_config_name = "restate-modelregistry-EndpointConfig-" + strftime(
    "%Y-%m-%d-%H-%M-%S", gmtime()
)
print(endpoint_config_name)
create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "InstanceType": "ml.m5.large",
            "InitialVariantWeight": 1,
            "InitialInstanceCount": 1,
            "ModelName": model_name,
            "VariantName": "AllTraffic",
        }
    ],
)
[ ]:
endpoint_name = "restate-staging"
print("EndpointName={}".format(endpoint_name))

create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)


# `time` was imported earlier in the notebook; re-import in case of a kernel restart.
import time

while True:
    endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)
    if endpoint["EndpointStatus"] == "InService":
        break
    print("Waiting for the endpoint to be in service...")
    time.sleep(60)

print("Endpoint arn : {}".format(create_endpoint_response["EndpointArn"]))

Inference

Use the following data for inference:

-117.18,32.75,52.0,1504.0,208.0,518.0,196.0

This is a census block group with longitude -117.18, latitude 32.75, housing median age of 52.0, total rooms of 1504, total bedrooms of 208, population of 518, and households count of 196.

Let’s see its predicted value using our generated model.

[ ]:
import json

sm_runtime = boto3.client("runtime.sagemaker")
line = "-117.18,32.75,52.0,1504.0,208.0,518.0,196.0"
response = sm_runtime.invoke_endpoint(EndpointName=endpoint_name, ContentType="text/csv", Body=line)
result = json.loads(response["Body"].read().decode())
print(result)

Now you try:

-117.17,32.76,45.0,3149.0,639.0,1160.0,661.0

This is a census block group with longitude -117.17, latitude 32.76, housing median age of 45.0, total rooms of 3149, total bedrooms of 639, population of 1160, and households count of 661.
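The second row can be sent exactly like the first. As a small sketch (the helper is ours), a function can serialize any feature vector into the `text/csv` payload the endpoint expects:

```python
def to_csv_line(features):
    """Serialize a feature vector into a text/csv endpoint payload."""
    return ",".join(str(value) for value in features)


payload = to_csv_line([-117.17, 32.76, 45.0, 3149.0, 639.0, 1160.0, 661.0])
# response = sm_runtime.invoke_endpoint(
#     EndpointName=endpoint_name, ContentType="text/csv", Body=payload
# )
```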

Cleanup

Cleanup the Glue database, table, crawler, and S3 buckets used.

Cleanup the ECR and SageMaker images created.

Cleanup the SageMaker model and endpoint resources.

[ ]:
sm_client.delete_endpoint(EndpointName=endpoint_name)
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=model_name)

Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.
