Regression with Amazon SageMaker Autopilot (Parquet input)

This is the accompanying notebook for the blog post Run AutoML experiments with large parquet datasets using Amazon SageMaker Autopilot. The example here is almost the same as Regression with Amazon SageMaker XGBoost algorithm (Parquet).

This notebook tackles the exact same problem with the same solution, but has been modified to train on Parquet input with SageMaker Autopilot. The original notebook provides details of the dataset and the machine learning use case.

This notebook was tested in Amazon SageMaker Studio on an ml.t3.medium instance with the Python 3 (Data Science) kernel.

[ ]:
! pip install --upgrade boto3
[ ]:
import os
import boto3
import re
import sagemaker
from sagemaker import get_execution_role

role = get_execution_role()
region = boto3.Session().region_name

# S3 bucket for saving code and model artifacts.
# Feel free to specify a different bucket here if you wish.
bucket = sagemaker.Session().default_bucket()
prefix = "sagemaker/DEMO-automl-parquet"

We will use the PyArrow library to store the Abalone dataset in the Parquet format.

[ ]:
import pyarrow
[ ]:
%%time

import numpy as np
import pandas as pd

s3 = boto3.client("s3")
# Download the dataset and load into a pandas dataframe
FILE_NAME = "abalone.csv"
s3.download_file("sagemaker-sample-files", "datasets/tabular/uci_abalone/abalone.csv", FILE_NAME)

feature_names = [
    "Sex",
    "Length",
    "Diameter",
    "Height",
    "Whole weight",
    "Shucked weight",
    "Viscera weight",
    "Shell weight",
    "Rings",
]
data = pd.read_csv(FILE_NAME, header=None, names=feature_names)

data.to_parquet("abalone.parquet")
[ ]:
%%time
sagemaker.Session().upload_data("abalone.parquet", bucket=bucket, key_prefix=prefix)
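The upload lands at `s3://{bucket}/{prefix}/abalone.parquet`, which is the URI the AutoML job's `InputDataConfig` points at. A small helper (illustrative only; the bucket name below is a placeholder) makes the construction explicit:

```python
def s3_uri(bucket: str, prefix: str, key: str) -> str:
    """Build the S3 URI for an object uploaded under a key prefix."""
    return f"s3://{bucket}/{prefix}/{key}"

# Placeholder bucket name -- substitute your own bucket and prefix
print(s3_uri("my-bucket", "sagemaker/DEMO-automl-parquet", "abalone.parquet"))
# -> s3://my-bucket/sagemaker/DEMO-automl-parquet/abalone.parquet
```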

After setting the parameters, we kick off training and poll for status until the job completes, which in this example takes under an hour.

[ ]:
%%time
import time
from time import gmtime, strftime

job_name = "autopilot-parquet-" + strftime("%m-%d-%H-%M", gmtime())
print("AutoML job:", job_name)

create_auto_ml_job_params = {
    "AutoMLJobConfig": {
        "CompletionCriteria": {
            "MaxCandidates": 50,
        }
    },
    "AutoMLJobName": job_name,
    "InputDataConfig": [
        {
            "ContentType": "x-application/vnd.amazon+parquet",
            "CompressionType": "None",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": f"s3://{bucket}/{prefix}/abalone.parquet",
                }
            },
            "TargetAttributeName": "Rings",
        }
    ],
    "OutputDataConfig": {"S3OutputPath": f"s3://{bucket}/{prefix}/output"},
    "RoleArn": role,
}

client = boto3.client("sagemaker", region_name=region)
client.create_auto_ml_job(**create_auto_ml_job_params)

response = client.describe_auto_ml_job(AutoMLJobName=job_name)
status = response["AutoMLJobStatus"]
secondary_status = response["AutoMLJobSecondaryStatus"]
print(f"{status} - {secondary_status}")

while status not in ("Completed", "Failed", "Stopped"):
    time.sleep(60)
    response = client.describe_auto_ml_job(AutoMLJobName=job_name)
    status = response["AutoMLJobStatus"]
    secondary_status = response["AutoMLJobSecondaryStatus"]
    print(f"{status} - {secondary_status}")

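Once the job reaches `Completed`, the same `describe_auto_ml_job` response carries a `BestCandidate` entry. A small helper (the function name is our own; the dictionary shape follows the documented `DescribeAutoMLJob` response) pulls out the candidate name and its objective metric, shown here against a stubbed response rather than a live call:

```python
def best_candidate_summary(response: dict) -> dict:
    """Extract the best candidate's name and objective metric from a
    DescribeAutoMLJob response (assumes the job completed successfully)."""
    best = response["BestCandidate"]
    metric = best["FinalAutoMLJobObjectiveMetric"]
    return {
        "name": best["CandidateName"],
        "metric": metric["MetricName"],
        "value": metric["Value"],
    }

# Stubbed response with the documented shape (values are illustrative)
stub = {
    "AutoMLJobStatus": "Completed",
    "BestCandidate": {
        "CandidateName": "autopilot-parquet-candidate-1",
        "FinalAutoMLJobObjectiveMetric": {"MetricName": "MSE", "Value": 4.2},
    },
}
print(best_candidate_summary(stub))
```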
Please refer to other Autopilot example notebooks, such as Direct Marketing with Amazon SageMaker Autopilot and Customer Churn Prediction with Amazon SageMaker Autopilot, to see how to investigate the details of each trial, deploy the best candidate, and run inference.