Housing Price Prediction with Amazon SageMaker Autopilot




Using Autopilot to Predict House Prices in California

Kernel Python 3 (Data Science) works well with this notebook. You will have the best experience running this within SageMaker Studio.


Contents

  1. Introduction

  2. Setup

  3. Prepare Training Data

  4. Train

  5. Autopilot Results

  6. Evaluate Using Test Data

  7. Cleanup


Introduction

Amazon SageMaker Autopilot is an automated machine learning (commonly referred to as AutoML) solution for tabular datasets. You can use SageMaker Autopilot in different ways: on autopilot (without any human input) or with human guidance; without code through SageMaker Studio; or scripted using the AWS SDKs. This notebook uses the AWS SDKs to create and deploy a machine learning model without doing any manual feature engineering. We will also explore the auto-generated feature importance report.

Predicting housing prices is a classic regression problem in ML. The data pertain to the houses found in a given California district (as a group) and include summary statistics about them based on census data from 1990. The dataset variables are easily understandable, and the columns are as follows:

  • longitude

  • latitude

  • housingMedianAge

  • totalRooms

  • totalBedrooms

  • population

  • households

  • medianIncome

  • medianHouseValue (target)

What we’re going to predict is the median house value for a district. We will let Autopilot perform feature engineering, model selection, and model tuning, and give us the best candidate model, ready to use for inference.


Setup

This notebook was created and tested on a ml.t3.medium notebook instance.

Let’s start by specifying:

  • The S3 bucket and prefix to use for training and model data. This should be within the same region as the Notebook Instance, training, and hosting. The following code will use SageMaker’s default S3 bucket (and create one if it doesn’t exist).

  • The IAM role ARN used to give training and hosting access to your data. See the documentation for how to create these. The following code will use the SageMaker execution role.

[ ]:
import sagemaker
import boto3
from sagemaker import get_execution_role

region = boto3.Session().region_name

session = sagemaker.Session()

# You can modify the following to use a bucket of your choosing
bucket = session.default_bucket()
prefix = "sagemaker/DEMO-autopilot-housing"

role = get_execution_role()

# This is the client we will use to interact with SageMaker Autopilot
sm = boto3.Session().client(service_name="sagemaker", region_name=region)

Next, we’ll import the Python libraries we’ll need for the remainder of the exercise.

[ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import io
from urllib.parse import urlparse
import time

Prepare Training Data

We will use the publicly available California housing dataset. The target variable is the median house value for California districts.

This dataset was obtained from the StatLib repository (http://lib.stat.cmu.edu/datasets/) and derived from the 1990 U.S. census, using one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people).

[ ]:
s3 = boto3.client("s3")
s3.download_file(
    f"sagemaker-example-files-prod-{region}",
    "datasets/tabular/california_housing/cal_housing.tgz",
    "cal_housing.tgz",
)
[ ]:
!tar -zxf cal_housing.tgz -o
[ ]:
columns = [
    "longitude",
    "latitude",
    "housingMedianAge",
    "totalRooms",
    "totalBedrooms",
    "population",
    "households",
    "medianIncome",
    "medianHouseValue",
]

target = "medianHouseValue"

cal_housing_df = pd.read_csv("CaliforniaHousing/cal_housing.data", names=columns, header=None)

Before starting an Autopilot job, it is good practice to inspect the data to ensure there are no obvious issues with it.

[ ]:
pd.set_option("display.max_columns", 500)
cal_housing_df

Notice that the various columns have values in different ranges. For example, values under the target column medianHouseValue are orders of magnitude higher than other columns. This difference in scale often causes issues when training an ML model, which is why it’s a common feature engineering practice to normalize or standardize values (depending on the nature of the numeric distribution and presence of outliers).

However, because Autopilot handles feature engineering automatically (among other things), we’re not going to make any analysis or transformations ourselves.
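That said, a quick sanity check never hurts. A minimal sketch like the following (optional, and using only standard pandas calls) confirms the differing column scales and that no values are missing at this point:

[ ]:
# Optional sanity check: summary statistics make the scale differences
# between columns obvious (e.g., medianHouseValue vs. latitude), and
# isna().sum() confirms there are no missing values yet.
print(cal_housing_df.isna().sum())
cal_housing_df.describe()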

To also illustrate how Autopilot handles data issues, such as missing values, let’s introduce some random empty values in our dataset for the housing median age column.

[ ]:
# Take a small, random sample of the dataset
cal_housing_df_sample = cal_housing_df.sample(frac=0.01, random_state=100)

# Set housingMedianAge to NaN for the sampled rows. Indexing the
# DataFrame directly with .loc avoids pandas chained-assignment issues.
cal_housing_df.loc[cal_housing_df_sample.index, "housingMedianAge"] = np.nan

Now our dataset should contain empty values for ~1% of the rows under the housingMedianAge column.

[ ]:
isna_sum = cal_housing_df["housingMedianAge"].isna().sum()
print("{0} values removed from housingMedianAge column".format(isna_sum))
[ ]:
# Split data for training and testing (80/20 split)

train_data = cal_housing_df.sample(frac=0.8, random_state=200)

test_data = cal_housing_df.drop(train_data.index)

test_data_no_target = test_data.drop(columns=[target])
[ ]:
# Upload the dataset to S3

train_file = "train_data.csv"
train_data.to_csv(train_file, index=False, header=True)
train_data_s3_path = session.upload_data(path=train_file, key_prefix=prefix + "/train")
print("Train data uploaded to: " + train_data_s3_path)

test_file = "test_data_no_target.csv"
test_data_no_target.to_csv(test_file, index=False, header=False)
test_data_s3_path = session.upload_data(path=test_file, key_prefix=prefix + "/test")
print("Test data uploaded to: " + test_data_s3_path)

Setting up the SageMaker Autopilot Job

After uploading the dataset to Amazon S3, you can invoke Autopilot to find the best ML pipeline to train a model on this dataset.

The required inputs for invoking an Autopilot job are:

  • Amazon S3 locations for the input dataset and for all output artifacts

  • Name of the target column in the dataset for predictions

  • An IAM role

Currently, Autopilot supports only tabular datasets in CSV format. Either all files should have a header row, or the first file of the dataset, when sorted in alphabetical/lexical order by name, is expected to have a header row.

[ ]:
input_data_config = [
    {
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://{}/{}/train".format(bucket, prefix),
            }
        },
        "TargetAttributeName": target,
    }
]

job_config = {"CompletionCriteria": {"MaxCandidates": 10}}


output_data_config = {"S3OutputPath": "s3://{}/{}/output".format(bucket, prefix)}

You can also specify the type of problem you want to solve with your dataset (Regression, MulticlassClassification, or BinaryClassification). If you are not sure, SageMaker Autopilot will infer the problem type based on statistics of the target column (the column you want to predict).

Because the target attribute, medianHouseValue, is a continuous numeric variable, Autopilot will infer the problem type as regression.
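If you would rather pin the problem type yourself, the create_auto_ml_job API accepts a ProblemType parameter together with an AutoMLJobObjective. A minimal, commented-out sketch for reference; this notebook lets Autopilot infer the type instead:

[ ]:
# Optional: set the problem type and objective metric explicitly instead
# of letting Autopilot infer them. Left commented out because this demo
# relies on Autopilot's problem-type inference.
# sm.create_auto_ml_job(
#     AutoMLJobName=auto_ml_job_name,
#     InputDataConfig=input_data_config,
#     OutputDataConfig=output_data_config,
#     AutoMLJobConfig=job_config,
#     ProblemType="Regression",
#     AutoMLJobObjective={"MetricName": "MSE"},
#     RoleArn=role,
# )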

You have the option to limit the running time of a SageMaker Autopilot job by providing either the maximum number of pipeline evaluations (called Candidates, because each pipeline evaluation produces a candidate model) or the total time allocated to the overall Autopilot job. Under default settings, this job may take several hours to run, and the duration varies between runs because of the exploratory nature of the process Autopilot uses to find optimal training parameters.

For this demo, we limit the number of candidates to 10 so that the job finishes in under 1 hour.
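A time budget can be expressed through CompletionCriteria as well. A minimal sketch for reference (the runtime values below are illustrative, not the configuration this demo uses):

[ ]:
# Alternative completion criteria: bound the total job runtime and each
# training job instead of the candidate count. Kept commented out so it
# does not override the MaxCandidates configuration used by this demo.
# job_config = {
#     "CompletionCriteria": {
#         "MaxAutoMLJobRuntimeInSeconds": 3600,
#         "MaxRuntimePerTrainingJobInSeconds": 600,
#     }
# }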

Finally, you also have the option to deploy the winning model to a SageMaker endpoint automatically upon completion. In this case, we will not deploy the endpoint. We’ll run a batch prediction job later instead to evaluate our model.

For guidance on how to configure the job parameters, check out the SDK documentation.

Launching the SageMaker Autopilot Job

You can now launch the Autopilot job by calling the create_auto_ml_job API.

[ ]:
from time import gmtime, strftime, sleep

timestamp_suffix = strftime("%Y%m%d-%H-%M", gmtime())

auto_ml_job_name = "automl-housing-" + timestamp_suffix
print("AutoMLJobName: " + auto_ml_job_name)

sm.create_auto_ml_job(
    AutoMLJobName=auto_ml_job_name,
    InputDataConfig=input_data_config,
    OutputDataConfig=output_data_config,
    AutoMLJobConfig=job_config,
    # Uncomment to automatically deploy an endpoint
    # ModelDeployConfig={
    #'AutoGenerateEndpointName': True,
    #'EndpointName': 'autopilot-DEMO-housing-' + timestamp_suffix
    # },
    RoleArn=role,
)

The Autopilot job will now be performing the following steps:

  • Data analysis

  • Feature engineering

  • Model selection

  • Model tuning (hyperparameter optimization)

  • Model feature importance (SageMaker Clarify)

Tracking SageMaker Autopilot job progress

[ ]:
print("JobStatus - Secondary Status")
print("------------------------------")


describe_response = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
print(describe_response["AutoMLJobStatus"] + " - " + describe_response["AutoMLJobSecondaryStatus"])
job_run_status = describe_response["AutoMLJobStatus"]

while job_run_status not in ("Failed", "Completed", "Stopped"):
    describe_response = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
    job_run_status = describe_response["AutoMLJobStatus"]

    print(
        describe_response["AutoMLJobStatus"] + " - " + describe_response["AutoMLJobSecondaryStatus"]
    )
    sleep(60)

Autopilot Results

Now you can use the describe_auto_ml_job API to look up the best candidate selected by the SageMaker Autopilot job.

[ ]:
best_candidate = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)["BestCandidate"]
best_candidate_name = best_candidate["CandidateName"]

print("\n")
print("CandidateName: " + best_candidate_name)
print(
    "FinalAutoMLJobObjectiveMetricName: "
    + best_candidate["FinalAutoMLJobObjectiveMetric"]["MetricName"]
)
print(
    "FinalAutoMLJobObjectiveMetricValue: "
    + str(best_candidate["FinalAutoMLJobObjectiveMetric"]["Value"])
)
print("\nBest candidate details:: " + str(best_candidate))

If you are curious about the performance of the other algorithms Autopilot explored, you can enumerate the candidates via the list_candidates_for_auto_ml_job API call.

[ ]:
sm_dict = sm.list_candidates_for_auto_ml_job(AutoMLJobName=auto_ml_job_name)
for item in sm_dict["Candidates"]:
    print(item["CandidateName"], item["FinalAutoMLJobObjectiveMetric"])
    print(item["InferenceContainers"][1]["Image"], "\n")

Autopilot automatically generates two executable Jupyter Notebooks:

  • SageMakerAutopilotDataExplorationNotebook.ipynb

  • SageMakerAutopilotCandidateDefinitionNotebook.ipynb

These notebooks are stored in S3. Let us download them onto our SageMaker Notebook instance, so we can explore them.

[ ]:
candidate_nbk = describe_response["AutoMLJobArtifacts"]["CandidateDefinitionNotebookLocation"]
data_explore_nbk = describe_response["AutoMLJobArtifacts"]["DataExplorationNotebookLocation"]


def split_s3_path(s3_path):
    path_parts = s3_path.replace("s3://", "").split("/")
    bucket = path_parts.pop(0)
    key = "/".join(path_parts)
    return bucket, key


s3_bucket, candidate_nbk_key = split_s3_path(candidate_nbk)
_, data_explore_nbk_key = split_s3_path(data_explore_nbk)

print(s3_bucket, candidate_nbk_key, data_explore_nbk_key)

session.download_data(path="./", bucket=s3_bucket, key_prefix=candidate_nbk_key)

session.download_data(path="./", bucket=s3_bucket, key_prefix=data_explore_nbk_key)

Take some time to inspect the Data Exploration notebook. Check the "Column Analysis and Descriptive Statistics" section to see the analysis carried out by Autopilot.

Now, take some time to inspect the Candidate Generation notebook. Check the "Generated Candidates" section to see the different algorithms and data transformation strategies used by Autopilot.


Autopilot also automatically generates a feature importance report.

[ ]:
explainability_prefix = best_candidate["CandidateProperties"]["CandidateArtifactLocations"][
    "Explainability"
]

s3_bucket, explainability_dir = split_s3_path(explainability_prefix)

session.download_data(path="./", bucket=s3_bucket, key_prefix=explainability_dir)

The preceding code downloads a directory to our local environment. In that directory (the prefix is the Autopilot job name; the suffix is automatically generated), you should see the SageMaker Clarify artifacts. SageMaker Clarify provides greater visibility into training data and models to identify and limit bias and explain predictions. The following code will open the feature importance report:

[ ]:
from IPython.display import IFrame

# Fetch the auto-generated directory name for the SageMaker Clarify artifacts
dir_name = (
    session.list_s3_files(bucket=s3_bucket, key_prefix=explainability_dir)[0]
    .replace(explainability_dir, "")
    .split("/")[1]
)

# Display HTML report
IFrame(src=f"{dir_name}/report.html", width=700, height=600)

Your results may vary, but you are likely to see latitude and longitude (i.e., location) at the top, along with population size and median income, which are stronger predictors of housing prices than the other features in the dataset.

In SageMaker Studio, you can also navigate to the SageMaker resources tab, click on Experiments and trials, and find your Autopilot experiment. Double-click the experiment name to list all trials, and from there double-click a specific trial to see its details, including charts and metrics.


Evaluate Model Using Test Dataset

To evaluate the model on previously unseen data, we will test it against the test dataset we prepared earlier. For that, we don’t need to deploy the model to an endpoint; we can simply run a batch transform job to get predictions for our unlabeled test dataset.

Set up Transform Job

[ ]:
from sagemaker import AutoML

automl = AutoML.attach(auto_ml_job_name=auto_ml_job_name)

s3_transform_output_path = "s3://{}/{}/inference-results/".format(s3_bucket, prefix)

model_name = "{0}-model".format(best_candidate_name)

model = automl.create_model(
    name=model_name,
    candidate=best_candidate,
)

output_path = s3_transform_output_path + best_candidate_name + "/"

transformer = model.transformer(
    instance_count=1,
    instance_type="ml.m5.xlarge",
    assemble_with="Line",
    strategy="SingleRecord",
    output_path=output_path,
    env={"SAGEMAKER_MODEL_SERVER_TIMEOUT": "100", "SAGEMAKER_MODEL_SERVER_WORKERS": "1"},
)

Launch Transform Job

[ ]:
transformer.transform(
    data=test_data_s3_path,
    split_type="Line",
    content_type="text/csv",
    wait=False,
    model_client_config={"InvocationsTimeoutInSeconds": 80, "InvocationsMaxRetries": 1},
)

print("Starting transform job {}".format(transformer._current_job_name))

Track Transform Job Status

[ ]:
## Wait for the transform job to finish
pending_complete = True
job_name = transformer._current_job_name

while pending_complete:
    description = sm.describe_transform_job(TransformJobName=job_name)
    if description["TransformJobStatus"] in ("Failed", "Completed"):
        pending_complete = False
    else:
        print("{} transform job is running.".format(job_name))
        time.sleep(60)

print("\nCompleted.")

Evaluate the Inference Results

The transform job will now have generated a CSV file with inference results for the test dataset. We will compare those results with the true test labels to see how the model performs on real data.

[ ]:
def get_csv_from_s3(s3uri, file_name):
    parsed_url = urlparse(s3uri)
    bucket_name = parsed_url.netloc
    prefix = parsed_url.path[1:].strip("/")
    s3 = boto3.resource("s3")
    obj = s3.Object(bucket_name, "{}/{}".format(prefix, file_name))
    return obj.get()["Body"].read().decode("utf-8")


job_status = sm.describe_transform_job(TransformJobName=job_name)["TransformJobStatus"]

if job_status == "Completed":
    pred_csv = get_csv_from_s3(transformer.output_path, "{}.out".format(test_file))
    predictions = pd.read_csv(io.StringIO(pred_csv), header=None)
[ ]:
from sklearn.metrics import mean_squared_error
from math import sqrt

labels_df = test_data[target]
mse = mean_squared_error(labels_df, predictions)
rmse = sqrt(mse)

print("MSE: {0}\nRMSE: {1}".format(mse, rmse))

Cleanup

The Autopilot job creates many underlying artifacts, such as dataset splits, preprocessing scripts, and preprocessed data. The following code deletes them.

[ ]:
s3 = boto3.resource("s3")
s3_bucket = s3.Bucket(bucket)

print(s3_bucket)
job_outputs_prefix = "{}/output/{}".format(prefix, auto_ml_job_name)
print(job_outputs_prefix)

# Delete S3 objects
s3_bucket.objects.filter(Prefix=job_outputs_prefix).delete()

We then delete all the experiment and model resources created by the Autopilot job.

[ ]:
def cleanup_experiment_resources(experiment_name):
    trials = sm.list_trials(ExperimentName=experiment_name)["TrialSummaries"]
    print("TrialNames:")
    for trial in trials:
        trial_name = trial["TrialName"]
        print(f"\n{trial_name}")

        components_in_trial = sm.list_trial_components(TrialName=trial_name)
        print("\tTrialComponentNames:")
        for component in components_in_trial["TrialComponentSummaries"]:
            component_name = component["TrialComponentName"]
            print(f"\t{component_name}")
            sm.disassociate_trial_component(TrialComponentName=component_name, TrialName=trial_name)
            try:
                # comment out to keep trial components
                sm.delete_trial_component(TrialComponentName=component_name)
            except Exception:
                # component is associated with another trial
                continue
            # to prevent throttling
            time.sleep(5)
        sm.delete_trial(TrialName=trial_name)
    sm.delete_experiment(ExperimentName=experiment_name)
    print(f"\nExperiment {experiment_name} deleted")


def cleanup_autopilot_models(autopilot_job_name):
    print("{0}:\n".format(autopilot_job_name))
    response = sm.list_models(NameContains=autopilot_job_name)

    for model in response["Models"]:
        model_name = model["ModelName"]
        print(f"\t{model_name}")
        sm.delete_model(ModelName=model_name)
        # to prevent throttling
        time.sleep(3)
[ ]:
cleanup_experiment_resources("{0}-aws-auto-ml-job".format(auto_ml_job_name))
[ ]:
cleanup_autopilot_models(auto_ml_job_name)

Finally, the following code, when uncommented, will delete the local files used in this demo.

[ ]:
import shutil
import glob
import os


def delete_local_files():
    base_path = ""
    dir_list = glob.iglob(os.path.join(base_path, "{0}*".format(auto_ml_job_name)))

    for path in dir_list:
        if os.path.isdir(path):
            shutil.rmtree(path)

    if os.path.exists("CaliforniaHousing"):
        shutil.rmtree("CaliforniaHousing")

    if os.path.exists("cal_housing.tgz"):
        os.remove("cal_housing.tgz")

    if os.path.exists("SageMakerAutopilotCandidateDefinitionNotebook.ipynb"):
        os.remove("SageMakerAutopilotCandidateDefinitionNotebook.ipynb")

    if os.path.exists("SageMakerAutopilotDataExplorationNotebook.ipynb"):
        os.remove("SageMakerAutopilotDataExplorationNotebook.ipynb")

    if os.path.exists("test_data_no_target.csv"):
        os.remove("test_data_no_target.csv")

    if os.path.exists("test_data.csv"):
        os.remove("test_data.csv")

    if os.path.exists("train_data.csv"):
        os.remove("train_data.csv")


## UNCOMMENT TO CLEAN UP LOCAL FILES
# delete_local_files()

Note: If you enabled automatic endpoint creation, you will need to delete the endpoint manually.
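A minimal sketch of that cleanup, assuming the endpoint name from the commented-out ModelDeployConfig earlier in this notebook:

[ ]:
# Only needed if you enabled automatic endpoint deployment earlier.
# endpoint_name = "autopilot-DEMO-housing-" + timestamp_suffix
# endpoint_config_name = sm.describe_endpoint(EndpointName=endpoint_name)["EndpointConfigName"]
# sm.delete_endpoint(EndpointName=endpoint_name)
# sm.delete_endpoint_config(EndpointConfigName=endpoint_config_name)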
