Run a SageMaker Experiment with MNIST Handwritten Digits Classification

This demo shows how you can use the SageMaker Experiments Python SDK to organize, track, compare, and evaluate your machine learning (ML) model training experiments.

You can track artifacts for experiments, including data sets, algorithms, hyperparameters, and metrics. Experiments executed on SageMaker such as SageMaker Autopilot jobs and training jobs are automatically tracked. You can also track artifacts for additional steps within an ML workflow that come before or after model training, such as data pre-processing or post-training model evaluation.

The APIs also let you search and browse your current and past experiments, compare experiments, and identify best-performing models.

We demonstrate these capabilities through an MNIST handwritten digits classification example. The experiment is organized as follows:

  1. Download and prepare the MNIST dataset.

  2. Train a Convolutional Neural Network (CNN) Model. Tune the hyperparameter that configures the number of hidden channels in the model. Track the parameter configurations and resulting model accuracy using the SageMaker Experiments Python SDK.

  3. Use the search and analytics capabilities of the SDK to search, compare, and evaluate the performance of all model versions generated from model tuning in Step 2.

  4. Trace the complete lineage of a model version: the collection of all the data pre-processing and training configurations and inputs that went into creating that model version.

Make sure you select the Python 3 (Data Science) kernel in Studio, or conda_pytorch_p36 in a notebook instance.

Runtime

This notebook takes approximately 25 minutes to run.

Contents

  1. Install modules

  2. Setup

  3. Download the dataset

  4. Step 1: Set up the Experiment

  5. Step 2: Track Experiment

  6. Deploy an endpoint for the best training job / trial component

  7. Cleanup

  8. Contact

Install modules

[2]:
import sys

Install the SageMaker Experiments Python SDK

[3]:
!{sys.executable} -m pip install sagemaker-experiments==0.1.35
Collecting sagemaker-experiments==0.1.35
  Downloading sagemaker_experiments-0.1.35-py3-none-any.whl (42 kB)
     |████████████████████████████████| 42 kB 1.2 MB/s
Requirement already satisfied: boto3>=1.16.27 in /opt/conda/lib/python3.6/site-packages (from sagemaker-experiments==0.1.35) (1.20.7)
Requirement already satisfied: botocore<1.24.0,>=1.23.7 in /opt/conda/lib/python3.6/site-packages (from boto3>=1.16.27->sagemaker-experiments==0.1.35) (1.23.7)
Requirement already satisfied: s3transfer<0.6.0,>=0.5.0 in /opt/conda/lib/python3.6/site-packages (from boto3>=1.16.27->sagemaker-experiments==0.1.35) (0.5.2)
Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /opt/conda/lib/python3.6/site-packages (from boto3>=1.16.27->sagemaker-experiments==0.1.35) (0.10.0)
Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /opt/conda/lib/python3.6/site-packages (from botocore<1.24.0,>=1.23.7->boto3>=1.16.27->sagemaker-experiments==0.1.35) (2.8.1)
Requirement already satisfied: urllib3<1.27,>=1.25.4 in /opt/conda/lib/python3.6/site-packages (from botocore<1.24.0,>=1.23.7->boto3>=1.16.27->sagemaker-experiments==0.1.35) (1.25.11)
Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.6/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.24.0,>=1.23.7->boto3>=1.16.27->sagemaker-experiments==0.1.35) (1.15.0)
Installing collected packages: sagemaker-experiments
  Attempting uninstall: sagemaker-experiments
    Found existing installation: sagemaker-experiments 0.1.7
    Uninstalling sagemaker-experiments-0.1.7:
      Successfully uninstalled sagemaker-experiments-0.1.7
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
sagemaker-pytorch-training 1.3.3 requires sagemaker-experiments==0.1.7; python_version >= "3.6", but you have sagemaker-experiments 0.1.35 which is incompatible.
Successfully installed sagemaker-experiments-0.1.35

Install PyTorch

[4]:
# PyTorch version needs to be the same in both the notebook instance and the training job container
# https://github.com/pytorch/pytorch/issues/25214
!{sys.executable} -m pip install torch==1.1.0
!{sys.executable} -m pip install torchvision==0.2.2
!{sys.executable} -m pip install pillow==6.2.2
!{sys.executable} -m pip install --upgrade sagemaker
Collecting torch==1.1.0
  Downloading torch-1.1.0-cp36-cp36m-manylinux1_x86_64.whl (676.9 MB)
     |████████████████████████████████| 676.9 MB 1.9 kB/s
Requirement already satisfied: numpy in /opt/conda/lib/python3.6/site-packages (from torch==1.1.0) (1.19.5)
Installing collected packages: torch
  Attempting uninstall: torch
    Found existing installation: torch 1.4.0
    Uninstalling torch-1.4.0:
      Successfully uninstalled torch-1.4.0
Successfully installed torch-1.1.0
Collecting torchvision==0.2.2
  Downloading torchvision-0.2.2-py2.py3-none-any.whl (64 kB)
     |████████████████████████████████| 64 kB 2.2 MB/s
Collecting tqdm==4.19.9
  Downloading tqdm-4.19.9-py2.py3-none-any.whl (52 kB)
     |████████████████████████████████| 52 kB 2.0 MB/s
Requirement already satisfied: numpy in /opt/conda/lib/python3.6/site-packages (from torchvision==0.2.2) (1.19.5)
Requirement already satisfied: torch in /opt/conda/lib/python3.6/site-packages (from torchvision==0.2.2) (1.1.0)
Requirement already satisfied: six in /opt/conda/lib/python3.6/site-packages (from torchvision==0.2.2) (1.15.0)
Requirement already satisfied: pillow>=4.1.1 in /opt/conda/lib/python3.6/site-packages (from torchvision==0.2.2) (8.1.2)
Installing collected packages: tqdm, torchvision
  Attempting uninstall: tqdm
    Found existing installation: tqdm 4.56.0
    Uninstalling tqdm-4.56.0:
      Successfully uninstalled tqdm-4.56.0
  Attempting uninstall: torchvision
    Found existing installation: torchvision 0.5.0+cpu
    Uninstalling torchvision-0.5.0+cpu:
      Successfully uninstalled torchvision-0.5.0+cpu
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
spacy 3.0.5 requires tqdm<5.0.0,>=4.38.0, but you have tqdm 4.19.9 which is incompatible.
papermill 2.3.4 requires tqdm>=4.32.2, but you have tqdm 4.19.9 which is incompatible.
Successfully installed torchvision-0.2.2 tqdm-4.19.9
Collecting pillow==6.2.2
  Downloading Pillow-6.2.2-cp36-cp36m-manylinux1_x86_64.whl (2.1 MB)
     |████████████████████████████████| 2.1 MB 7.0 MB/s
Installing collected packages: pillow
  Attempting uninstall: pillow
    Found existing installation: Pillow 8.1.2
    Uninstalling Pillow-8.1.2:
      Successfully uninstalled Pillow-8.1.2
Successfully installed pillow-6.2.2
Requirement already satisfied: sagemaker in /opt/conda/lib/python3.6/site-packages (2.69.1.dev0)
Collecting sagemaker
  Downloading sagemaker-2.86.2.tar.gz (521 kB)
     |████████████████████████████████| 521 kB 2.6 MB/s
Requirement already satisfied: attrs==20.3.0 in /opt/conda/lib/python3.6/site-packages (from sagemaker) (20.3.0)
Collecting boto3>=1.20.21
  Downloading boto3-1.21.42-py3-none-any.whl (132 kB)
     |████████████████████████████████| 132 kB 72.2 MB/s
Requirement already satisfied: google-pasta in /opt/conda/lib/python3.6/site-packages (from sagemaker) (0.2.0)
Requirement already satisfied: numpy>=1.9.0 in /opt/conda/lib/python3.6/site-packages (from sagemaker) (1.19.5)
Requirement already satisfied: protobuf>=3.1 in /opt/conda/lib/python3.6/site-packages (from sagemaker) (3.15.5)
Requirement already satisfied: protobuf3-to-dict>=0.1.5 in /opt/conda/lib/python3.6/site-packages (from sagemaker) (0.1.5)
Requirement already satisfied: smdebug_rulesconfig==1.0.1 in /opt/conda/lib/python3.6/site-packages (from sagemaker) (1.0.1)
Requirement already satisfied: importlib-metadata>=1.4.0 in /opt/conda/lib/python3.6/site-packages (from sagemaker) (3.7.2)
Requirement already satisfied: packaging>=20.0 in /opt/conda/lib/python3.6/site-packages (from sagemaker) (20.9)
Requirement already satisfied: pandas in /opt/conda/lib/python3.6/site-packages (from sagemaker) (0.25.0)
Requirement already satisfied: pathos in /opt/conda/lib/python3.6/site-packages (from sagemaker) (0.2.8)
Collecting botocore<1.25.0,>=1.24.42
  Downloading botocore-1.24.42-py3-none-any.whl (8.7 MB)
     |████████████████████████████████| 8.7 MB 68.3 MB/s
Requirement already satisfied: jmespath<2.0.0,>=0.7.1 in /opt/conda/lib/python3.6/site-packages (from boto3>=1.20.21->sagemaker) (0.10.0)
Requirement already satisfied: s3transfer<0.6.0,>=0.5.0 in /opt/conda/lib/python3.6/site-packages (from boto3>=1.20.21->sagemaker) (0.5.2)
Requirement already satisfied: urllib3<1.27,>=1.25.4 in /opt/conda/lib/python3.6/site-packages (from botocore<1.25.0,>=1.24.42->boto3>=1.20.21->sagemaker) (1.25.11)
Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /opt/conda/lib/python3.6/site-packages (from botocore<1.25.0,>=1.24.42->boto3>=1.20.21->sagemaker) (2.8.1)
Requirement already satisfied: zipp>=0.5 in /opt/conda/lib/python3.6/site-packages (from importlib-metadata>=1.4.0->sagemaker) (3.4.1)
Requirement already satisfied: typing-extensions>=3.6.4 in /opt/conda/lib/python3.6/site-packages (from importlib-metadata>=1.4.0->sagemaker) (3.7.4.3)
Requirement already satisfied: pyparsing>=2.0.2 in /opt/conda/lib/python3.6/site-packages (from packaging>=20.0->sagemaker) (2.4.7)
Requirement already satisfied: six>=1.9 in /opt/conda/lib/python3.6/site-packages (from protobuf>=3.1->sagemaker) (1.15.0)
Requirement already satisfied: pytz>=2017.2 in /opt/conda/lib/python3.6/site-packages (from pandas->sagemaker) (2021.1)
Requirement already satisfied: dill>=0.3.4 in /opt/conda/lib/python3.6/site-packages (from pathos->sagemaker) (0.3.4)
Requirement already satisfied: pox>=0.3.0 in /opt/conda/lib/python3.6/site-packages (from pathos->sagemaker) (0.3.0)
Requirement already satisfied: multiprocess>=0.70.12 in /opt/conda/lib/python3.6/site-packages (from pathos->sagemaker) (0.70.12.2)
Requirement already satisfied: ppft>=1.6.6.4 in /opt/conda/lib/python3.6/site-packages (from pathos->sagemaker) (1.6.6.4)
Building wheels for collected packages: sagemaker
  Building wheel for sagemaker (setup.py) ... - \ | / - \ | done
  Created wheel for sagemaker: filename=sagemaker-2.86.2-py2.py3-none-any.whl size=720848 sha256=cd25f5dc289c4a4e7a1b1e595b0668b954595f98e004a24f4c9f7dfc444fe275
  Stored in directory: /root/.cache/pip/wheels/59/43/38/ebab0cc66165586b93249bb62b88af317edd25ecd7885b496b
Successfully built sagemaker
Installing collected packages: botocore, boto3, sagemaker
  Attempting uninstall: botocore
    Found existing installation: botocore 1.23.7
    Uninstalling botocore-1.23.7:
      Successfully uninstalled botocore-1.23.7
  Attempting uninstall: boto3
    Found existing installation: boto3 1.20.7
    Uninstalling boto3-1.20.7:
      Successfully uninstalled boto3-1.20.7
  Attempting uninstall: sagemaker
    Found existing installation: sagemaker 2.69.1.dev0
    Uninstalling sagemaker-2.69.1.dev0:
      Successfully uninstalled sagemaker-2.69.1.dev0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
awscli 1.22.7 requires botocore==1.23.7, but you have botocore 1.24.42 which is incompatible.
Successfully installed boto3-1.21.42 botocore-1.24.42 sagemaker-2.86.2

Setup

[5]:
import time

import boto3
import numpy as np
import pandas as pd
from IPython.display import set_matplotlib_formats
from matplotlib import pyplot as plt
from torchvision import datasets, transforms

import sagemaker
from sagemaker import get_execution_role
from sagemaker.session import Session
from sagemaker.analytics import ExperimentAnalytics

from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent
from smexperiments.tracker import Tracker

set_matplotlib_formats("retina")
[6]:
sm_sess = sagemaker.Session()
sess = sm_sess.boto_session
sm = sm_sess.sagemaker_client
role = get_execution_role()

Download the dataset

We download the MNIST handwritten digits dataset, and then apply a transformation to each image.

[7]:
bucket = sm_sess.default_bucket()
prefix = "DEMO-mnist"
print("Using S3 location: s3://" + bucket + "/" + prefix + "/")

datasets.MNIST.urls = [
    "https://sagemaker-sample-files.s3.amazonaws.com/datasets/image/MNIST/train-images-idx3-ubyte.gz",
    "https://sagemaker-sample-files.s3.amazonaws.com/datasets/image/MNIST/train-labels-idx1-ubyte.gz",
    "https://sagemaker-sample-files.s3.amazonaws.com/datasets/image/MNIST/t10k-images-idx3-ubyte.gz",
    "https://sagemaker-sample-files.s3.amazonaws.com/datasets/image/MNIST/t10k-labels-idx1-ubyte.gz",
]

# Download the dataset to the ./mnist folder, then load it and normalize each image
train_set = datasets.MNIST(
    "mnist",
    train=True,
    transform=transforms.Compose(
        [transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]
    ),
    download=True,
)

test_set = datasets.MNIST(
    "mnist",
    train=False,
    transform=transforms.Compose(
        [transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]
    ),
    download=False,
)
0.00B [00:00, ?B/s]
Using S3 location: s3://sagemaker-us-west-2-000000000000/DEMO-mnist/
Downloading https://sagemaker-sample-files.s3.amazonaws.com/datasets/image/MNIST/train-images-idx3-ubyte.gz to mnist/MNIST/raw/train-images-idx3-ubyte.gz
9.92MB [00:01, 8.73MB/s]
Extracting mnist/MNIST/raw/train-images-idx3-ubyte.gz
0.00B [00:00, ?B/s]
Downloading https://sagemaker-sample-files.s3.amazonaws.com/datasets/image/MNIST/train-labels-idx1-ubyte.gz to mnist/MNIST/raw/train-labels-idx1-ubyte.gz
32.8kB [00:00, 87.6kB/s]
0.00B [00:00, ?B/s]
Extracting mnist/MNIST/raw/train-labels-idx1-ubyte.gz
Downloading https://sagemaker-sample-files.s3.amazonaws.com/datasets/image/MNIST/t10k-images-idx3-ubyte.gz to mnist/MNIST/raw/t10k-images-idx3-ubyte.gz
1.65MB [00:00, 2.26MB/s]
0.00B [00:00, ?B/s]
Extracting mnist/MNIST/raw/t10k-images-idx3-ubyte.gz
Downloading https://sagemaker-sample-files.s3.amazonaws.com/datasets/image/MNIST/t10k-labels-idx1-ubyte.gz to mnist/MNIST/raw/t10k-labels-idx1-ubyte.gz
8.19kB [00:00, 25.9kB/s]
Extracting mnist/MNIST/raw/t10k-labels-idx1-ubyte.gz
Processing...
Done!
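
The two transforms in the cell above can be followed by hand for a single pixel. The sketch below is not part of the original notebook and uses plain Python instead of torchvision. It assumes the usual behavior of `ToTensor()` (scaling 8-bit pixel values to the range 0.0 to 1.0) followed by `Normalize(mean, std)` (subtracting the mean and dividing by the standard deviation). The helper name `normalize_pixel` is made up for illustration.

```python
# Mean and standard deviation passed to transforms.Normalize in the cell above.
MNIST_MEAN = 0.1307
MNIST_STD = 0.3081


def normalize_pixel(raw_value: int) -> float:
    """Map a raw 0-255 pixel value to the normalized value the model sees."""
    scaled = raw_value / 255.0  # what ToTensor() does to 8-bit pixel data
    return (scaled - MNIST_MEAN) / MNIST_STD  # what Normalize() does per channel


black = normalize_pixel(0)    # background pixels
white = normalize_pixel(255)  # brightest stroke pixels
print(f"black pixel -> {black:.4f}")
print(f"white pixel -> {white:.4f}")
```

Background pixels end up slightly below zero and stroke pixels well above it, so the inputs are roughly centered for training.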

View an example image from the dataset.

[8]:
plt.imshow(train_set.data[2].numpy())
[8]:
<matplotlib.image.AxesImage at 0x7f420f2722b0>
[Image: a sample handwritten digit from the MNIST training set]

Next, we upload the downloaded dataset (the local mnist folder) to S3.

[9]:
inputs = sagemaker.Session().upload_data(path="mnist", bucket=bucket, key_prefix=prefix)

Now let’s track the parameters from the data pre-processing step.

[10]:
with Tracker.create(display_name="Preprocessing", sagemaker_boto_client=sm) as tracker:
    tracker.log_parameters(
        {
            "normalization_mean": 0.1307,
            "normalization_std": 0.3081,
        }
    )
    # We can also log the S3 URI of the dataset we just uploaded
    tracker.log_input(name="mnist-dataset", media_type="s3/uri", value=inputs)

Step 1: Set up the Experiment

Create an experiment to track all the model training iterations. Experiments are a useful way to organize your data science work. You can create an experiment to organize all your model development work for (1) a business use case you are addressing (for example, an experiment named “customer churn prediction”), (2) a data science team that owns the experiment (for example, an experiment named “marketing analytics experiment”), or (3) a specific data science and ML project. Think of an experiment as a “folder” for organizing your “files”.

Create an Experiment

[11]:
mnist_experiment = Experiment.create(
    experiment_name=f"mnist-hand-written-digits-classification-{int(time.time())}",
    description="Classification of mnist hand-written digits",
    sagemaker_boto_client=sm,
)
print(mnist_experiment)
Experiment(sagemaker_boto_client=<botocore.client.SageMaker object at 0x7f421003aa58>,experiment_name='mnist-hand-written-digits-classification-1650241873',description='Classification of mnist hand-written digits',tags=None,experiment_arn='arn:aws:sagemaker:us-west-2:000000000000:experiment/mnist-hand-written-digits-classification-1650241873',response_metadata={'RequestId': 'c388a42f-c23f-47c1-836f-3fd10e059ada', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'c388a42f-c23f-47c1-836f-3fd10e059ada', 'content-type': 'application/x-amz-json-1.1', 'content-length': '123', 'date': 'Mon, 18 Apr 2022 00:31:13 GMT'}, 'RetryAttempts': 0})

Step 2: Track Experiment

Now create a Trial for each training run to track its inputs, parameters, and metrics.

While training the CNN model on SageMaker, we experiment with several values for the number of hidden channels in the model. We create a Trial to track each training job run. We also create a TrialComponent from the tracker we created before, and add it to each Trial. This enriches the Trial with the parameters we captured from the data pre-processing stage.
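
To make the relationship between trials and trial components concrete, here is a toy model of it in plain Python. It is only a sketch: `ToyTrial` and `ToyTrialComponent` are made-up stand-ins, not classes from the SageMaker Experiments SDK, and the trial names omit the timestamp suffix used in the notebook. The point it illustrates is that a single preprocessing component is shared by every trial, while each trial gets its own training component.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyTrialComponent:
    """Hypothetical stand-in for a trial component (one tracked step)."""
    display_name: str
    parameters: Dict[str, float] = field(default_factory=dict)


@dataclass
class ToyTrial:
    """Hypothetical stand-in for a trial (one training run and its steps)."""
    trial_name: str
    components: List[ToyTrialComponent] = field(default_factory=list)


# The preprocessing step is tracked once ...
preprocessing = ToyTrialComponent(
    "Preprocessing", {"normalization_mean": 0.1307, "normalization_std": 0.3081}
)

trials = []
for num_hidden_channel in [2, 5, 10, 20, 32]:
    trial = ToyTrial(f"cnn-training-job-{num_hidden_channel}-hidden-channels")
    # ... and the same object is attached to every trial (shared, not copied).
    trial.components.append(preprocessing)
    # Each trial also gets its own training step.
    trial.components.append(
        ToyTrialComponent("Training", {"hidden_channels": num_hidden_channel})
    )
    trials.append(trial)

for trial in trials:
    print(trial.trial_name, [c.display_name for c in trial.components])
```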

[12]:
from sagemaker.pytorch import PyTorch, PyTorchModel
[13]:
hidden_channel_trial_name_map = {}

If you want to run the following five training jobs in parallel, you may need to increase your resource limit. Here we run them sequentially.

[14]:
preprocessing_trial_component = tracker.trial_component
[15]:
for i, num_hidden_channel in enumerate([2, 5, 10, 20, 32]):
    # Create trial
    trial_name = f"cnn-training-job-{num_hidden_channel}-hidden-channels-{int(time.time())}"
    cnn_trial = Trial.create(
        trial_name=trial_name,
        experiment_name=mnist_experiment.experiment_name,
        sagemaker_boto_client=sm,
    )
    hidden_channel_trial_name_map[num_hidden_channel] = trial_name

    # Associate the preprocessing trial component with the current trial
    cnn_trial.add_trial_component(preprocessing_trial_component)

    # All input configurations, parameters, and metrics specified in
    # the estimator definition are automatically tracked
    estimator = PyTorch(
        py_version="py3",
        entry_point="./mnist.py",
        role=role,
        sagemaker_session=sagemaker.Session(sagemaker_client=sm),
        framework_version="1.1.0",
        instance_count=1,
        instance_type="ml.c4.xlarge",
        hyperparameters={
            "epochs": 2,
            "backend": "gloo",
            "hidden_channels": num_hidden_channel,
            "dropout": 0.2,
            "kernel_size": 5,
            "optimizer": "sgd",
        },
        metric_definitions=[
            {"Name": "train:loss", "Regex": "Train Loss: (.*?);"},
            {"Name": "test:loss", "Regex": "Test Average loss: (.*?),"},
            {"Name": "test:accuracy", "Regex": "Test Accuracy: (.*?)%;"},
        ],
        enable_sagemaker_metrics=True,
    )

    cnn_training_job_name = "cnn-training-job-{}".format(int(time.time()))

    # Associate the estimator with the Experiment and Trial
    estimator.fit(
        inputs={"training": inputs},
        job_name=cnn_training_job_name,
        experiment_config={
            "TrialName": cnn_trial.trial_name,
            "TrialComponentDisplayName": "Training",
        },
        wait=True,
    )

    # Wait two seconds before dispatching the next training job
    time.sleep(2)
INFO:sagemaker.image_uris:Defaulting to the only supported framework/algorithm version: latest.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker:Creating training-job with name: cnn-training-job-1650241888
...
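
The `metric_definitions` passed to the estimator are regular expressions whose first capture group holds the metric value. The sketch below tries them locally with Python's `re` module. The log lines are hypothetical: the actual format depends on what `mnist.py` prints, which is not shown here, and SageMaker performs the extraction on its side rather than with this code.

```python
import re

# Same metric definitions as in the estimator above.
metric_definitions = [
    {"Name": "train:loss", "Regex": "Train Loss: (.*?);"},
    {"Name": "test:loss", "Regex": "Test Average loss: (.*?),"},
    {"Name": "test:accuracy", "Regex": "Test Accuracy: (.*?)%;"},
]

# Hypothetical log output, written so that each regex has something to match.
sample_log = """\
Train Loss: 0.157259;
Test Average loss: 0.0512, Test Accuracy: 97%;
"""

extracted = {}
for definition in metric_definitions:
    match = re.search(definition["Regex"], sample_log)
    if match:
        # The first capture group is the numeric value of the metric.
        extracted[definition["Name"]] = float(match.group(1))

print(extracted)
```

Checking the patterns against a few real log lines from a local run of the training script is a cheap way to catch a regex that never matches before launching a training job.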

Compare the model training runs for an experiment

Now we use the analytics capabilities of the Experiments SDK to query and compare the training runs and identify the best model produced by our experiment. You can retrieve trial components by using a search expression.

Some Simple Analyses

[16]:
search_expression = {
    "Filters": [
        {
            "Name": "DisplayName",
            "Operator": "Equals",
            "Value": "Training",
        }
    ],
}
[17]:
trial_component_analytics = ExperimentAnalytics(
    sagemaker_session=Session(sess, sm),
    experiment_name=mnist_experiment.experiment_name,
    search_expression=search_expression,
    sort_by="metrics.test:accuracy.max",
    sort_order="Descending",
    metric_names=["test:accuracy"],
    parameter_names=["hidden_channels", "epochs", "dropout", "optimizer"],
)
[18]:
trial_component_analytics.dataframe()
[18]:
| | TrialComponentName | DisplayName | SourceArn | dropout | epochs | hidden_channels | optimizer | test:accuracy - Min | test:accuracy - Max | test:accuracy - Avg | ... | test:accuracy - Last | test:accuracy - Count | training - MediaType | training - Value | SageMaker.DebugHookOutput - MediaType | SageMaker.DebugHookOutput - Value | SageMaker.ModelArtifact - MediaType | SageMaker.ModelArtifact - Value | Trials | Experiments |
| 0 | cnn-training-job-1650243007-aws-training-job | Training | arn:aws:sagemaker:us-west-2:000000000000:train... | 0.2 | 2.0 | 32.0 | "sgd" | 95.0 | 97.0 | 96.0 | ... | 97.0 | 2 | None | s3://sagemaker-us-west-2-000000000000/DEMO-mnist | None | s3://sagemaker-us-west-2-000000000000/ | None | s3://sagemaker-us-west-2-000000000000/cnn-trai... | [cnn-training-job-32-hidden-channels-1650243007] | [mnist-hand-written-digits-classification-1650... |
| 1 | cnn-training-job-1650242721-aws-training-job | Training | arn:aws:sagemaker:us-west-2:000000000000:train... | 0.2 | 2.0 | 20.0 | "sgd" | 0.0 | 96.0 | 0.0 | ... | 97.0 | 0 | None | s3://sagemaker-us-west-2-000000000000/DEMO-mnist | None | s3://sagemaker-us-west-2-000000000000/ | None | s3://sagemaker-us-west-2-000000000000/cnn-trai... | [cnn-training-job-20-hidden-channels-1650242720] | [mnist-hand-written-digits-classification-1650... |
| 2 | cnn-training-job-1650242174-aws-training-job | Training | arn:aws:sagemaker:us-west-2:000000000000:train... | 0.2 | 2.0 | 5.0 | "sgd" | 0.0 | 94.0 | 0.0 | ... | 96.0 | 0 | None | s3://sagemaker-us-west-2-000000000000/DEMO-mnist | None | s3://sagemaker-us-west-2-000000000000/ | None | s3://sagemaker-us-west-2-000000000000/cnn-trai... | [cnn-training-job-5-hidden-channels-1650242174] | [mnist-hand-written-digits-classification-1650... |
| 3 | cnn-training-job-1650242433-aws-training-job | Training | arn:aws:sagemaker:us-west-2:000000000000:train... | 0.2 | 2.0 | 10.0 | "sgd" | 0.0 | 0.0 | 0.0 | ... | 97.0 | 0 | None | s3://sagemaker-us-west-2-000000000000/DEMO-mnist | None | s3://sagemaker-us-west-2-000000000000/ | None | s3://sagemaker-us-west-2-000000000000/cnn-trai... | [cnn-training-job-10-hidden-channels-1650242433] | [mnist-hand-written-digits-classification-1650... |
| 4 | cnn-training-job-1650241888-aws-training-job | Training | arn:aws:sagemaker:us-west-2:000000000000:train... | 0.2 | 2.0 | 2.0 | "sgd" | 0.0 | 0.0 | 0.0 | ... | 97.0 | 0 | None | s3://sagemaker-us-west-2-000000000000/DEMO-mnist | None | s3://sagemaker-us-west-2-000000000000/ | None | s3://sagemaker-us-west-2-000000000000/cnn-trai... | [cnn-training-job-2-hidden-channels-1650241888] | [mnist-hand-written-digits-classification-1650... |

5 rows × 21 columns
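
The ordering of the table follows from `sort_by="metrics.test:accuracy.max"` with `sort_order="Descending"`. As a minimal local sketch of that ranking (plain Python, not the `ExperimentAnalytics` implementation), the snippet below sorts the `hidden_channels` and `test:accuracy - Max` values read off the table above. The dictionary keys are made up for illustration.

```python
# (hidden_channels, "test:accuracy - Max") pairs taken from the table above,
# listed here in the order the training jobs were launched.
results = [
    {"hidden_channels": 2, "test_accuracy_max": 0.0},
    {"hidden_channels": 5, "test_accuracy_max": 94.0},
    {"hidden_channels": 10, "test_accuracy_max": 0.0},
    {"hidden_channels": 20, "test_accuracy_max": 96.0},
    {"hidden_channels": 32, "test_accuracy_max": 97.0},
]

# Descending sort on the maximum test accuracy; the first entry is the "best" run.
ranked = sorted(results, key=lambda r: r["test_accuracy_max"], reverse=True)
best = ranked[0]

print("best hidden_channels:", best["hidden_channels"])
print("ranking:", [r["hidden_channels"] for r in ranked])
```

Note that the runs with 10 and 2 hidden channels rank last only because their Max column reads 0.0 with a Count of 0, even though their Last value is 97.0. This may indicate that the summary statistics for those runs were not fully populated when the table was generated, so it is worth checking the Count column before trusting the ranking.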

To isolate and measure the impact of the number of hidden channels on model accuracy, we vary the number of hidden channels and keep the other hyperparameters fixed.

Next let’s look at an example of tracing the lineage of a model by accessing the data tracked by SageMaker Experiments for the cnn-training-job-2-hidden-channels trial.

[19]:
lineage_table = ExperimentAnalytics(
    sagemaker_session=Session(sess, sm),
    search_expression={
        "Filters": [
            {
                "Name": "Parents.TrialName",
                "Operator": "Equals",
                "Value": hidden_channel_trial_name_map[2],
            }
        ]
    },
    sort_by="CreationTime",
    sort_order="Ascending",
)
[20]:
lineage_table.dataframe()
[20]:
| | TrialComponentName | DisplayName | normalization_mean | normalization_std | mnist-dataset - MediaType | mnist-dataset - Value | Trials | Experiments | SourceArn | SageMaker.ImageUri | ... | train:loss - Avg | train:loss - StdDev | train:loss - Last | train:loss - Count | training - MediaType | training - Value | SageMaker.DebugHookOutput - MediaType | SageMaker.DebugHookOutput - Value | SageMaker.ModelArtifact - MediaType | SageMaker.ModelArtifact - Value |
| 0 | TrialComponent-2022-04-18-003106-yizb | Preprocessing | 0.1307 | 0.3081 | s3/uri | s3://sagemaker-us-west-2-000000000000/DEMO-mnist | [cnn-training-job-5-hidden-channels-1650242174... | [mnist-hand-written-digits-classification-1650... | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | cnn-training-job-1650241888-aws-training-job | Training | NaN | NaN | NaN | NaN | [cnn-training-job-2-hidden-channels-1650241888] | [mnist-hand-written-digits-classification-1650... | arn:aws:sagemaker:us-west-2:000000000000:train... | 520713654638.dkr.ecr.us-west-2.amazonaws.com/s... | ... | 0.0 | 0.0 | 0.157259 | 0.0 | NaN | s3://sagemaker-us-west-2-000000000000/DEMO-mnist | NaN | s3://sagemaker-us-west-2-000000000000/ | NaN | s3://sagemaker-us-west-2-000000000000/cnn-trai... |

2 rows × 48 columns
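
The lineage query above keeps every trial component whose parent trials include the chosen trial, ordered by creation time. The following sketch reproduces that logic on a small in-memory list. It is a toy illustration rather than the SDK's search API: the record layout is made up, and the timestamps are illustrative values chosen to be consistent with the names in the tables above.

```python
from datetime import datetime

TARGET_TRIAL = "cnn-training-job-2-hidden-channels-1650241888"
OTHER_TRIAL = "cnn-training-job-5-hidden-channels-1650242174"

# Hypothetical records: one shared preprocessing step and two training steps.
components = [
    {"display_name": "Training",
     "created": datetime(2022, 4, 18, 0, 36, 14),
     "parents": [OTHER_TRIAL]},
    {"display_name": "Training",
     "created": datetime(2022, 4, 18, 0, 31, 28),
     "parents": [TARGET_TRIAL]},
    {"display_name": "Preprocessing",
     "created": datetime(2022, 4, 18, 0, 31, 6),
     "parents": [TARGET_TRIAL, OTHER_TRIAL]},
]

# Keep components attached to the target trial, oldest first.
lineage = sorted(
    (c for c in components if TARGET_TRIAL in c["parents"]),
    key=lambda c: c["created"],
)

for step in lineage:
    print(step["created"].isoformat(), step["display_name"])
```

Because the preprocessing component is attached to every trial, it appears in the lineage of each one, which is why its Trials column in the table lists more than one trial.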

Deploy an endpoint for the best training job / trial component

Now we take the best model and deploy it to an endpoint so it is available to perform inference.

[21]:
# The analytics dataframe is sorted by test accuracy in descending order, so the first row is the best trial component
best_trial_component_name = trial_component_analytics.dataframe().iloc[0]["TrialComponentName"]
best_trial_component = TrialComponent.load(best_trial_component_name)

model_data = best_trial_component.output_artifacts["SageMaker.ModelArtifact"].value
env = {
    "hidden_channels": str(int(best_trial_component.parameters["hidden_channels"])),
    "dropout": str(best_trial_component.parameters["dropout"]),
    "kernel_size": str(int(best_trial_component.parameters["kernel_size"])),
}
model = PyTorchModel(
    model_data,
    role,
    "./mnist.py",
    py_version="py3",
    env=env,
    sagemaker_session=sagemaker.Session(sagemaker_client=sm),
    framework_version="1.1.0",
    name=best_trial_component.trial_component_name,
)

predictor = model.deploy(instance_type="ml.m5.xlarge", initial_instance_count=1)
INFO:sagemaker:Creating model with name: cnn-training-job-1650243007-aws-training-job
INFO:sagemaker:Creating endpoint-config with name cnn-training-job-1650243007-aws-trainin-2022-04-18-00-55-04-889
INFO:sagemaker:Creating endpoint with name cnn-training-job-1650243007-aws-trainin-2022-04-18-00-55-04-889
----!
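
The `int(...)` and `str(...)` conversions in the `env` dictionary matter because numeric parameters come back from the trial component as floats (the analytics table shows `hidden_channels` as 32.0), while environment variables must be strings. The sketch below walks through that round trip. The `read_model_config` helper is hypothetical: it only illustrates how a serving script could parse the values, and is not taken from `mnist.py`.

```python
# Parameter values in the shape the tracked trial component reports them:
# integer-valued hyperparameters are stored as floats.
parameters = {"hidden_channels": 32.0, "dropout": 0.2, "kernel_size": 5.0}

# Same conversions as in the deployment cell above.
env = {
    "hidden_channels": str(int(parameters["hidden_channels"])),
    "dropout": str(parameters["dropout"]),
    "kernel_size": str(int(parameters["kernel_size"])),
}


def read_model_config(environ):
    """Hypothetical parsing step on the inference side."""
    return {
        "hidden_channels": int(environ["hidden_channels"]),
        "dropout": float(environ["dropout"]),
        "kernel_size": int(environ["kernel_size"]),
    }


config = read_model_config(env)
print(env)
print(config)

# Without the int() conversion the value would be the string "32.0",
# which int() cannot parse on the other side.
try:
    int(str(parameters["hidden_channels"]))
    parse_failed = False
except ValueError:
    parse_failed = True
print("int('32.0') fails:", parse_failed)
```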

Cleanup

Once we’re done, clean up the endpoint to prevent unnecessary billing.

[22]:
predictor.delete_endpoint()
INFO:sagemaker:Deleting endpoint configuration with name: cnn-training-job-1650243007-aws-trainin-2022-04-18-00-55-04-889
INFO:sagemaker:Deleting endpoint with name: cnn-training-job-1650243007-aws-trainin-2022-04-18-00-55-04-889

Trial components can exist independently of trials and experiments. You might want to keep them if you plan further exploration. If not, delete all experiment artifacts.

[23]:
mnist_experiment.delete_all(action="--force")

Contact

Submit any questions or issues to https://github.com/aws/sagemaker-experiments/issues or mention @aws/sagemakerexperimentsadmin