[1]:
# Parameters
kms_key = "arn:aws:kms:us-west-2:000000000000:1234abcd-12ab-34cd-56ef-1234567890ab"

Run a SageMaker Experiment with MNIST Handwritten Digits Classification

This demo shows how you can use the SageMaker Experiments Python SDK to organize, track, compare, and evaluate your machine learning (ML) model training experiments.

You can track artifacts for experiments, including data sets, algorithms, hyperparameters, and metrics. Experiments executed on SageMaker such as SageMaker Autopilot jobs and training jobs are automatically tracked. You can also track artifacts for additional steps within an ML workflow that come before or after model training, such as data pre-processing or post-training model evaluation.

The APIs also let you search and browse your current and past experiments, compare experiments, and identify best-performing models.
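
For example, the ExperimentAnalytics class from the SageMaker Python SDK (imported in the Setup section below) can load an experiment's trial components into a pandas DataFrame for comparison. A minimal sketch, where the experiment name ("my-experiment") and the sort key are placeholders:

import sagemaker
from sagemaker.analytics import ExperimentAnalytics

# Sketch only: pull every trial component of an experiment into a DataFrame,
# sorted by best test accuracy. "my-experiment" is a placeholder name.
analytics = ExperimentAnalytics(
    sagemaker_session=sagemaker.Session(),
    experiment_name="my-experiment",
    sort_by="metrics.test:accuracy.max",
    sort_order="Descending",
)
df = analytics.dataframe()
df.head()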

We demonstrate these capabilities through an MNIST handwritten digits classification example. The experiment is organized as follows:

  1. Download and prepare the MNIST dataset.

  2. Train a Convolutional Neural Network (CNN) Model. Tune the hyperparameter that configures the number of hidden channels in the model. Track the parameter configurations and resulting model accuracy using the SageMaker Experiments Python SDK.

  3. Finally, use the search and analytics capabilities of the SDK to compare and evaluate the performance of all the model versions generated by the tuning in Step 2.

  4. We also show an example of tracing the complete lineage of a model version: the collection of all the data pre-processing and training configurations and inputs that went into creating that model version.

Make sure you select the Python 3 (Data Science) kernel in Studio, or the conda_pytorch_p36 kernel in a notebook instance.

Runtime

This notebook takes approximately 25 minutes to run.

Contents

  1. Install modules

  2. Setup

  3. Download the dataset

  4. Step 1: Set up the Experiment

  5. Step 2: Track Experiment

  6. Deploy an endpoint for the best training job / trial component

  7. Cleanup

  8. Contact

Install modules

[2]:
import sys

Install the SageMaker Experiments Python SDK

[3]:
!{sys.executable} -m pip install sagemaker-experiments==0.1.35
/opt/conda/lib/python3.7/site-packages/secretstorage/dhcrypto.py:16: CryptographyDeprecationWarning: int_from_bytes is deprecated, use int.from_bytes instead
  from cryptography.utils import int_from_bytes
/opt/conda/lib/python3.7/site-packages/secretstorage/util.py:25: CryptographyDeprecationWarning: int_from_bytes is deprecated, use int.from_bytes instead
  from cryptography.utils import int_from_bytes
Collecting sagemaker-experiments==0.1.35
  Downloading sagemaker_experiments-0.1.35-py3-none-any.whl (42 kB)
     |████████████████████████████████| 42 kB 1.3 MB/s
Requirement already satisfied: boto3>=1.16.27 in /opt/conda/lib/python3.7/site-packages (from sagemaker-experiments==0.1.35) (1.20.47)
Requirement already satisfied: botocore<1.24.0,>=1.23.47 in /opt/conda/lib/python3.7/site-packages (from boto3>=1.16.27->sagemaker-experiments==0.1.35) (1.23.47)
Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /opt/conda/lib/python3.7/site-packages (from boto3>=1.16.27->sagemaker-experiments==0.1.35) (0.10.0)
Requirement already satisfied: s3transfer<0.6.0,>=0.5.0 in /opt/conda/lib/python3.7/site-packages (from boto3>=1.16.27->sagemaker-experiments==0.1.35) (0.5.0)
Requirement already satisfied: urllib3<1.27,>=1.25.4 in /opt/conda/lib/python3.7/site-packages (from botocore<1.24.0,>=1.23.47->boto3>=1.16.27->sagemaker-experiments==0.1.35) (1.26.6)
Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /opt/conda/lib/python3.7/site-packages (from botocore<1.24.0,>=1.23.47->boto3>=1.16.27->sagemaker-experiments==0.1.35) (2.8.1)
Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.24.0,>=1.23.47->boto3>=1.16.27->sagemaker-experiments==0.1.35) (1.14.0)
Installing collected packages: sagemaker-experiments
Successfully installed sagemaker-experiments-0.1.35
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
WARNING: You are using pip version 21.1.3; however, version 22.2.2 is available.
You should consider upgrading via the '/opt/conda/bin/python3 -m pip install --upgrade pip' command.

Install PyTorch

[4]:
# PyTorch version needs to be the same in both the notebook instance and the training job container
# https://github.com/pytorch/pytorch/issues/25214
!{sys.executable} -m pip install torch==1.1.0
!{sys.executable} -m pip install torchvision==0.2.2
!{sys.executable} -m pip install pillow==6.2.2
!{sys.executable} -m pip install --upgrade sagemaker
/opt/conda/lib/python3.7/site-packages/secretstorage/dhcrypto.py:16: CryptographyDeprecationWarning: int_from_bytes is deprecated, use int.from_bytes instead
  from cryptography.utils import int_from_bytes
/opt/conda/lib/python3.7/site-packages/secretstorage/util.py:25: CryptographyDeprecationWarning: int_from_bytes is deprecated, use int.from_bytes instead
  from cryptography.utils import int_from_bytes
Collecting torch==1.1.0
  Downloading torch-1.1.0-cp37-cp37m-manylinux1_x86_64.whl (676.9 MB)
     |████████████████████████████████| 676.9 MB 1.4 kB/s
Requirement already satisfied: numpy in /opt/conda/lib/python3.7/site-packages (from torch==1.1.0) (1.21.1)
Installing collected packages: torch
Successfully installed torch-1.1.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
WARNING: You are using pip version 21.1.3; however, version 22.2.2 is available.
You should consider upgrading via the '/opt/conda/bin/python3 -m pip install --upgrade pip' command.
/opt/conda/lib/python3.7/site-packages/secretstorage/dhcrypto.py:16: CryptographyDeprecationWarning: int_from_bytes is deprecated, use int.from_bytes instead
  from cryptography.utils import int_from_bytes
/opt/conda/lib/python3.7/site-packages/secretstorage/util.py:25: CryptographyDeprecationWarning: int_from_bytes is deprecated, use int.from_bytes instead
  from cryptography.utils import int_from_bytes
Collecting torchvision==0.2.2
  Downloading torchvision-0.2.2-py2.py3-none-any.whl (64 kB)
     |████████████████████████████████| 64 kB 2.0 MB/s
Collecting tqdm==4.19.9
  Downloading tqdm-4.19.9-py2.py3-none-any.whl (52 kB)
     |████████████████████████████████| 52 kB 1.7 MB/s
Requirement already satisfied: numpy in /opt/conda/lib/python3.7/site-packages (from torchvision==0.2.2) (1.21.1)
Requirement already satisfied: six in /opt/conda/lib/python3.7/site-packages (from torchvision==0.2.2) (1.14.0)
Requirement already satisfied: torch in /opt/conda/lib/python3.7/site-packages (from torchvision==0.2.2) (1.1.0)
Requirement already satisfied: pillow>=4.1.1 in /opt/conda/lib/python3.7/site-packages (from torchvision==0.2.2) (8.3.1)
Installing collected packages: tqdm, torchvision
  Attempting uninstall: tqdm
    Found existing installation: tqdm 4.42.1
    Uninstalling tqdm-4.42.1:
      Successfully uninstalled tqdm-4.42.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
papermill 2.3.4 requires tqdm>=4.32.2, but you have tqdm 4.19.9 which is incompatible.
Successfully installed torchvision-0.2.2 tqdm-4.19.9
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
WARNING: You are using pip version 21.1.3; however, version 22.2.2 is available.
You should consider upgrading via the '/opt/conda/bin/python3 -m pip install --upgrade pip' command.
/opt/conda/lib/python3.7/site-packages/secretstorage/dhcrypto.py:16: CryptographyDeprecationWarning: int_from_bytes is deprecated, use int.from_bytes instead
  from cryptography.utils import int_from_bytes
/opt/conda/lib/python3.7/site-packages/secretstorage/util.py:25: CryptographyDeprecationWarning: int_from_bytes is deprecated, use int.from_bytes instead
  from cryptography.utils import int_from_bytes
Collecting pillow==6.2.2
  Downloading Pillow-6.2.2-cp37-cp37m-manylinux1_x86_64.whl (2.1 MB)
     |████████████████████████████████| 2.1 MB 2.0 MB/s
Installing collected packages: pillow
  Attempting uninstall: pillow
    Found existing installation: Pillow 8.3.1
    Uninstalling Pillow-8.3.1:
      Successfully uninstalled Pillow-8.3.1
Successfully installed pillow-6.2.2
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
WARNING: You are using pip version 21.1.3; however, version 22.2.2 is available.
You should consider upgrading via the '/opt/conda/bin/python3 -m pip install --upgrade pip' command.
/opt/conda/lib/python3.7/site-packages/secretstorage/dhcrypto.py:16: CryptographyDeprecationWarning: int_from_bytes is deprecated, use int.from_bytes instead
  from cryptography.utils import int_from_bytes
/opt/conda/lib/python3.7/site-packages/secretstorage/util.py:25: CryptographyDeprecationWarning: int_from_bytes is deprecated, use int.from_bytes instead
  from cryptography.utils import int_from_bytes
Requirement already satisfied: sagemaker in /opt/conda/lib/python3.7/site-packages (2.103.0)
Requirement already satisfied: smdebug-rulesconfig==1.0.1 in /opt/conda/lib/python3.7/site-packages (from sagemaker) (1.0.1)
Requirement already satisfied: google-pasta in /opt/conda/lib/python3.7/site-packages (from sagemaker) (0.2.0)
Requirement already satisfied: pandas in /opt/conda/lib/python3.7/site-packages (from sagemaker) (1.0.1)
Requirement already satisfied: pathos in /opt/conda/lib/python3.7/site-packages (from sagemaker) (0.2.8)
Requirement already satisfied: boto3<2.0,>=1.20.21 in /opt/conda/lib/python3.7/site-packages (from sagemaker) (1.20.47)
Requirement already satisfied: protobuf<4.0,>=3.1 in /opt/conda/lib/python3.7/site-packages (from sagemaker) (3.17.3)
Requirement already satisfied: attrs<22,>=20.3.0 in /opt/conda/lib/python3.7/site-packages (from sagemaker) (21.4.0)
Requirement already satisfied: numpy<2.0,>=1.9.0 in /opt/conda/lib/python3.7/site-packages (from sagemaker) (1.21.1)
Requirement already satisfied: protobuf3-to-dict<1.0,>=0.1.5 in /opt/conda/lib/python3.7/site-packages (from sagemaker) (0.1.5)
Requirement already satisfied: packaging>=20.0 in /opt/conda/lib/python3.7/site-packages (from sagemaker) (20.1)
Requirement already satisfied: importlib-metadata<5.0,>=1.4.0 in /opt/conda/lib/python3.7/site-packages (from sagemaker) (1.5.0)
Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /opt/conda/lib/python3.7/site-packages (from boto3<2.0,>=1.20.21->sagemaker) (0.10.0)
Requirement already satisfied: s3transfer<0.6.0,>=0.5.0 in /opt/conda/lib/python3.7/site-packages (from boto3<2.0,>=1.20.21->sagemaker) (0.5.0)
Requirement already satisfied: botocore<1.24.0,>=1.23.47 in /opt/conda/lib/python3.7/site-packages (from boto3<2.0,>=1.20.21->sagemaker) (1.23.47)
Requirement already satisfied: urllib3<1.27,>=1.25.4 in /opt/conda/lib/python3.7/site-packages (from botocore<1.24.0,>=1.23.47->boto3<2.0,>=1.20.21->sagemaker) (1.26.6)
Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /opt/conda/lib/python3.7/site-packages (from botocore<1.24.0,>=1.23.47->boto3<2.0,>=1.20.21->sagemaker) (2.8.1)
Requirement already satisfied: zipp>=0.5 in /opt/conda/lib/python3.7/site-packages (from importlib-metadata<5.0,>=1.4.0->sagemaker) (2.2.0)
Requirement already satisfied: six in /opt/conda/lib/python3.7/site-packages (from packaging>=20.0->sagemaker) (1.14.0)
Requirement already satisfied: pyparsing>=2.0.2 in /opt/conda/lib/python3.7/site-packages (from packaging>=20.0->sagemaker) (2.4.6)
Requirement already satisfied: pytz>=2017.2 in /opt/conda/lib/python3.7/site-packages (from pandas->sagemaker) (2019.3)
Requirement already satisfied: pox>=0.3.0 in /opt/conda/lib/python3.7/site-packages (from pathos->sagemaker) (0.3.0)
Requirement already satisfied: ppft>=1.6.6.4 in /opt/conda/lib/python3.7/site-packages (from pathos->sagemaker) (1.6.6.4)
Requirement already satisfied: multiprocess>=0.70.12 in /opt/conda/lib/python3.7/site-packages (from pathos->sagemaker) (0.70.12.2)
Requirement already satisfied: dill>=0.3.4 in /opt/conda/lib/python3.7/site-packages (from pathos->sagemaker) (0.3.4)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
WARNING: You are using pip version 21.1.3; however, version 22.2.2 is available.
You should consider upgrading via the '/opt/conda/bin/python3 -m pip install --upgrade pip' command.

Setup

[5]:
import time

import boto3
import numpy as np
import pandas as pd
from IPython.display import set_matplotlib_formats
from matplotlib import pyplot as plt
from torchvision import datasets, transforms

import sagemaker
from sagemaker import get_execution_role
from sagemaker.session import Session
from sagemaker.analytics import ExperimentAnalytics

from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent
from smexperiments.tracker import Tracker

set_matplotlib_formats("retina")
[6]:
sm_sess = sagemaker.Session()
sess = sm_sess.boto_session
sm = sm_sess.sagemaker_client
role = get_execution_role()
region = sess.region_name

Download the dataset

We download the MNIST handwritten digits dataset, then apply a transformation to each image.

[7]:
bucket = sm_sess.default_bucket()
prefix = "DEMO-mnist"
print("Using S3 location: s3://" + bucket + "/" + prefix + "/")

datasets.MNIST.urls = [
    "https://sagemaker-sample-files.s3.amazonaws.com/datasets/image/MNIST/train-images-idx3-ubyte.gz",
    "https://sagemaker-sample-files.s3.amazonaws.com/datasets/image/MNIST/train-labels-idx1-ubyte.gz",
    "https://sagemaker-sample-files.s3.amazonaws.com/datasets/image/MNIST/t10k-images-idx3-ubyte.gz",
    "https://sagemaker-sample-files.s3.amazonaws.com/datasets/image/MNIST/t10k-labels-idx1-ubyte.gz",
]

# Download the dataset to the ./mnist folder, then load and normalize the images
train_set = datasets.MNIST(
    "mnist",
    train=True,
    transform=transforms.Compose(
        [transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]
    ),
    download=True,
)

test_set = datasets.MNIST(
    "mnist",
    train=False,
    transform=transforms.Compose(
        [transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]
    ),
    download=False,
)
0.00B [00:00, ?B/s]
Using S3 location: s3://sagemaker-us-west-2-000000000000/DEMO-mnist/
Downloading https://sagemaker-sample-files.s3.amazonaws.com/datasets/image/MNIST/train-images-idx3-ubyte.gz to mnist/MNIST/raw/train-images-idx3-ubyte.gz
9.92MB [00:00, 10.6MB/s]
Extracting mnist/MNIST/raw/train-images-idx3-ubyte.gz
0.00B [00:00, ?B/s]
Downloading https://sagemaker-sample-files.s3.amazonaws.com/datasets/image/MNIST/train-labels-idx1-ubyte.gz to mnist/MNIST/raw/train-labels-idx1-ubyte.gz
32.8kB [00:00, 86.8kB/s]
0.00B [00:00, ?B/s]
Extracting mnist/MNIST/raw/train-labels-idx1-ubyte.gz
Downloading https://sagemaker-sample-files.s3.amazonaws.com/datasets/image/MNIST/t10k-images-idx3-ubyte.gz to mnist/MNIST/raw/t10k-images-idx3-ubyte.gz
1.65MB [00:00, 2.33MB/s]
0.00B [00:00, ?B/s]
Extracting mnist/MNIST/raw/t10k-images-idx3-ubyte.gz
Downloading https://sagemaker-sample-files.s3.amazonaws.com/datasets/image/MNIST/t10k-labels-idx1-ubyte.gz to mnist/MNIST/raw/t10k-labels-idx1-ubyte.gz
8.19kB [00:00, 26.4kB/s]
Extracting mnist/MNIST/raw/t10k-labels-idx1-ubyte.gz
Processing...
Done!

View an example image from the dataset.

[8]:
plt.imshow(train_set.data[2].numpy())
[8]:
<matplotlib.image.AxesImage at 0x7f50bb2989d0>
[Image output: a sample handwritten digit from the training set]

After transforming the images in the dataset, we upload the dataset to S3.

[9]:
inputs = sagemaker.Session().upload_data(path="mnist", bucket=bucket, key_prefix=prefix)

Now let’s track the parameters from the data pre-processing step.

[10]:
with Tracker.create(display_name="Preprocessing", sagemaker_boto_client=sm) as tracker:
    tracker.log_parameters(
        {
            "normalization_mean": 0.1307,
            "normalization_std": 0.3081,
        }
    )
    # Log the S3 URI of the dataset we just uploaded
    tracker.log_input(name="mnist-dataset", media_type="s3/uri", value=inputs)

Step 1: Set up the Experiment

Create an experiment to track all the model training iterations. Experiments are a great way to organize your data science work. For example, you can create an experiment for a business use case you are addressing (e.g. an experiment named "customer churn prediction"), for a data science team that owns the experiment (e.g. "marketing analytics experiment"), or for a specific data science and ML project. Think of an experiment as a "folder" for organizing your "files".

Create an Experiment

[11]:
mnist_experiment = Experiment.create(
    experiment_name=f"mnist-hand-written-digits-classification-{int(time.time())}",
    description="Classification of mnist hand-written digits",
    sagemaker_boto_client=sm,
)
print(mnist_experiment)
Experiment(sagemaker_boto_client=<botocore.client.SageMaker object at 0x7f50bc1de150>,experiment_name='mnist-hand-written-digits-classification-1660168470',description='Classification of mnist hand-written digits',tags=None,experiment_arn='arn:aws:sagemaker:us-west-2:000000000000:experiment/mnist-hand-written-digits-classification-1660168470',response_metadata={'RequestId': '1ce1ef68-3c15-4971-b77a-4c5f258887b7', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '1ce1ef68-3c15-4971-b77a-4c5f258887b7', 'content-type': 'application/x-amz-json-1.1', 'content-length': '123', 'date': 'Wed, 10 Aug 2022 21:54:29 GMT'}, 'RetryAttempts': 0})

Step 2: Track Experiment

Now create a Trial for each training run to track its inputs, parameters, and metrics.

While training the CNN model on SageMaker, we experiment with several values for the number of hidden channels in the model. We create a Trial to track each training job run. We also take the TrialComponent created by the Tracker in the pre-processing step and add it to each Trial. This enriches the Trial with the parameters we captured from the data pre-processing stage.

[12]:
from sagemaker.pytorch import PyTorch, PyTorchModel
[13]:
hidden_channel_trial_name_map = {}

If you want to run the following five training jobs in parallel, you may need to increase your resource limit. Here we run them sequentially.
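
If you do launch them asynchronously, a minimal sketch is to pass wait=False to estimator.fit and block on the jobs afterwards. This assumes your account's concurrent training job quota allows it, and it omits the Trial creation and experiment_config wiring used in the sequential loop below:

# Sketch only (not executed in this notebook): start all jobs without blocking,
# then wait for them to finish. Trial/experiment_config wiring omitted for brevity.
running_estimators = []
for num_hidden_channel in [2, 5, 10, 20, 32]:
    est = PyTorch(
        py_version="py3",
        entry_point="./mnist.py",
        role=role,
        sagemaker_session=sagemaker.Session(sagemaker_client=sm),
        framework_version="1.1.0",
        instance_count=1,
        instance_type="ml.c4.xlarge",
        hyperparameters={
            "epochs": 2,
            "backend": "gloo",
            "hidden_channels": num_hidden_channel,
            "dropout": 0.2,
            "kernel_size": 5,
            "optimizer": "sgd",
        },
    )
    est.fit(inputs={"training": inputs}, wait=False)  # returns as soon as the job is created
    running_estimators.append(est)

# Block until every job completes, skipping the streamed training logs.
for est in running_estimators:
    est.latest_training_job.wait(logs="None")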

[14]:
preprocessing_trial_component = tracker.trial_component
[15]:
for i, num_hidden_channel in enumerate([2, 5, 10, 20, 32]):
    # Create trial
    trial_name = f"cnn-training-job-{num_hidden_channel}-hidden-channels-{int(time.time())}"
    cnn_trial = Trial.create(
        trial_name=trial_name,
        experiment_name=mnist_experiment.experiment_name,
        sagemaker_boto_client=sm,
    )
    hidden_channel_trial_name_map[num_hidden_channel] = trial_name

    # Associate the preprocessing trial component with the current trial
    cnn_trial.add_trial_component(preprocessing_trial_component)

    # All input configurations, parameters, and metrics specified in
    # the estimator definition are automatically tracked
    estimator = PyTorch(
        py_version="py3",
        entry_point="./mnist.py",
        role=role,
        sagemaker_session=sagemaker.Session(sagemaker_client=sm),
        framework_version="1.1.0",
        instance_count=1,
        instance_type="ml.c4.xlarge",
        hyperparameters={
            "epochs": 2,
            "backend": "gloo",
            "hidden_channels": num_hidden_channel,
            "dropout": 0.2,
            "kernel_size": 5,
            "optimizer": "sgd",
        },
        metric_definitions=[
            {"Name": "train:loss", "Regex": "Train Loss: (.*?);"},
            {"Name": "test:loss", "Regex": "Test Average loss: (.*?),"},
            {"Name": "test:accuracy", "Regex": "Test Accuracy: (.*?)%;"},
        ],
        enable_sagemaker_metrics=True,
    )

    cnn_training_job_name = "cnn-training-job-{}".format(int(time.time()))

    # Associate the estimator with the Experiment and Trial
    estimator.fit(
        inputs={"training": inputs},
        job_name=cnn_training_job_name,
        experiment_config={
            "TrialName": cnn_trial.trial_name,
            "TrialComponentDisplayName": "Training",
        },
        wait=True,
    )

    # Wait two seconds before dispatching the next training job
    time.sleep(2)
INFO:sagemaker.image_uris:Defaulting to the only supported framework/algorithm version: latest.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: cnn-training-job-1660168481
2022-08-10 21:54:41 Starting - Starting the training job...
2022-08-10 21:55:08 Starting - Preparing the instances for trainingProfilerReport-1660168481: InProgress
.........
2022-08-10 21:56:25 Downloading - Downloading input data........bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2022-08-10 21:57:59,050 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2022-08-10 21:57:59,054 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2022-08-10 21:57:59,066 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2022-08-10 21:57:59,067 sagemaker_pytorch_container.training INFO     Invoking user training script.
2022-08-10 21:57:59,746 sagemaker-containers INFO     Module mnist does not provide a setup.py. 
Generating setup.py
2022-08-10 21:57:59,746 sagemaker-containers INFO     Generating setup.cfg
2022-08-10 21:57:59,747 sagemaker-containers INFO     Generating MANIFEST.in
2022-08-10 21:57:59,747 sagemaker-containers INFO     Installing module with the following command:
/usr/bin/python -m pip install . 

2022-08-10 21:58:06 Training - Training image download completed. Training in progress.Processing /opt/ml/code
Building wheels for collected packages: mnist
  Running setup.py bdist_wheel for mnist: started
  Running setup.py bdist_wheel for mnist: finished with status 'done'
  Stored in directory: /tmp/pip-ephem-wheel-cache-el4o4mci/wheels/35/24/16/37574d11bf9bde50616c67372a334f94fa8356bc7164af8ca3
Successfully built mnist
Installing collected packages: mnist
Successfully installed mnist-1.0.0
You are using pip version 18.1, however version 21.3.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
2022-08-10 21:58:02,417 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2022-08-10 21:58:02,432 sagemaker-containers INFO     Invoking user script
Training Env:
{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "training": "/opt/ml/input/data/training"
    },
    "current_host": "algo-1",
    "framework_module": "sagemaker_pytorch_container.training:main",
    "hosts": [
        "algo-1"
    ],
    "hyperparameters": {
        "backend": "gloo",
        "dropout": 0.2,
        "epochs": 2,
        "hidden_channels": 2,
        "kernel_size": 5,
        "optimizer": "sgd"
    },
    "input_config_dir": "/opt/ml/input/config",
    "input_data_config": {
        "training": {
            "TrainingInputMode": "File",
            "S3DistributionType": "FullyReplicated",
            "RecordWrapperType": "None"
        }
    },
    "input_dir": "/opt/ml/input",
    "is_master": true,
    "job_name": "cnn-training-job-1660168481",
    "log_level": 20,
    "master_hostname": "algo-1",
    "model_dir": "/opt/ml/model",
    "module_dir": "s3://sagemaker-us-west-2-000000000000/cnn-training-job-1660168481/source/sourcedir.tar.gz",
    "module_name": "mnist",
    "network_interface_name": "eth0",
    "num_cpus": 4,
    "num_gpus": 0,
    "output_data_dir": "/opt/ml/output/data",
    "output_dir": "/opt/ml/output",
    "output_intermediate_dir": "/opt/ml/output/intermediate",
    "resource_config": {
        "current_host": "algo-1",
        "current_instance_type": "ml.c4.xlarge",
        "current_group_name": "homogeneousCluster",
        "hosts": [
            "algo-1"
        ],
        "instance_groups": [
            {
                "instance_group_name": "homogeneousCluster",
                "instance_type": "ml.c4.xlarge",
                "hosts": [
                    "algo-1"
                ]
            }
        ],
        "network_interface_name": "eth0"
    },
    "user_entry_point": "mnist.py"
}
Environment variables:
SM_HOSTS=["algo-1"]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={"backend":"gloo","dropout":0.2,"epochs":2,"hidden_channels":2,"kernel_size":5,"optimizer":"sgd"}
SM_USER_ENTRY_POINT=mnist.py
SM_FRAMEWORK_PARAMS={}
SM_RESOURCE_CONFIG={"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.c4.xlarge","hosts":["algo-1"],"instance_groups":[{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.c4.xlarge"}],"network_interface_name":"eth0"}
SM_INPUT_DATA_CONFIG={"training":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=["training"]
SM_CURRENT_HOST=algo-1
SM_MODULE_NAME=mnist
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=sagemaker_pytorch_container.training:main
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=4
SM_NUM_GPUS=0
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://sagemaker-us-west-2-000000000000/cnn-training-job-1660168481/source/sourcedir.tar.gz
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{"training":"/opt/ml/input/data/training"},"current_host":"algo-1","framework_module":"sagemaker_pytorch_container.training:main","hosts":["algo-1"],"hyperparameters":{"backend":"gloo","dropout":0.2,"epochs":2,"hidden_channels":2,"kernel_size":5,"optimizer":"sgd"},"input_config_dir":"/opt/ml/input/config","input_data_config":{"training":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}},"input_dir":"/opt/ml/input","is_master":true,"job_name":"cnn-training-job-1660168481","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-us-west-2-000000000000/cnn-training-job-1660168481/source/sourcedir.tar.gz","module_name":"mnist","network_interface_name":"eth0","num_cpus":4,"num_gpus":0,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.c4.xlarge","hosts":["algo-1"],"instance_groups":[{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.c4.xlarge"}],"network_interface_name":"eth0"},"user_entry_point":"mnist.py"}
SM_USER_ARGS=["--backend","gloo","--dropout","0.2","--epochs","2","--hidden_channels","2","--kernel_size","5","--optimizer","sgd"]
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
SM_CHANNEL_TRAINING=/opt/ml/input/data/training
SM_HP_BACKEND=gloo
SM_HP_DROPOUT=0.2
SM_HP_EPOCHS=2
SM_HP_HIDDEN_CHANNELS=2
SM_HP_KERNEL_SIZE=5
SM_HP_OPTIMIZER=sgd
PYTHONPATH=/usr/local/bin:/usr/lib/python36.zip:/usr/lib/python3.6:/usr/lib/python3.6/lib-dynload:/usr/local/lib/python3.6/dist-packages:/usr/lib/python3/dist-packages
Invoking script with the following command:
/usr/bin/python -m mnist --backend gloo --dropout 0.2 --epochs 2 --hidden_channels 2 --kernel_size 5 --optimizer sgd
Distributed training - False
Number of gpus available - 0
Get train data loader
Get test data loader
Processes 60000/60000 (100%) of train data
Processes 10000/10000 (100%) of test data
Train Epoch: 1 [6400/60000 (11%)], Train Loss: 1.617049;
Train Epoch: 1 [12800/60000 (21%)], Train Loss: 0.941270;
Train Epoch: 1 [19200/60000 (32%)], Train Loss: 0.843992;
Train Epoch: 1 [25600/60000 (43%)], Train Loss: 0.432059;
Train Epoch: 1 [32000/60000 (53%)], Train Loss: 0.464780;
Train Epoch: 1 [38400/60000 (64%)], Train Loss: 0.322854;
Train Epoch: 1 [44800/60000 (75%)], Train Loss: 0.351526;
Train Epoch: 1 [51200/60000 (85%)], Train Loss: 0.389307;
Train Epoch: 1 [57600/60000 (96%)], Train Loss: 0.375780;
Test Average loss: 0.1852, Test Accuracy: 95%;
Train Epoch: 2 [6400/60000 (11%)], Train Loss: 0.275706;
Train Epoch: 2 [12800/60000 (21%)], Train Loss: 0.292543;
Train Epoch: 2 [19200/60000 (32%)], Train Loss: 0.244455;
Train Epoch: 2 [25600/60000 (43%)], Train Loss: 0.283167;
Train Epoch: 2 [32000/60000 (53%)], Train Loss: 0.279576;
Train Epoch: 2 [38400/60000 (64%)], Train Loss: 0.341436;
Train Epoch: 2 [44800/60000 (75%)], Train Loss: 0.414407;
Train Epoch: 2 [51200/60000 (85%)], Train Loss: 0.193495;
Train Epoch: 2 [57600/60000 (96%)], Train Loss: 0.157259;
Test Average loss: 0.1158, Test Accuracy: 97%;
Saving the model.
2022-08-10 21:58:56,590 sagemaker-containers INFO     Reporting training SUCCESS

2022-08-10 21:59:06 Uploading - Uploading generated training model
2022-08-10 21:59:26 Completed - Training job completed
ProfilerReport-1660168481: NoIssuesFound
Training seconds: 173
Billable seconds: 173
INFO:sagemaker.image_uris:Defaulting to the only supported framework/algorithm version: latest.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: cnn-training-job-1660168801
2022-08-10 22:00:01 Starting - Starting the training job...
2022-08-10 22:00:25 Starting - Preparing the instances for trainingProfilerReport-1660168801: InProgress
.........
2022-08-10 22:01:46 Downloading - Downloading input data........bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2022-08-10 22:03:13,258 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2022-08-10 22:03:13,262 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2022-08-10 22:03:13,282 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2022-08-10 22:03:13,283 sagemaker_pytorch_container.training INFO     Invoking user training script.
2022-08-10 22:03:13,783 sagemaker-containers INFO     Module mnist does not provide a setup.py. 
Generating setup.py
2022-08-10 22:03:13,783 sagemaker-containers INFO     Generating setup.cfg
2022-08-10 22:03:13,783 sagemaker-containers INFO     Generating MANIFEST.in
2022-08-10 22:03:13,784 sagemaker-containers INFO     Installing module with the following command:
/usr/bin/python -m pip install . 
Processing /opt/ml/code
Building wheels for collected packages: mnist
  Running setup.py bdist_wheel for mnist: started
  Running setup.py bdist_wheel for mnist: finished with status 'done'
  Stored in directory: /tmp/pip-ephem-wheel-cache-_a37ceja/wheels/35/24/16/37574d11bf9bde50616c67372a334f94fa8356bc7164af8ca3
Successfully built mnist
Installing collected packages: mnist
Successfully installed mnist-1.0.0
You are using pip version 18.1, however version 21.3.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
2022-08-10 22:03:16,072 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2022-08-10 22:03:16,089 sagemaker-containers INFO     Invoking user script
Training Env:
{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "training": "/opt/ml/input/data/training"
    },
    "current_host": "algo-1",
    "framework_module": "sagemaker_pytorch_container.training:main",
    "hosts": [
        "algo-1"
    ],
    "hyperparameters": {
        "backend": "gloo",
        "dropout": 0.2,
        "epochs": 2,
        "hidden_channels": 5,
        "kernel_size": 5,
        "optimizer": "sgd"
    },
    "input_config_dir": "/opt/ml/input/config",
    "input_data_config": {
        "training": {
            "TrainingInputMode": "File",
            "S3DistributionType": "FullyReplicated",
            "RecordWrapperType": "None"
        }
    },
    "input_dir": "/opt/ml/input",
    "is_master": true,
    "job_name": "cnn-training-job-1660168801",
    "log_level": 20,
    "master_hostname": "algo-1",
    "model_dir": "/opt/ml/model",
    "module_dir": "s3://sagemaker-us-west-2-000000000000/cnn-training-job-1660168801/source/sourcedir.tar.gz",
    "module_name": "mnist",
    "network_interface_name": "eth0",
    "num_cpus": 4,
    "num_gpus": 0,
    "output_data_dir": "/opt/ml/output/data",
    "output_dir": "/opt/ml/output",
    "output_intermediate_dir": "/opt/ml/output/intermediate",
    "resource_config": {
        "current_host": "algo-1",
        "current_instance_type": "ml.c4.xlarge",
        "current_group_name": "homogeneousCluster",
        "hosts": [
            "algo-1"
        ],
        "instance_groups": [
            {
                "instance_group_name": "homogeneousCluster",
                "instance_type": "ml.c4.xlarge",
                "hosts": [
                    "algo-1"
                ]
            }
        ],
        "network_interface_name": "eth0"
    },
    "user_entry_point": "mnist.py"
}
Environment variables:
SM_HOSTS=["algo-1"]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={"backend":"gloo","dropout":0.2,"epochs":2,"hidden_channels":5,"kernel_size":5,"optimizer":"sgd"}
SM_USER_ENTRY_POINT=mnist.py
SM_FRAMEWORK_PARAMS={}
SM_RESOURCE_CONFIG={"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.c4.xlarge","hosts":["algo-1"],"instance_groups":[{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.c4.xlarge"}],"network_interface_name":"eth0"}
SM_INPUT_DATA_CONFIG={"training":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=["training"]
SM_CURRENT_HOST=algo-1
SM_MODULE_NAME=mnist
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=sagemaker_pytorch_container.training:main
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=4
SM_NUM_GPUS=0
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://sagemaker-us-west-2-000000000000/cnn-training-job-1660168801/source/sourcedir.tar.gz
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{"training":"/opt/ml/input/data/training"},"current_host":"algo-1","framework_module":"sagemaker_pytorch_container.training:main","hosts":["algo-1"],"hyperparameters":{"backend":"gloo","dropout":0.2,"epochs":2,"hidden_channels":5,"kernel_size":5,"optimizer":"sgd"},"input_config_dir":"/opt/ml/input/config","input_data_config":{"training":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}},"input_dir":"/opt/ml/input","is_master":true,"job_name":"cnn-training-job-1660168801","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-us-west-2-000000000000/cnn-training-job-1660168801/source/sourcedir.tar.gz","module_name":"mnist","network_interface_name":"eth0","num_cpus":4,"num_gpus":0,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.c4.xlarge","hosts":["algo-1"],"instance_groups":[{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.c4.xlarge"}],"network_interface_name":"eth0"},"user_entry_point":"mnist.py"}
SM_USER_ARGS=["--backend","gloo","--dropout","0.2","--epochs","2","--hidden_channels","5","--kernel_size","5","--optimizer","sgd"]
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
SM_CHANNEL_TRAINING=/opt/ml/input/data/training
SM_HP_BACKEND=gloo
SM_HP_DROPOUT=0.2
SM_HP_EPOCHS=2
SM_HP_HIDDEN_CHANNELS=5
SM_HP_KERNEL_SIZE=5
SM_HP_OPTIMIZER=sgd
PYTHONPATH=/usr/local/bin:/usr/lib/python36.zip:/usr/lib/python3.6:/usr/lib/python3.6/lib-dynload:/usr/local/lib/python3.6/dist-packages:/usr/lib/python3/dist-packages
Invoking script with the following command:
/usr/bin/python -m mnist --backend gloo --dropout 0.2 --epochs 2 --hidden_channels 5 --kernel_size 5 --optimizer sgd
Distributed training - False
Number of gpus available - 0
Get train data loader
Get test data loader
Processes 60000/60000 (100%) of train data
Processes 10000/10000 (100%) of test data

2022-08-10 22:03:26 Training - Training image download completed. Training in progress.Train Epoch: 1 [6400/60000 (11%)], Train Loss: 1.698200;
Train Epoch: 1 [12800/60000 (21%)], Train Loss: 0.994831;
Train Epoch: 1 [19200/60000 (32%)], Train Loss: 0.611539;
Train Epoch: 1 [25600/60000 (43%)], Train Loss: 0.648925;
Train Epoch: 1 [32000/60000 (53%)], Train Loss: 0.571486;
Train Epoch: 1 [38400/60000 (64%)], Train Loss: 0.791933;
Train Epoch: 1 [44800/60000 (75%)], Train Loss: 0.438099;
Train Epoch: 1 [51200/60000 (85%)], Train Loss: 0.549112;
Train Epoch: 1 [57600/60000 (96%)], Train Loss: 0.480673;
Test Average loss: 0.1914, Test Accuracy: 94%;
Train Epoch: 2 [6400/60000 (11%)], Train Loss: 0.297167;
Train Epoch: 2 [12800/60000 (21%)], Train Loss: 0.364907;
Train Epoch: 2 [19200/60000 (32%)], Train Loss: 0.268553;
Train Epoch: 2 [25600/60000 (43%)], Train Loss: 0.272427;
Train Epoch: 2 [32000/60000 (53%)], Train Loss: 0.382764;
Train Epoch: 2 [38400/60000 (64%)], Train Loss: 0.482188;
Train Epoch: 2 [44800/60000 (75%)], Train Loss: 0.203590;
Train Epoch: 2 [51200/60000 (85%)], Train Loss: 0.445356;
Train Epoch: 2 [57600/60000 (96%)], Train Loss: 0.197844;
Test Average loss: 0.1157, Test Accuracy: 96%;
Saving the model.
2022-08-10 22:04:10,665 sagemaker-containers INFO     Reporting training SUCCESS

2022-08-10 22:04:31 Uploading - Uploading generated training model
2022-08-10 22:04:31 Completed - Training job completed
Training seconds: 169
Billable seconds: 169
INFO:sagemaker.image_uris:Defaulting to the only supported framework/algorithm version: latest.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: cnn-training-job-1660169089
2022-08-10 22:04:49 Starting - Starting the training job...
2022-08-10 22:05:06 Starting - Preparing the instances for trainingProfilerReport-1660169089: InProgress
.........
2022-08-10 22:06:32 Downloading - Downloading input data......
2022-08-10 22:07:48 Training - Training image download completed. Training in progress..bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2022-08-10 22:07:52,347 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2022-08-10 22:07:52,351 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2022-08-10 22:07:52,367 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2022-08-10 22:07:52,368 sagemaker_pytorch_container.training INFO     Invoking user training script.
2022-08-10 22:07:52,871 sagemaker-containers INFO     Module mnist does not provide a setup.py. 
Generating setup.py
2022-08-10 22:07:52,872 sagemaker-containers INFO     Generating setup.cfg
2022-08-10 22:07:52,872 sagemaker-containers INFO     Generating MANIFEST.in
2022-08-10 22:07:52,872 sagemaker-containers INFO     Installing module with the following command:
/usr/bin/python -m pip install . 
Processing /opt/ml/code
Building wheels for collected packages: mnist
  Running setup.py bdist_wheel for mnist: started
  Running setup.py bdist_wheel for mnist: finished with status 'done'
  Stored in directory: /tmp/pip-ephem-wheel-cache-ju5sp2sc/wheels/35/24/16/37574d11bf9bde50616c67372a334f94fa8356bc7164af8ca3
Successfully built mnist
Installing collected packages: mnist
Successfully installed mnist-1.0.0
You are using pip version 18.1, however version 21.3.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
2022-08-10 22:07:55,254 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2022-08-10 22:07:55,269 sagemaker-containers INFO     Invoking user script
Training Env:
{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "training": "/opt/ml/input/data/training"
    },
    "current_host": "algo-1",
    "framework_module": "sagemaker_pytorch_container.training:main",
    "hosts": [
        "algo-1"
    ],
    "hyperparameters": {
        "backend": "gloo",
        "dropout": 0.2,
        "epochs": 2,
        "hidden_channels": 10,
        "kernel_size": 5,
        "optimizer": "sgd"
    },
    "input_config_dir": "/opt/ml/input/config",
    "input_data_config": {
        "training": {
            "TrainingInputMode": "File",
            "S3DistributionType": "FullyReplicated",
            "RecordWrapperType": "None"
        }
    },
    "input_dir": "/opt/ml/input",
    "is_master": true,
    "job_name": "cnn-training-job-1660169089",
    "log_level": 20,
    "master_hostname": "algo-1",
    "model_dir": "/opt/ml/model",
    "module_dir": "s3://sagemaker-us-west-2-000000000000/cnn-training-job-1660169089/source/sourcedir.tar.gz",
    "module_name": "mnist",
    "network_interface_name": "eth0",
    "num_cpus": 4,
    "num_gpus": 0,
    "output_data_dir": "/opt/ml/output/data",
    "output_dir": "/opt/ml/output",
    "output_intermediate_dir": "/opt/ml/output/intermediate",
    "resource_config": {
        "current_host": "algo-1",
        "current_instance_type": "ml.c4.xlarge",
        "current_group_name": "homogeneousCluster",
        "hosts": [
            "algo-1"
        ],
        "instance_groups": [
            {
                "instance_group_name": "homogeneousCluster",
                "instance_type": "ml.c4.xlarge",
                "hosts": [
                    "algo-1"
                ]
            }
        ],
        "network_interface_name": "eth0"
    },
    "user_entry_point": "mnist.py"
}
Environment variables:
SM_HOSTS=["algo-1"]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={"backend":"gloo","dropout":0.2,"epochs":2,"hidden_channels":10,"kernel_size":5,"optimizer":"sgd"}
SM_USER_ENTRY_POINT=mnist.py
SM_FRAMEWORK_PARAMS={}
SM_RESOURCE_CONFIG={"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.c4.xlarge","hosts":["algo-1"],"instance_groups":[{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.c4.xlarge"}],"network_interface_name":"eth0"}
SM_INPUT_DATA_CONFIG={"training":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=["training"]
SM_CURRENT_HOST=algo-1
SM_MODULE_NAME=mnist
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=sagemaker_pytorch_container.training:main
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=4
SM_NUM_GPUS=0
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://sagemaker-us-west-2-000000000000/cnn-training-job-1660169089/source/sourcedir.tar.gz
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{"training":"/opt/ml/input/data/training"},"current_host":"algo-1","framework_module":"sagemaker_pytorch_container.training:main","hosts":["algo-1"],"hyperparameters":{"backend":"gloo","dropout":0.2,"epochs":2,"hidden_channels":10,"kernel_size":5,"optimizer":"sgd"},"input_config_dir":"/opt/ml/input/config","input_data_config":{"training":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}},"input_dir":"/opt/ml/input","is_master":true,"job_name":"cnn-training-job-1660169089","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-us-west-2-000000000000/cnn-training-job-1660169089/source/sourcedir.tar.gz","module_name":"mnist","network_interface_name":"eth0","num_cpus":4,"num_gpus":0,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.c4.xlarge","hosts":["algo-1"],"instance_groups":[{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.c4.xlarge"}],"network_interface_name":"eth0"},"user_entry_point":"mnist.py"}
SM_USER_ARGS=["--backend","gloo","--dropout","0.2","--epochs","2","--hidden_channels","10","--kernel_size","5","--optimizer","sgd"]
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
SM_CHANNEL_TRAINING=/opt/ml/input/data/training
SM_HP_BACKEND=gloo
SM_HP_DROPOUT=0.2
SM_HP_EPOCHS=2
SM_HP_HIDDEN_CHANNELS=10
SM_HP_KERNEL_SIZE=5
SM_HP_OPTIMIZER=sgd
PYTHONPATH=/usr/local/bin:/usr/lib/python36.zip:/usr/lib/python3.6:/usr/lib/python3.6/lib-dynload:/usr/local/lib/python3.6/dist-packages:/usr/lib/python3/dist-packages
Invoking script with the following command:
/usr/bin/python -m mnist --backend gloo --dropout 0.2 --epochs 2 --hidden_channels 10 --kernel_size 5 --optimizer sgd
Distributed training - False
Number of gpus available - 0
Get train data loader
Get test data loader
Processes 60000/60000 (100%) of train data
Processes 10000/10000 (100%) of test data
Train Epoch: 1 [6400/60000 (11%)], Train Loss: 1.695285;
Train Epoch: 1 [12800/60000 (21%)], Train Loss: 0.928432;
Train Epoch: 1 [19200/60000 (32%)], Train Loss: 0.702160;
Train Epoch: 1 [25600/60000 (43%)], Train Loss: 0.442871;
Train Epoch: 1 [32000/60000 (53%)], Train Loss: 0.413667;
Train Epoch: 1 [38400/60000 (64%)], Train Loss: 0.501132;
Train Epoch: 1 [44800/60000 (75%)], Train Loss: 0.383585;
Train Epoch: 1 [51200/60000 (85%)], Train Loss: 0.328490;
Train Epoch: 1 [57600/60000 (96%)], Train Loss: 0.396089;
Test Average loss: 0.1679, Test Accuracy: 95%;
Train Epoch: 2 [6400/60000 (11%)], Train Loss: 0.603119;
Train Epoch: 2 [12800/60000 (21%)], Train Loss: 0.229334;
Train Epoch: 2 [19200/60000 (32%)], Train Loss: 0.281790;
Train Epoch: 2 [25600/60000 (43%)], Train Loss: 0.376957;
Train Epoch: 2 [32000/60000 (53%)], Train Loss: 0.412861;
Train Epoch: 2 [38400/60000 (64%)], Train Loss: 0.200810;
Train Epoch: 2 [44800/60000 (75%)], Train Loss: 0.233049;
Train Epoch: 2 [51200/60000 (85%)], Train Loss: 0.319483;
Train Epoch: 2 [57600/60000 (96%)], Train Loss: 0.210030;
Test Average loss: 0.1060, Test Accuracy: 97%;
Saving the model.
2022-08-10 22:09:01,629 sagemaker-containers INFO     Reporting training SUCCESS

2022-08-10 22:09:19 Uploading - Uploading generated training model
2022-08-10 22:09:19 Completed - Training job completed
Training seconds: 172
Billable seconds: 172
INFO:sagemaker.image_uris:Defaulting to the only supported framework/algorithm version: latest.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: cnn-training-job-1660169377
2022-08-10 22:09:37 Starting - Starting the training job...
2022-08-10 22:10:02 Starting - Preparing the instances for trainingProfilerReport-1660169377: InProgress
.........
2022-08-10 22:11:22 Downloading - Downloading input data........bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2022-08-10 22:12:43,815 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2022-08-10 22:12:43,818 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2022-08-10 22:12:43,838 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2022-08-10 22:12:43,839 sagemaker_pytorch_container.training INFO     Invoking user training script.
2022-08-10 22:12:44,428 sagemaker-containers INFO     Module mnist does not provide a setup.py. 
Generating setup.py
2022-08-10 22:12:44,429 sagemaker-containers INFO     Generating setup.cfg
2022-08-10 22:12:44,429 sagemaker-containers INFO     Generating MANIFEST.in
2022-08-10 22:12:44,429 sagemaker-containers INFO     Installing module with the following command:
/usr/bin/python -m pip install . 
Processing /opt/ml/code
Building wheels for collected packages: mnist
  Running setup.py bdist_wheel for mnist: started
  Running setup.py bdist_wheel for mnist: finished with status 'done'
  Stored in directory: /tmp/pip-ephem-wheel-cache-1cgxrjqy/wheels/35/24/16/37574d11bf9bde50616c67372a334f94fa8356bc7164af8ca3
Successfully built mnist
Installing collected packages: mnist
Successfully installed mnist-1.0.0
You are using pip version 18.1, however version 21.3.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
2022-08-10 22:12:46,794 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2022-08-10 22:12:46,809 sagemaker-containers INFO     Invoking user script
Training Env:
{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "training": "/opt/ml/input/data/training"
    },
    "current_host": "algo-1",
    "framework_module": "sagemaker_pytorch_container.training:main",
    "hosts": [
        "algo-1"
    ],
    "hyperparameters": {
        "backend": "gloo",
        "dropout": 0.2,
        "epochs": 2,
        "hidden_channels": 20,
        "kernel_size": 5,
        "optimizer": "sgd"
    },
    "input_config_dir": "/opt/ml/input/config",
    "input_data_config": {
        "training": {
            "TrainingInputMode": "File",
            "S3DistributionType": "FullyReplicated",
            "RecordWrapperType": "None"
        }
    },
    "input_dir": "/opt/ml/input",
    "is_master": true,
    "job_name": "cnn-training-job-1660169377",
    "log_level": 20,
    "master_hostname": "algo-1",
    "model_dir": "/opt/ml/model",
    "module_dir": "s3://sagemaker-us-west-2-000000000000/cnn-training-job-1660169377/source/sourcedir.tar.gz",
    "module_name": "mnist",
    "network_interface_name": "eth0",
    "num_cpus": 4,
    "num_gpus": 0,
    "output_data_dir": "/opt/ml/output/data",
    "output_dir": "/opt/ml/output",
    "output_intermediate_dir": "/opt/ml/output/intermediate",
    "resource_config": {
        "current_host": "algo-1",
        "current_instance_type": "ml.c4.xlarge",
        "current_group_name": "homogeneousCluster",
        "hosts": [
            "algo-1"
        ],
        "instance_groups": [
            {
                "instance_group_name": "homogeneousCluster",
                "instance_type": "ml.c4.xlarge",
                "hosts": [
                    "algo-1"
                ]
            }
        ],
        "network_interface_name": "eth0"
    },
    "user_entry_point": "mnist.py"
}
Environment variables:
SM_HOSTS=["algo-1"]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={"backend":"gloo","dropout":0.2,"epochs":2,"hidden_channels":20,"kernel_size":5,"optimizer":"sgd"}
SM_USER_ENTRY_POINT=mnist.py
SM_FRAMEWORK_PARAMS={}
SM_RESOURCE_CONFIG={"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.c4.xlarge","hosts":["algo-1"],"instance_groups":[{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.c4.xlarge"}],"network_interface_name":"eth0"}
SM_INPUT_DATA_CONFIG={"training":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=["training"]
SM_CURRENT_HOST=algo-1
SM_MODULE_NAME=mnist
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=sagemaker_pytorch_container.training:main
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=4
SM_NUM_GPUS=0
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://sagemaker-us-west-2-000000000000/cnn-training-job-1660169377/source/sourcedir.tar.gz
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{"training":"/opt/ml/input/data/training"},"current_host":"algo-1","framework_module":"sagemaker_pytorch_container.training:main","hosts":["algo-1"],"hyperparameters":{"backend":"gloo","dropout":0.2,"epochs":2,"hidden_channels":20,"kernel_size":5,"optimizer":"sgd"},"input_config_dir":"/opt/ml/input/config","input_data_config":{"training":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}},"input_dir":"/opt/ml/input","is_master":true,"job_name":"cnn-training-job-1660169377","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-us-west-2-000000000000/cnn-training-job-1660169377/source/sourcedir.tar.gz","module_name":"mnist","network_interface_name":"eth0","num_cpus":4,"num_gpus":0,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.c4.xlarge","hosts":["algo-1"],"instance_groups":[{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.c4.xlarge"}],"network_interface_name":"eth0"},"user_entry_point":"mnist.py"}
SM_USER_ARGS=["--backend","gloo","--dropout","0.2","--epochs","2","--hidden_channels","20","--kernel_size","5","--optimizer","sgd"]
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
SM_CHANNEL_TRAINING=/opt/ml/input/data/training
SM_HP_BACKEND=gloo
SM_HP_DROPOUT=0.2
SM_HP_EPOCHS=2
SM_HP_HIDDEN_CHANNELS=20
SM_HP_KERNEL_SIZE=5
SM_HP_OPTIMIZER=sgd
PYTHONPATH=/usr/local/bin:/usr/lib/python36.zip:/usr/lib/python3.6:/usr/lib/python3.6/lib-dynload:/usr/local/lib/python3.6/dist-packages:/usr/lib/python3/dist-packages
Invoking script with the following command:
/usr/bin/python -m mnist --backend gloo --dropout 0.2 --epochs 2 --hidden_channels 20 --kernel_size 5 --optimizer sgd
Distributed training - False
Number of gpus available - 0
Get train data loader
Get test data loader
Processes 60000/60000 (100%) of train data
Processes 10000/10000 (100%) of test data
Train Epoch: 1 [6400/60000 (11%)], Train Loss: 1.621211;
Train Epoch: 1 [12800/60000 (21%)], Train Loss: 1.013725;

2022-08-10 22:13:02 Training - Training image download completed. Training in progress.Train Epoch: 1 [19200/60000 (32%)], Train Loss: 0.799500;
Train Epoch: 1 [25600/60000 (43%)], Train Loss: 0.595724;
Train Epoch: 1 [32000/60000 (53%)], Train Loss: 0.411506;
Train Epoch: 1 [38400/60000 (64%)], Train Loss: 0.176120;
Train Epoch: 1 [44800/60000 (75%)], Train Loss: 0.317887;
Train Epoch: 1 [51200/60000 (85%)], Train Loss: 0.203233;
Train Epoch: 1 [57600/60000 (96%)], Train Loss: 0.264020;
Test Average loss: 0.1407, Test Accuracy: 96%;
Train Epoch: 2 [6400/60000 (11%)], Train Loss: 0.384039;
Train Epoch: 2 [12800/60000 (21%)], Train Loss: 0.200869;
Train Epoch: 2 [19200/60000 (32%)], Train Loss: 0.399078;
Train Epoch: 2 [25600/60000 (43%)], Train Loss: 0.451251;
Train Epoch: 2 [32000/60000 (53%)], Train Loss: 0.263621;
Train Epoch: 2 [38400/60000 (64%)], Train Loss: 0.151691;
Train Epoch: 2 [44800/60000 (75%)], Train Loss: 0.375438;

2022-08-10 22:14:10 Uploading - Uploading generated training modelTrain Epoch: 2 [51200/60000 (85%)], Train Loss: 0.332934;
Train Epoch: 2 [57600/60000 (96%)], Train Loss: 0.227200;
Test Average loss: 0.0965, Test Accuracy: 97%;
Saving the model.
2022-08-10 22:14:05,477 sagemaker-containers INFO     Reporting training SUCCESS

2022-08-10 22:14:23 Completed - Training job completed
Training seconds: 182
Billable seconds: 182
INFO:sagemaker.image_uris:Defaulting to the only supported framework/algorithm version: latest.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: cnn-training-job-1660169695
2022-08-10 22:14:56 Starting - Starting the training job...ProfilerReport-1660169695: InProgress
...
2022-08-10 22:15:50 Starting - Preparing the instances for training......
2022-08-10 22:16:50 Downloading - Downloading input data......
2022-08-10 22:17:50 Training - Training image download completed. Training in progress..bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2022-08-10 22:17:50,169 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2022-08-10 22:17:50,172 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2022-08-10 22:17:50,189 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2022-08-10 22:17:50,191 sagemaker_pytorch_container.training INFO     Invoking user training script.
2022-08-10 22:17:50,553 sagemaker-containers INFO     Module mnist does not provide a setup.py. 
Generating setup.py
2022-08-10 22:17:50,553 sagemaker-containers INFO     Generating setup.cfg
2022-08-10 22:17:50,553 sagemaker-containers INFO     Generating MANIFEST.in
2022-08-10 22:17:50,554 sagemaker-containers INFO     Installing module with the following command:
/usr/bin/python -m pip install . 
Processing /opt/ml/code
Building wheels for collected packages: mnist
  Running setup.py bdist_wheel for mnist: started
  Running setup.py bdist_wheel for mnist: finished with status 'done'
  Stored in directory: /tmp/pip-ephem-wheel-cache-qbwv6yqc/wheels/35/24/16/37574d11bf9bde50616c67372a334f94fa8356bc7164af8ca3
Successfully built mnist
Installing collected packages: mnist
Successfully installed mnist-1.0.0
You are using pip version 18.1, however version 21.3.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
2022-08-10 22:17:52,885 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2022-08-10 22:17:52,907 sagemaker-containers INFO     Invoking user script
Training Env:
{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "training": "/opt/ml/input/data/training"
    },
    "current_host": "algo-1",
    "framework_module": "sagemaker_pytorch_container.training:main",
    "hosts": [
        "algo-1"
    ],
    "hyperparameters": {
        "backend": "gloo",
        "dropout": 0.2,
        "epochs": 2,
        "hidden_channels": 32,
        "kernel_size": 5,
        "optimizer": "sgd"
    },
    "input_config_dir": "/opt/ml/input/config",
    "input_data_config": {
        "training": {
            "TrainingInputMode": "File",
            "S3DistributionType": "FullyReplicated",
            "RecordWrapperType": "None"
        }
    },
    "input_dir": "/opt/ml/input",
    "is_master": true,
    "job_name": "cnn-training-job-1660169695",
    "log_level": 20,
    "master_hostname": "algo-1",
    "model_dir": "/opt/ml/model",
    "module_dir": "s3://sagemaker-us-west-2-000000000000/cnn-training-job-1660169695/source/sourcedir.tar.gz",
    "module_name": "mnist",
    "network_interface_name": "eth0",
    "num_cpus": 4,
    "num_gpus": 0,
    "output_data_dir": "/opt/ml/output/data",
    "output_dir": "/opt/ml/output",
    "output_intermediate_dir": "/opt/ml/output/intermediate",
    "resource_config": {
        "current_host": "algo-1",
        "current_instance_type": "ml.c4.xlarge",
        "current_group_name": "homogeneousCluster",
        "hosts": [
            "algo-1"
        ],
        "instance_groups": [
            {
                "instance_group_name": "homogeneousCluster",
                "instance_type": "ml.c4.xlarge",
                "hosts": [
                    "algo-1"
                ]
            }
        ],
        "network_interface_name": "eth0"
    },
    "user_entry_point": "mnist.py"
}
Environment variables:
SM_HOSTS=["algo-1"]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={"backend":"gloo","dropout":0.2,"epochs":2,"hidden_channels":32,"kernel_size":5,"optimizer":"sgd"}
SM_USER_ENTRY_POINT=mnist.py
SM_FRAMEWORK_PARAMS={}
SM_RESOURCE_CONFIG={"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.c4.xlarge","hosts":["algo-1"],"instance_groups":[{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.c4.xlarge"}],"network_interface_name":"eth0"}
SM_INPUT_DATA_CONFIG={"training":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=["training"]
SM_CURRENT_HOST=algo-1
SM_MODULE_NAME=mnist
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=sagemaker_pytorch_container.training:main
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=4
SM_NUM_GPUS=0
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://sagemaker-us-west-2-000000000000/cnn-training-job-1660169695/source/sourcedir.tar.gz
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{"training":"/opt/ml/input/data/training"},"current_host":"algo-1","framework_module":"sagemaker_pytorch_container.training:main","hosts":["algo-1"],"hyperparameters":{"backend":"gloo","dropout":0.2,"epochs":2,"hidden_channels":32,"kernel_size":5,"optimizer":"sgd"},"input_config_dir":"/opt/ml/input/config","input_data_config":{"training":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}},"input_dir":"/opt/ml/input","is_master":true,"job_name":"cnn-training-job-1660169695","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-us-west-2-000000000000/cnn-training-job-1660169695/source/sourcedir.tar.gz","module_name":"mnist","network_interface_name":"eth0","num_cpus":4,"num_gpus":0,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.c4.xlarge","hosts":["algo-1"],"instance_groups":[{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.c4.xlarge"}],"network_interface_name":"eth0"},"user_entry_point":"mnist.py"}
SM_USER_ARGS=["--backend","gloo","--dropout","0.2","--epochs","2","--hidden_channels","32","--kernel_size","5","--optimizer","sgd"]
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
SM_CHANNEL_TRAINING=/opt/ml/input/data/training
SM_HP_BACKEND=gloo
SM_HP_DROPOUT=0.2
SM_HP_EPOCHS=2
SM_HP_HIDDEN_CHANNELS=32
SM_HP_KERNEL_SIZE=5
SM_HP_OPTIMIZER=sgd
PYTHONPATH=/usr/local/bin:/usr/lib/python36.zip:/usr/lib/python3.6:/usr/lib/python3.6/lib-dynload:/usr/local/lib/python3.6/dist-packages:/usr/lib/python3/dist-packages
Invoking script with the following command:
/usr/bin/python -m mnist --backend gloo --dropout 0.2 --epochs 2 --hidden_channels 32 --kernel_size 5 --optimizer sgd
Distributed training - False
Number of gpus available - 0
Get train data loader
Get test data loader
Processes 60000/60000 (100%) of train data
Processes 10000/10000 (100%) of test data
Train Epoch: 1 [6400/60000 (11%)], Train Loss: 1.461214;
Train Epoch: 1 [12800/60000 (21%)], Train Loss: 0.827845;
Train Epoch: 1 [19200/60000 (32%)], Train Loss: 0.747446;
Train Epoch: 1 [25600/60000 (43%)], Train Loss: 0.516501;
Train Epoch: 1 [32000/60000 (53%)], Train Loss: 0.465859;
Train Epoch: 1 [38400/60000 (64%)], Train Loss: 0.380938;
Train Epoch: 1 [44800/60000 (75%)], Train Loss: 0.451455;
Train Epoch: 1 [51200/60000 (85%)], Train Loss: 0.202041;
Train Epoch: 1 [57600/60000 (96%)], Train Loss: 0.379481;
Test Average loss: 0.1531, Test Accuracy: 95%;
Train Epoch: 2 [6400/60000 (11%)], Train Loss: 0.183234;
Train Epoch: 2 [12800/60000 (21%)], Train Loss: 0.523121;
Train Epoch: 2 [19200/60000 (32%)], Train Loss: 0.199367;
Train Epoch: 2 [25600/60000 (43%)], Train Loss: 0.200822;
Train Epoch: 2 [32000/60000 (53%)], Train Loss: 0.377310;
Train Epoch: 2 [38400/60000 (64%)], Train Loss: 0.227790;
Train Epoch: 2 [44800/60000 (75%)], Train Loss: 0.642007;
Train Epoch: 2 [51200/60000 (85%)], Train Loss: 0.365126;
Train Epoch: 2 [57600/60000 (96%)], Train Loss: 0.101494;
Test Average loss: 0.1012, Test Accuracy: 97%;
Saving the model.
2022-08-10 22:19:25,169 sagemaker-containers INFO     Reporting training SUCCESS

2022-08-10 22:19:51 Uploading - Uploading generated training model
2022-08-10 22:19:51 Completed - Training job completed
ProfilerReport-1660169695: NoIssuesFound
Training seconds: 183
Billable seconds: 183

Compare the model training runs for an experiment

Now we use the analytics capabilities of the Experiments SDK to query and compare the training runs and identify the best model produced by our experiment. You can retrieve trial components by using a search expression.

Some Simple Analyses

[16]:
search_expression = {
    "Filters": [
        {
            "Name": "DisplayName",
            "Operator": "Equals",
            "Value": "Training",
        }
    ],
}
[17]:
trial_component_analytics = ExperimentAnalytics(
    sagemaker_session=Session(sess, sm),
    experiment_name=mnist_experiment.experiment_name,
    search_expression=search_expression,
    sort_by="metrics.test:accuracy.max",
    sort_order="Descending",
    metric_names=["test:accuracy"],
    parameter_names=["hidden_channels", "epochs", "dropout", "optimizer"],
)
[18]:
trial_component_analytics.dataframe()
[18]:
TrialComponentName DisplayName SourceArn dropout epochs hidden_channels optimizer test:accuracy - Min test:accuracy - Max test:accuracy - Avg ... test:accuracy - Last test:accuracy - Count training - MediaType training - Value SageMaker.DebugHookOutput - MediaType SageMaker.DebugHookOutput - Value SageMaker.ModelArtifact - MediaType SageMaker.ModelArtifact - Value Trials Experiments
0 cnn-training-job-1660169695-aws-training-job Training arn:aws:sagemaker:us-west-2:000000000000:train... 0.2 2.0 32.0 "sgd" 95.0 97.0 96.0 ... 97.0 2 None s3://sagemaker-us-west-2-000000000000/DEMO-mnist None s3://sagemaker-us-west-2-000000000000/ None s3://sagemaker-us-west-2-000000000000/cnn-trai... [cnn-training-job-32-hidden-channels-1660169695] [mnist-hand-written-digits-classification-1660...
1 cnn-training-job-1660168481-aws-training-job Training arn:aws:sagemaker:us-west-2:000000000000:train... 0.2 2.0 2.0 "sgd" 95.0 97.0 96.0 ... 97.0 2 None s3://sagemaker-us-west-2-000000000000/DEMO-mnist None s3://sagemaker-us-west-2-000000000000/ None s3://sagemaker-us-west-2-000000000000/cnn-trai... [cnn-training-job-2-hidden-channels-1660168480] [mnist-hand-written-digits-classification-1660...
2 cnn-training-job-1660169089-aws-training-job Training arn:aws:sagemaker:us-west-2:000000000000:train... 0.2 2.0 10.0 "sgd" 95.0 97.0 96.0 ... 97.0 2 None s3://sagemaker-us-west-2-000000000000/DEMO-mnist None s3://sagemaker-us-west-2-000000000000/ None s3://sagemaker-us-west-2-000000000000/cnn-trai... [cnn-training-job-10-hidden-channels-1660169089] [mnist-hand-written-digits-classification-1660...
3 cnn-training-job-1660169377-aws-training-job Training arn:aws:sagemaker:us-west-2:000000000000:train... 0.2 2.0 20.0 "sgd" 96.0 97.0 96.5 ... 97.0 2 None s3://sagemaker-us-west-2-000000000000/DEMO-mnist None s3://sagemaker-us-west-2-000000000000/ None s3://sagemaker-us-west-2-000000000000/cnn-trai... [cnn-training-job-20-hidden-channels-1660169377] [mnist-hand-written-digits-classification-1660...
4 cnn-training-job-1660168801-aws-training-job Training arn:aws:sagemaker:us-west-2:000000000000:train... 0.2 2.0 5.0 "sgd" 94.0 96.0 95.0 ... 96.0 2 None s3://sagemaker-us-west-2-000000000000/DEMO-mnist None s3://sagemaker-us-west-2-000000000000/ None s3://sagemaker-us-west-2-000000000000/cnn-trai... [cnn-training-job-5-hidden-channels-1660168800] [mnist-hand-written-digits-classification-1660...

5 rows × 21 columns

To isolate and measure the impact of the number of hidden channels on model accuracy, we vary the number of hidden channels and fix the values of the other hyperparameters.
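
To see the effect at a glance, you can plot the aggregated accuracy metric against the hidden_channels parameter from the analytics dataframe above. This is a minimal sketch, not run in this notebook; it assumes matplotlib is available in the kernel and uses the column names shown in the output above.

[ ]:
# Hedged sketch: visualize max test accuracy vs. number of hidden channels,
# using the `trial_component_analytics` dataframe built earlier.
import matplotlib.pyplot as plt

df = trial_component_analytics.dataframe()
df = df.sort_values("hidden_channels")

plt.plot(df["hidden_channels"], df["test:accuracy - Max"], marker="o")
plt.xlabel("hidden_channels")
plt.ylabel("max test:accuracy (%)")
plt.title("Test accuracy vs. number of hidden channels")
plt.show()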

Next let’s look at an example of tracing the lineage of a model by accessing the data tracked by SageMaker Experiments for the cnn-training-job-2-hidden-channels trial.

[19]:
lineage_table = ExperimentAnalytics(
    sagemaker_session=Session(sess, sm),
    search_expression={
        "Filters": [
            {
                "Name": "Parents.TrialName",
                "Operator": "Equals",
                "Value": hidden_channel_trial_name_map[2],
            }
        ]
    },
    sort_by="CreationTime",
    sort_order="Ascending",
)
[20]:
lineage_table.dataframe()
[20]:
TrialComponentName DisplayName normalization_mean normalization_std mnist-dataset - MediaType mnist-dataset - Value Trials Experiments SourceArn SageMaker.ImageUri ... train:loss - Avg train:loss - StdDev train:loss - Last train:loss - Count training - MediaType training - Value SageMaker.DebugHookOutput - MediaType SageMaker.DebugHookOutput - Value SageMaker.ModelArtifact - MediaType SageMaker.ModelArtifact - Value
0 TrialComponent-2022-08-10-215425-svma Preprocessing 0.1307 0.3081 s3/uri s3://sagemaker-us-west-2-000000000000/DEMO-mnist [cnn-training-job-10-hidden-channels-166016908... [mnist-hand-written-digits-classification-1660... NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 cnn-training-job-1660168481-aws-training-job Training NaN NaN NaN NaN [cnn-training-job-2-hidden-channels-1660168480] [mnist-hand-written-digits-classification-1660... arn:aws:sagemaker:us-west-2:000000000000:train... 520713654638.dkr.ecr.us-west-2.amazonaws.com/s... ... 0.456703 0.352488 0.157259 18.0 NaN s3://sagemaker-us-west-2-000000000000/DEMO-mnist NaN s3://sagemaker-us-west-2-000000000000/ NaN s3://sagemaker-us-west-2-000000000000/cnn-trai...

2 rows × 48 columns
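
If you only need the preprocessing inputs that were tracked for this trial (for example, the normalization parameters logged by the tracker), you can select those columns directly from the lineage dataframe. A minimal sketch, using the column names shown in the output above:

[ ]:
# Hedged sketch: pull the tracked preprocessing parameters out of the
# lineage dataframe built earlier (the Training row shows NaN for these columns).
lineage_df = lineage_table.dataframe()
lineage_df[["TrialComponentName", "DisplayName", "normalization_mean", "normalization_std"]]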

Push the best training job's model to the model registry

Now we take the best model and push it to the model registry.

Step 1: Create a model package group.

[21]:
import time

model_package_group_name = "mnist-handwritten-digit-claissification" + str(round(time.time()))
model_package_group_input_dict = {
    "ModelPackageGroupName": model_package_group_name,
    "ModelPackageGroupDescription": "Sample model package group",
}

create_model_package_group_response = sm.create_model_package_group(
    **model_package_group_input_dict
)
model_package_arn = create_model_package_group_response["ModelPackageGroupArn"]

print(f"ModelPackageGroup Arn : {model_package_arn}")
ModelPackageGroup Arn : arn:aws:sagemaker:us-west-2:000000000000:model-package-group/mnist-handwritten-digit-claissification1660170021
[22]:
model_package_arn
[22]:
'arn:aws:sagemaker:us-west-2:000000000000:model-package-group/mnist-handwritten-digit-claissification1660170021'

Step 2: Get the best model training job from the SageMaker Experiments API.

[23]:
best_trial_component_name = trial_component_analytics.dataframe().iloc[0]["TrialComponentName"]
best_trial_component = TrialComponent.load(best_trial_component_name)
[24]:
best_trial_component.trial_component_name
[24]:
'cnn-training-job-1660169695-aws-training-job'

Step 3: Register the best model.

By default, the model is registered with the approval_status set to PendingManualApproval. You can then use the API to manually approve the model based on any criteria you set for model evaluation.

[25]:
# create model object
model_data = best_trial_component.output_artifacts["SageMaker.ModelArtifact"].value
env = {
    "hidden_channels": str(int(best_trial_component.parameters["hidden_channels"])),
    "dropout": str(best_trial_component.parameters["dropout"]),
    "kernel_size": str(int(best_trial_component.parameters["kernel_size"])),
}
model = PyTorchModel(
    model_data,
    role,
    "./mnist.py",
    py_version="py3",
    env=env,
    sagemaker_session=sagemaker.Session(sagemaker_client=sm),
    framework_version="1.1.0",
    name=best_trial_component.trial_component_name,
)
[26]:
model_package = model.register(
    content_types=["*"],
    response_types=["application/json"],
    inference_instances=["ml.m5.xlarge"],
    transform_instances=["ml.m5.xlarge"],
    description="MNIST image classification model",
    approval_status="PendingManualApproval",
    model_package_group_name=model_package_group_name,
)

Step 4: Verify that the model has been registered.

[27]:
sm.describe_model_package_group(ModelPackageGroupName=model_package_group_name)
[27]:
{'ModelPackageGroupName': 'mnist-handwritten-digit-claissification1660170021',
 'ModelPackageGroupArn': 'arn:aws:sagemaker:us-west-2:000000000000:model-package-group/mnist-handwritten-digit-claissification1660170021',
 'ModelPackageGroupDescription': 'Sample model package group',
 'CreationTime': datetime.datetime(2022, 8, 10, 22, 20, 20, 649000, tzinfo=tzlocal()),
 'ModelPackageGroupStatus': 'Completed',
 'ResponseMetadata': {'RequestId': '7005f604-0505-40f5-ba87-702f25c9a5fe',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '7005f604-0505-40f5-ba87-702f25c9a5fe',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '343',
   'date': 'Wed, 10 Aug 2022 22:20:27 GMT'},
  'RetryAttempts': 0}}
[28]:
## check model version
sm.list_model_packages(ModelPackageGroupName=model_package_group_name)
[28]:
{'ModelPackageSummaryList': [{'ModelPackageGroupName': 'mnist-handwritten-digit-claissification1660170021',
   'ModelPackageVersion': 1,
   'ModelPackageArn': 'arn:aws:sagemaker:us-west-2:000000000000:model-package/mnist-handwritten-digit-claissification1660170021/1',
   'ModelPackageDescription': 'MNIST image classification model',
   'CreationTime': datetime.datetime(2022, 8, 10, 22, 20, 25, 942000, tzinfo=tzlocal()),
   'ModelPackageStatus': 'Completed',
   'ModelApprovalStatus': 'PendingManualApproval'}],
 'ResponseMetadata': {'RequestId': '863ac307-c8c6-4251-a4fb-8200309913f6',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '863ac307-c8c6-4251-a4fb-8200309913f6',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '430',
   'date': 'Wed, 10 Aug 2022 22:20:28 GMT'},
  'RetryAttempts': 0}}
[29]:
model_package_arn = sm.list_model_packages(ModelPackageGroupName=model_package_group_name)[
    "ModelPackageSummaryList"
][0]["ModelPackageArn"]
[30]:
### Update the model status to approved
model_package_update_input_dict = {
    "ModelPackageArn": model_package_arn,
    "ModelApprovalStatus": "Approved",
}
model_package_update_response = sm.update_model_package(**model_package_update_input_dict)

Deploy an endpoint for the latest approved version of the model from the model registry

Now we take the best model and deploy it to an endpoint so it is available for inference.
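
The cells below reuse the model_package object we registered above. If instead you want to pull the latest approved version out of the registry (as the section title suggests), the following is a minimal sketch; it is not run here, and the deploy call is commented out to avoid creating a second endpoint.

[ ]:
# Hedged sketch: look up the newest approved model package in the group and
# build a deployable ModelPackage object from its ARN.
from sagemaker.model import ModelPackage

approved = sm.list_model_packages(
    ModelPackageGroupName=model_package_group_name,
    ModelApprovalStatus="Approved",
    SortBy="CreationTime",
    SortOrder="Descending",
)["ModelPackageSummaryList"]

latest_approved_arn = approved[0]["ModelPackageArn"]

registry_model = ModelPackage(
    role=role,
    model_package_arn=latest_approved_arn,
    sagemaker_session=sagemaker.Session(sagemaker_client=sm),
)
# registry_model.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")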

[31]:
from datetime import datetime

now = datetime.now()
time = now.strftime("%m-%d-%Y-%H-%M-%S")
print("time:", time)
endpoint_name = f"cnn-mnist-{time}"
endpoint_name
time: 08-10-2022-22-20-30
[31]:
'cnn-mnist-08-10-2022-22-20-30'
[32]:
model_package.deploy(
    initial_instance_count=1, instance_type="ml.m5.xlarge", endpoint_name=endpoint_name
)
INFO:sagemaker:Creating model with name: 1-2022-08-10-22-20-31-635
INFO:sagemaker:Creating endpoint-config with name cnn-mnist-08-10-2022-22-20-30
INFO:sagemaker:Creating endpoint with name cnn-mnist-08-10-2022-22-20-30
----!
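
Once the endpoint is in service, you can send it a request. This is a hedged sketch, not run in this notebook; it assumes the default PyTorch serving handlers used by mnist.py accept NumPy input (application/x-npy) and can return JSON, and it uses a random tensor as a placeholder for a real, normalized MNIST image.

[ ]:
# Hedged sketch: invoke the deployed endpoint with a (1, 1, 28, 28) float32 tensor.
import numpy as np
from sagemaker.predictor import Predictor
from sagemaker.serializers import NumpySerializer
from sagemaker.deserializers import JSONDeserializer

predictor = Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sagemaker.Session(sagemaker_client=sm),
    serializer=NumpySerializer(),
    deserializer=JSONDeserializer(),
)

sample = np.random.rand(1, 1, 28, 28).astype(np.float32)  # placeholder input
scores = np.array(predictor.predict(sample))
print("Predicted digit:", scores.argmax(axis=1))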

Cleanup

Once we’re done, clean up the endpoint to prevent unnecessary billing.

[33]:
sagemaker_client = boto3.client("sagemaker", region_name=region)
# Delete endpoint
sagemaker_client.delete_endpoint(EndpointName=endpoint_name)
[33]:
{'ResponseMetadata': {'RequestId': '1c522170-66e3-4c3d-8b95-e2a16880ad2d',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '1c522170-66e3-4c3d-8b95-e2a16880ad2d',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '0',
   'date': 'Wed, 10 Aug 2022 22:22:33 GMT'},
  'RetryAttempts': 0}}
[34]:
sagemaker_client.delete_endpoint_config(EndpointConfigName=endpoint_name)
[34]:
{'ResponseMetadata': {'RequestId': '9b9c4a53-70f8-4804-9913-e0d464249e29',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '9b9c4a53-70f8-4804-9913-e0d464249e29',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '0',
   'date': 'Wed, 10 Aug 2022 22:22:34 GMT'},
  'RetryAttempts': 0}}

Trial components can exist independently of trials and experiments. You might want to keep them if you plan on further exploration. If not, delete all the experiment artifacts.

[35]:
mnist_experiment.delete_all(action="--force")
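
Optionally, you can also remove the model registry resources created in this notebook. This is a hedged sketch, not run above; note that a model package group can only be deleted after all of its model package versions have been deleted.

[ ]:
# Hedged sketch: delete every version in the group, then the group itself.
for summary in sm.list_model_packages(ModelPackageGroupName=model_package_group_name)[
    "ModelPackageSummaryList"
]:
    sm.delete_model_package(ModelPackageName=summary["ModelPackageArn"])

sm.delete_model_package_group(ModelPackageGroupName=model_package_group_name)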

Contact

Submit any questions or issues to https://github.com/aws/sagemaker-experiments/issues or mention @aws/sagemakerexperimentsadmin