Train an MNIST model with PyTorch




MNIST is a widely used dataset for handwritten digit classification. It consists of 70,000 labeled 28x28 pixel grayscale images of handwritten digits. The dataset is split into 60,000 training images and 10,000 test images. There are 10 classes (one for each of the 10 digits). This tutorial shows how to train and test an MNIST model on SageMaker using PyTorch.

Runtime

This notebook takes approximately 5 minutes to run.

Contents

  1. PyTorch Estimator

  2. Implement the entry point for training

  3. Set hyperparameters

  4. Set up channels for the training and testing data

  5. Run the training script on SageMaker

  6. Inspect and store model data

  7. Test and debug the entry point before executing the training container

  8. Conclusion

[2]:
import os
import json

import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker import get_execution_role


sess = sagemaker.Session()

role = get_execution_role()

output_path = "s3://" + sess.default_bucket() + "/DEMO-mnist"

PyTorch Estimator

The PyTorch class allows you to run your training script on SageMaker infrastructure in a containerized environment. In this notebook, we refer to this container as the training container.

You need to configure it with the following parameters to set up the environment:

  • entry_point: A user-defined Python file used by the training container as the instructions for training. We further discuss this file in the next subsection.

  • role: An IAM role to make AWS service requests

  • instance_type: The type of SageMaker instance to run your training script. Set it to local if you want to run the training job on the SageMaker instance you are using to run this notebook

  • instance_count: The number of instances to run your training job on. Multiple instances are needed for distributed training.

  • output_path: S3 bucket URI to save training output (model artifacts and output files)

  • framework_version: The version of PyTorch to use

  • py_version: The Python version to use

For more information, see the EstimatorBase API reference

Implement the entry point for training

The entry point for training is a Python script that provides all the code for training a PyTorch model. It is used by the SageMaker PyTorch Estimator (PyTorch class above) as the entry point for running the training job.

Under the hood, the SageMaker PyTorch Estimator creates a Docker image with the runtime environment specified by the parameters you pass when constructing the estimator, and it injects the training script into the image as the entry point of the container.

In the rest of the notebook, we use training image to refer to the docker image specified by the PyTorch Estimator and training container to refer to the container that runs the training image.

This means your training script is very similar to one you might run outside Amazon SageMaker, but it can access the environment variables provided by the training image. See the complete list of environment variables for a description of each variable your training script can access.
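
As a minimal sketch of this idea, a script can read the container's environment variables with fallbacks, so that the same code also runs outside SageMaker. The variable names are the ones read by code/train.py; the `env_or_default` and `read_container_env` helpers and the fallback values are hypothetical and are not part of the training script.

```python
import json
import os


def env_or_default(name, default):
    """Return the environment variable `name`, or `default` when it is unset."""
    return os.environ.get(name, default)


def read_container_env():
    """Collect the SageMaker-provided settings used by a training script.

    The fallbacks are illustrative local values, not SageMaker defaults.
    """
    return {
        # SM_HOSTS is a JSON-encoded list such as '["algo-1"]'
        "hosts": json.loads(env_or_default("SM_HOSTS", '["localhost"]')),
        "current_host": env_or_default("SM_CURRENT_HOST", "localhost"),
        "model_dir": env_or_default("SM_MODEL_DIR", "/tmp/model"),
        "train_dir": env_or_default("SM_CHANNEL_TRAINING", "/tmp/data"),
        "test_dir": env_or_default("SM_CHANNEL_TESTING", "/tmp/data"),
        # environment values are strings, so convert the GPU count explicitly
        "num_gpus": int(env_or_default("SM_NUM_GPUS", "0")),
    }


if __name__ == "__main__":
    print(read_container_env())
```

Unlike the `os.environ[...]` lookups in code/train.py, this variant does not raise a `KeyError` when a variable is missing, which can be convenient during local development.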

In this example, we use the training script code/train.py as the entry point for our PyTorch Estimator.

[3]:
!pygmentize 'code/train.py'
import argparse
import gzip
import json
import logging
import os
import sys

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset

logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler(sys.stdout))

# Based on https://github.com/pytorch/examples/blob/master/mnist/main.py
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)


# Decode binary data from SM_CHANNEL_TRAINING
# Decode and preprocess data
# Create map dataset


def normalize(x, axis):
    eps = np.finfo(float).eps
    mean = np.mean(x, axis=axis, keepdims=True)
    # avoid division by zero
    std = np.std(x, axis=axis, keepdims=True) + eps
    return (x - mean) / std


def convert_to_tensor(data_dir, images_file, labels_file):
    """Byte string to torch tensor"""
    with gzip.open(os.path.join(data_dir, images_file), "rb") as f:
        images = np.frombuffer(f.read(), np.uint8, offset=16).reshape(-1, 28, 28).astype(np.float32)

    with gzip.open(os.path.join(data_dir, labels_file), "rb") as f:
        labels = np.frombuffer(f.read(), np.uint8, offset=8).astype(np.int64)

    # normalize the images
    images = normalize(images, axis=(1, 2))

    # add channel dimension (depth-major)
    images = np.expand_dims(images, axis=1)

    # to torch tensor
    images = torch.tensor(images, dtype=torch.float32)
    labels = torch.tensor(labels, dtype=torch.int64)
    return images, labels


class MNIST(Dataset):
    def __init__(self, data_dir, train=True):

        if train:
            images_file = "train-images-idx3-ubyte.gz"
            labels_file = "train-labels-idx1-ubyte.gz"
        else:
            images_file = "t10k-images-idx3-ubyte.gz"
            labels_file = "t10k-labels-idx1-ubyte.gz"

        self.images, self.labels = convert_to_tensor(data_dir, images_file, labels_file)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.images[idx], self.labels[idx]


def train(args):
    use_cuda = args.num_gpus > 0
    device = torch.device("cuda" if use_cuda else "cpu")

    torch.manual_seed(args.seed)
    if use_cuda:
        torch.cuda.manual_seed(args.seed)

    train_loader = DataLoader(
        MNIST(args.train, train=True), batch_size=args.batch_size, shuffle=True
    )
    test_loader = DataLoader(
        MNIST(args.test, train=False), batch_size=args.test_batch_size, shuffle=False
    )

    net = Net().to(device)
    loss_fn = nn.CrossEntropyLoss()
    optimizer = optim.Adam(
        net.parameters(),
        lr=args.learning_rate,  # pass the --learning-rate hyperparameter to the optimizer
        betas=(args.beta_1, args.beta_2),
        weight_decay=args.weight_decay,
    )

    logger.info("Start training ...")
    for epoch in range(1, args.epochs + 1):
        net.train()
        for batch_idx, (imgs, labels) in enumerate(train_loader, 1):
            imgs, labels = imgs.to(device), labels.to(device)
            output = net(imgs)
            loss = loss_fn(output, labels)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            if batch_idx % args.log_interval == 0:
                print(
                    "Train Epoch: {} [{}/{} ({:.0f}%)] Loss: {:.6f}".format(
                        epoch,
                        batch_idx * len(imgs),
                        len(train_loader.sampler),
                        100.0 * batch_idx / len(train_loader),
                        loss.item(),
                    )
                )

        # test the model
        test(net, test_loader, device)

    # save model checkpoint
    save_model(net, args.model_dir)
    return


def test(model, test_loader, device):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for imgs, labels in test_loader:
            imgs, labels = imgs.to(device), labels.to(device)
            output = model(imgs)
            test_loss += F.cross_entropy(output, labels, reduction="sum").item()

            pred = output.max(1, keepdim=True)[1]
            correct += pred.eq(labels.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)
    logger.info(
        "Test set: Average loss: {:.4f}, Accuracy: {}/{}, {})\n".format(
            test_loss, correct, len(test_loader.dataset), 100.0 * correct / len(test_loader.dataset)
        )
    )
    return


def save_model(model, model_dir):
    logger.info("Saving the model")
    path = os.path.join(model_dir, "model.pth")
    torch.save(model.cpu().state_dict(), path)
    return


def parse_args():
    parser = argparse.ArgumentParser()

    # Hyperparameters
    parser.add_argument(
        "--batch-size",
        type=int,
        default=64,
        metavar="N",
        help="input batch size for training (default: 64)",
    )
    parser.add_argument(
        "--test-batch-size",
        type=int,
        default=1000,
        metavar="N",
        help="input batch size for testing (default: 1000)",
    )
    parser.add_argument(
        "--epochs", type=int, default=1, metavar="N", help="number of epochs to train (default: 1)"
    )
    parser.add_argument(
        "--learning-rate",
        type=float,
        default=0.001,
        metavar="LR",
        help="learning rate (default: 0.001)",
    )
    parser.add_argument(
        "--beta_1", type=float, default=0.9, metavar="BETA1", help="beta1 (default: 0.9)"
    )
    parser.add_argument(
        "--beta_2", type=float, default=0.999, metavar="BETA2", help="beta2 (default: 0.999)"
    )
    parser.add_argument(
        "--weight-decay",
        type=float,
        default=1e-4,
        metavar="WD",
        help="L2 weight decay (default: 1e-4)",
    )
    parser.add_argument("--seed", type=int, default=1, metavar="S", help="random seed (default: 1)")
    parser.add_argument(
        "--log-interval",
        type=int,
        default=100,
        metavar="N",
        help="how many batches to wait before logging training status",
    )
    parser.add_argument(
        "--backend",
        type=str,
        default=None,
        help="backend for distributed training (tcp, gloo on cpu and gloo, nccl on gpu)",
    )

    # Container environment
    parser.add_argument("--hosts", type=list, default=json.loads(os.environ["SM_HOSTS"]))
    parser.add_argument("--current-host", type=str, default=os.environ["SM_CURRENT_HOST"])
    parser.add_argument("--model-dir", type=str, default=os.environ["SM_MODEL_DIR"])
    parser.add_argument("--train", type=str, default=os.environ["SM_CHANNEL_TRAINING"])
    parser.add_argument("--test", type=str, default=os.environ["SM_CHANNEL_TESTING"])
    parser.add_argument("--num-gpus", type=int, default=os.environ["SM_NUM_GPUS"])

    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    train(args)
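
The `offset=16` and `offset=8` arguments in `convert_to_tensor` skip the headers of the IDX files. A minimal sketch of reading such a header with the standard library follows. It assumes the usual MNIST IDX layout (a 4-byte magic number followed by one big-endian 32-bit size per dimension); `read_idx_header` is a hypothetical helper that is not part of code/train.py.

```python
import struct


def read_idx_header(raw):
    """Parse an IDX header from `raw` bytes.

    Returns (dims, data_offset), where dims is a tuple of dimension sizes and
    data_offset is the index of the first data byte.
    """
    # The magic number is two zero bytes, a data-type code, and the number of dimensions.
    zero, dtype_code, ndim = struct.unpack(">HBB", raw[:4])
    if zero != 0:
        raise ValueError("not an IDX file")
    if dtype_code != 0x08:
        raise ValueError("expected unsigned-byte data (type code 0x08)")
    # Each dimension size is a big-endian unsigned 32-bit integer.
    dims = struct.unpack(">" + "I" * ndim, raw[4 : 4 + 4 * ndim])
    return dims, 4 + 4 * ndim


if __name__ == "__main__":
    # Synthetic headers shaped like the MNIST test files (no real data needed).
    images_header = struct.pack(">HBBIII", 0, 0x08, 3, 10000, 28, 28)
    labels_header = struct.pack(">HBBI", 0, 0x08, 1, 10000)
    print(read_idx_header(images_header))  # ((10000, 28, 28), 16)
    print(read_idx_header(labels_header))  # ((10000,), 8)
```

The image files have three dimensions (count, rows, columns), giving a 16-byte header, while the label files have one dimension, giving an 8-byte header.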

Set hyperparameters

The PyTorch estimator also lets you pass command-line arguments to your training script via hyperparameters.

Note: local mode is not supported in SageMaker Studio.

[4]:
# Set local_mode to True to run the training script on the machine that runs this notebook

local_mode = False

if local_mode:
    instance_type = "local"
else:
    instance_type = "ml.c4.xlarge"

est = PyTorch(
    entry_point="train.py",
    source_dir="code",  # directory of your training script
    role=role,
    framework_version="1.5.0",
    py_version="py3",
    instance_type=instance_type,
    instance_count=1,
    volume_size=250,
    output_path=output_path,
    hyperparameters={"batch-size": 128, "epochs": 1, "learning-rate": 1e-3, "log-interval": 100},
)

The training container executes your training script like:

python train.py --batch-size 128 --epochs 1 --learning-rate 1e-3 --log-interval 100
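
The mapping from the `hyperparameters` dictionary to command-line flags can be sketched as follows. This illustrates the behavior visible in the training log (`SM_USER_ARGS`) and is not the SageMaker implementation; `hyperparameters_to_argv` is a hypothetical helper.

```python
import argparse


def hyperparameters_to_argv(hyperparameters):
    """Flatten {"name": value} pairs into ["--name", "value", ...]."""
    argv = []
    for name, value in hyperparameters.items():
        argv.extend(["--" + name, str(value)])
    return argv


if __name__ == "__main__":
    hps = {"batch-size": 128, "epochs": 1, "learning-rate": 1e-3, "log-interval": 100}
    argv = hyperparameters_to_argv(hps)
    print(argv)

    # The training script then recovers typed values with argparse.
    parser = argparse.ArgumentParser()
    parser.add_argument("--batch-size", type=int, default=64)
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--learning-rate", type=float, default=0.001)
    parser.add_argument("--log-interval", type=int, default=100)
    args = parser.parse_args(argv)
    print(args.batch_size, args.learning_rate)
```

Every value reaches the script as a string, which is why each `add_argument` call in code/train.py declares a `type`.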

Set up channels for the training and testing data

Tell the PyTorch estimator where to find the training and testing data. It can be a path to an S3 bucket, or a path in your local file system if you use local mode. In this example, we download the MNIST data from a public S3 bucket and upload it to your default bucket.

[5]:
import logging
import boto3
from botocore.exceptions import ClientError

# Download training and testing data from a public S3 bucket


def download_from_s3(data_dir="./data", train=True):
    """Download the gzipped MNIST image and label files from a public S3 bucket

    Args:
        data_dir (str): directory to save the data
        train (bool): download training set

    Returns:
        None
    """

    if not os.path.exists(data_dir):
        os.makedirs(data_dir)

    if train:
        images_file = "train-images-idx3-ubyte.gz"
        labels_file = "train-labels-idx1-ubyte.gz"
    else:
        images_file = "t10k-images-idx3-ubyte.gz"
        labels_file = "t10k-labels-idx1-ubyte.gz"

    # download objects
    s3 = boto3.client("s3")
    bucket = "sagemaker-sample-files"
    for obj in [images_file, labels_file]:
        key = os.path.join("datasets/image/MNIST", obj)
        dest = os.path.join(data_dir, obj)
        if not os.path.exists(dest):
            s3.download_file(bucket, key, dest)
    return


download_from_s3("./data", True)
download_from_s3("./data", False)
[6]:
# Upload to the default bucket

prefix = "DEMO-mnist"
bucket = sess.default_bucket()
loc = sess.upload_data(path="./data", bucket=bucket, key_prefix=prefix)

channels = {"training": loc, "testing": loc}

The keys of the channels dictionary are passed to the training image, which creates one environment variable per key, named SM_CHANNEL_<KEY NAME> (the key in upper case).

In this example, SM_CHANNEL_TRAINING and SM_CHANNEL_TESTING are created in the training image (see how code/train.py accesses these variables). For more information, see: SM_CHANNEL_{channel_name}.

If you want, you can create a channel for validation:

channels = {
    'training': train_data_loc,
    'validation': val_data_loc,
    'testing': test_data_loc
}

You can then access this channel within your training script via SM_CHANNEL_VALIDATION.
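
The naming convention can be sketched as below. Judging by the training log in this notebook, the channel key is upper-cased to form the variable name and the data is placed under /opt/ml/input/data/<channel>. `channel_env` is a hypothetical helper written for illustration, not a SageMaker API.

```python
def channel_env(channel_names, base_dir="/opt/ml/input/data"):
    """Map channel names to the SM_CHANNEL_* variables a training script would see."""
    return {
        "SM_CHANNEL_" + name.upper(): base_dir + "/" + name
        for name in channel_names
    }


if __name__ == "__main__":
    for var, path in channel_env(["training", "validation", "testing"]).items():
        print(var + "=" + path)
```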

Run the training script on SageMaker

Now, the training container has everything to execute your training script. Start the container by calling the fit() method.

[7]:
est.fit(inputs=channels)
2022-04-20 00:14:09 Starting - Starting the training job...
2022-04-20 00:14:36 Starting - Preparing the instances for trainingProfilerReport-1650413649: InProgress
.........
2022-04-20 00:16:04 Downloading - Downloading input data......
2022-04-20 00:17:04 Training - Downloading the training image...
2022-04-20 00:17:24 Training - Training image download completed. Training in progress.bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2022-04-20 00:17:23,149 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2022-04-20 00:17:23,167 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2022-04-20 00:17:23,180 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2022-04-20 00:17:23,187 sagemaker_pytorch_container.training INFO     Invoking user training script.
2022-04-20 00:17:23,570 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2022-04-20 00:17:23,588 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2022-04-20 00:17:23,603 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2022-04-20 00:17:23,617 sagemaker-training-toolkit INFO     Invoking user script
Training Env:
{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "testing": "/opt/ml/input/data/testing",
        "training": "/opt/ml/input/data/training"
    },
    "current_host": "algo-1",
    "framework_module": "sagemaker_pytorch_container.training:main",
    "hosts": [
        "algo-1"
    ],
    "hyperparameters": {
        "batch-size": 128,
        "epochs": 1,
        "learning-rate": 0.001,
        "log-interval": 100
    },
    "input_config_dir": "/opt/ml/input/config",
    "input_data_config": {
        "testing": {
            "TrainingInputMode": "File",
            "S3DistributionType": "FullyReplicated",
            "RecordWrapperType": "None"
        },
        "training": {
            "TrainingInputMode": "File",
            "S3DistributionType": "FullyReplicated",
            "RecordWrapperType": "None"
        }
    },
    "input_dir": "/opt/ml/input",
    "is_master": true,
    "job_name": "pytorch-training-2022-04-20-00-14-09-077",
    "log_level": 20,
    "master_hostname": "algo-1",
    "model_dir": "/opt/ml/model",
    "module_dir": "s3://sagemaker-us-west-2-000000000000/pytorch-training-2022-04-20-00-14-09-077/source/sourcedir.tar.gz",
    "module_name": "train",
    "network_interface_name": "eth0",
    "num_cpus": 4,
    "num_gpus": 0,
    "output_data_dir": "/opt/ml/output/data",
    "output_dir": "/opt/ml/output",
    "output_intermediate_dir": "/opt/ml/output/intermediate",
    "resource_config": {
        "current_host": "algo-1",
        "current_instance_type": "ml.c4.xlarge",
        "current_group_name": "homogeneousCluster",
        "hosts": [
            "algo-1"
        ],
        "instance_groups": [
            {
                "instance_group_name": "homogeneousCluster",
                "instance_type": "ml.c4.xlarge",
                "hosts": [
                    "algo-1"
                ]
            }
        ],
        "network_interface_name": "eth0"
    },
    "user_entry_point": "train.py"
}
Environment variables:
SM_HOSTS=["algo-1"]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={"batch-size":128,"epochs":1,"learning-rate":0.001,"log-interval":100}
SM_USER_ENTRY_POINT=train.py
SM_FRAMEWORK_PARAMS={}
SM_RESOURCE_CONFIG={"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.c4.xlarge","hosts":["algo-1"],"instance_groups":[{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.c4.xlarge"}],"network_interface_name":"eth0"}
SM_INPUT_DATA_CONFIG={"testing":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"},"training":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=["testing","training"]
SM_CURRENT_HOST=algo-1
SM_MODULE_NAME=train
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=sagemaker_pytorch_container.training:main
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=4
SM_NUM_GPUS=0
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://sagemaker-us-west-2-000000000000/pytorch-training-2022-04-20-00-14-09-077/source/sourcedir.tar.gz
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{"testing":"/opt/ml/input/data/testing","training":"/opt/ml/input/data/training"},"current_host":"algo-1","framework_module":"sagemaker_pytorch_container.training:main","hosts":["algo-1"],"hyperparameters":{"batch-size":128,"epochs":1,"learning-rate":0.001,"log-interval":100},"input_config_dir":"/opt/ml/input/config","input_data_config":{"testing":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"},"training":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}},"input_dir":"/opt/ml/input","is_master":true,"job_name":"pytorch-training-2022-04-20-00-14-09-077","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-us-west-2-000000000000/pytorch-training-2022-04-20-00-14-09-077/source/sourcedir.tar.gz","module_name":"train","network_interface_name":"eth0","num_cpus":4,"num_gpus":0,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.c4.xlarge","hosts":["algo-1"],"instance_groups":[{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.c4.xlarge"}],"network_interface_name":"eth0"},"user_entry_point":"train.py"}
SM_USER_ARGS=["--batch-size","128","--epochs","1","--learning-rate","0.001","--log-interval","100"]
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
SM_CHANNEL_TESTING=/opt/ml/input/data/testing
SM_CHANNEL_TRAINING=/opt/ml/input/data/training
SM_HP_BATCH-SIZE=128
SM_HP_EPOCHS=1
SM_HP_LEARNING-RATE=0.001
SM_HP_LOG-INTERVAL=100
PYTHONPATH=/opt/ml/code:/opt/conda/bin:/opt/conda/lib/python36.zip:/opt/conda/lib/python3.6:/opt/conda/lib/python3.6/lib-dynload:/opt/conda/lib/python3.6/site-packages
Invoking script with the following command:
/opt/conda/bin/python3.6 train.py --batch-size 128 --epochs 1 --learning-rate 0.001 --log-interval 100
Start training ...
[2022-04-20 00:17:26.738 algo-1:27 INFO json_config.py:90] Creating hook from json_config at /opt/ml/input/config/debughookconfig.json.
[2022-04-20 00:17:26.739 algo-1:27 INFO hook.py:192] tensorboard_dir has not been set for the hook. SMDebug will not be exporting tensorboard summaries.
[2022-04-20 00:17:26.739 algo-1:27 INFO hook.py:237] Saving to /opt/ml/output/tensors
[2022-04-20 00:17:26.739 algo-1:27 INFO state_store.py:67] The checkpoint config file /opt/ml/input/config/checkpointconfig.json does not exist.
[2022-04-20 00:17:26.739 algo-1:27 INFO hook.py:382] Monitoring the collections: losses
[2022-04-20 00:17:26.740 algo-1:27 INFO hook.py:443] Hook is writing from the hook with pid: 27
Train Epoch: 1 [12800/60000 (21%)] Loss: 0.571117
Train Epoch: 1 [25600/60000 (43%)] Loss: 0.435707
Train Epoch: 1 [38400/60000 (64%)] Loss: 0.278377
Train Epoch: 1 [51200/60000 (85%)] Loss: 0.247071
Test set: Average loss: 0.1151, Accuracy: 9642/10000, 96.42)
Saving the model
INFO:__main__:Test set: Average loss: 0.1151, Accuracy: 9642/10000, 96.42)
INFO:__main__:Saving the model
2022-04-20 00:17:43,442 sagemaker-training-toolkit INFO     Reporting training SUCCESS

2022-04-20 00:18:10 Uploading - Uploading generated training model
2022-04-20 00:18:24 Completed - Training job completed
ProfilerReport-1650413649: NoIssuesFound
Training seconds: 136
Billable seconds: 136
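
Log lines such as the test summary above can be turned into numeric metrics with regular expressions. The sketch below extracts the test loss and accuracy from log text. The metric names and patterns are assumptions written for the format string in code/train.py. Name and regex pairs of this shape are also what the estimator's `metric_definitions` parameter accepts, although this notebook does not use it.

```python
import re

# Hypothetical metric names; the regexes target the log format of code/train.py.
METRIC_DEFINITIONS = [
    {"Name": "test:loss", "Regex": r"Average loss: ([0-9.]+)"},
    {"Name": "test:accuracy", "Regex": r"Accuracy: [0-9]+/[0-9]+, ([0-9.]+)"},
]


def extract_metrics(log_text):
    """Return {metric name: last matched value as float} for the given log text."""
    metrics = {}
    for definition in METRIC_DEFINITIONS:
        matches = re.findall(definition["Regex"], log_text)
        if matches:
            metrics[definition["Name"]] = float(matches[-1])
    return metrics


if __name__ == "__main__":
    log = "Test set: Average loss: 0.1151, Accuracy: 9642/10000, 96.42)"
    print(extract_metrics(log))  # {'test:loss': 0.1151, 'test:accuracy': 96.42}
```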

Inspect and store model data

Now, the training is finished, and the model artifact has been saved in the output_path.

[8]:
pt_mnist_model_data = est.model_data
print("Model artifact saved at:\n", pt_mnist_model_data)
Model artifact saved at:
 s3://sagemaker-us-west-2-000000000000/DEMO-mnist/pytorch-training-2022-04-20-00-14-09-077/output/model.tar.gz
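
To fetch the artifact programmatically, the S3 URI first has to be split into a bucket and a key. A minimal sketch with the standard library follows; `split_s3_uri` is a hypothetical helper, and the commented-out boto3 call shows one way the result might be used.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URI: " + uri)
    return parsed.netloc, parsed.path.lstrip("/")


if __name__ == "__main__":
    uri = (
        "s3://sagemaker-us-west-2-000000000000/DEMO-mnist/"
        "pytorch-training-2022-04-20-00-14-09-077/output/model.tar.gz"
    )
    bucket, key = split_s3_uri(uri)
    print(bucket)
    print(key)
    # With AWS credentials configured, the artifact could then be downloaded:
    # boto3.client("s3").download_file(bucket, key, "model.tar.gz")
```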

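We store the variable pt_mnist_model_data with the %store magic, which persists it outside the current kernel so that another notebook can restore it with %store -r pt_mnist_model_data.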

[9]:
%store pt_mnist_model_data
Stored 'pt_mnist_model_data' (str)

Test and debug the entry point before executing the training container

The entry point code/train.py is executed inside the training container. When you develop your own training script, it is good practice to simulate the container environment in a local shell and test the script there before sending it to SageMaker, because debugging inside a container is cumbersome. The following script shows how you can test your training script:

[10]:
!pygmentize code/test_train.py
import json
import os
import sys

import boto3
from train import parse_args, train

dirname = os.path.dirname(os.path.abspath(__file__))

with open(os.path.join(dirname, "config.json"), "r") as f:
    CONFIG = json.load(f)


def download_from_s3(data_dir="/tmp/data", train=True):
    """Download the gzipped MNIST image and label files from S3

    Args:
        data_dir (str): directory to save the data
        train (bool): download training set

    Returns:
        None
    """

    if not os.path.exists(data_dir):
        os.makedirs(data_dir)

    if train:
        images_file = "train-images-idx3-ubyte.gz"
        labels_file = "train-labels-idx1-ubyte.gz"
    else:
        images_file = "t10k-images-idx3-ubyte.gz"
        labels_file = "t10k-labels-idx1-ubyte.gz"

    # download objects
    s3 = boto3.client("s3")
    bucket = CONFIG["public_bucket"]
    for obj in [images_file, labels_file]:
        key = os.path.join("datasets/image/MNIST", obj)
        dest = os.path.join(data_dir, obj)
        if not os.path.exists(dest):
            s3.download_file(bucket, key, dest)
    return


class Env:
    def __init__(self):
        # simulate container env
        os.environ["SM_MODEL_DIR"] = "/tmp/model"
        os.environ["SM_CHANNEL_TRAINING"] = "/tmp/data"
        os.environ["SM_CHANNEL_TESTING"] = "/tmp/data"
        os.environ["SM_HOSTS"] = '["algo-1"]'
        os.environ["SM_CURRENT_HOST"] = "algo-1"
        os.environ["SM_NUM_GPUS"] = "0"


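if __name__ == "__main__":
    Env()
    # fetch both splits into the simulated channel directories and make sure
    # the model directory exists before training writes to it
    download_from_s3(os.environ["SM_CHANNEL_TRAINING"], train=True)
    download_from_s3(os.environ["SM_CHANNEL_TESTING"], train=False)
    os.makedirs(os.environ["SM_MODEL_DIR"], exist_ok=True)
    args = parse_args()
    train(args)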
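
One drawback of the `Env` class above is that it changes `os.environ` for the rest of the process. As an alternative sketch (not part of code/test_train.py), a context manager can set the simulated container variables and restore the previous values afterwards. `simulated_container_env` is a hypothetical name.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def simulated_container_env(model_dir, data_dir):
    """Temporarily set the SM_* variables that code/train.py reads."""
    overrides = {
        "SM_MODEL_DIR": model_dir,
        "SM_CHANNEL_TRAINING": data_dir,
        "SM_CHANNEL_TESTING": data_dir,
        "SM_HOSTS": '["algo-1"]',
        "SM_CURRENT_HOST": "algo-1",
        "SM_NUM_GPUS": "0",
    }
    saved = {name: os.environ.get(name) for name in overrides}
    os.environ.update(overrides)
    try:
        yield overrides
    finally:
        # Restore the previous state, removing variables that were unset before.
        for name, old_value in saved.items():
            if old_value is None:
                os.environ.pop(name, None)
            else:
                os.environ[name] = old_value


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        with simulated_container_env(model_dir=tmp, data_dir=tmp):
            # parse_args() and train(args) from train.py would be called here
            print(os.environ["SM_MODEL_DIR"] == tmp)
```

Because the variables are restored on exit, several test cases with different settings can run in the same process without affecting each other.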

Conclusion

In this notebook, we trained a PyTorch model on the MNIST dataset by fitting a SageMaker estimator. For next steps on how to deploy the trained model and perform inference, see Deploy a Trained PyTorch Model.
