Training a TensorFlow Model on MNIST

MNIST is a widely used dataset for handwritten digit classification. It consists of 70,000 labeled 28x28 pixel grayscale images of handwritten digits, split into 60,000 training images and 10,000 test images, with 10 classes (one for each digit). This tutorial shows how to train a TensorFlow V2 model on MNIST using SageMaker.

[ ]:
import os
import json

import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker import get_execution_role

sess = sagemaker.Session()

role = get_execution_role()

output_path = "s3://" + sess.default_bucket() + "/tensorflow/mnist"

TensorFlow Estimator

The TensorFlow class allows you to run your training script on SageMaker infrastructure in a containerized environment. In this notebook, we refer to this container as the training container.

You need to configure it with the following parameters to set up the environment:

  • entry_point: A user-defined Python file that the training container uses as the instructions for training. We discuss this file further in the next subsection.

  • role: An IAM role used to make AWS service requests.

  • instance_type: The type of SageMaker instance to run your training script on. Set it to local if you want to run the training job on the SageMaker instance you are using to run this notebook.

  • model_dir: S3 bucket URI where the checkpoint data and models can be exported to during training (default: None). To disable having model_dir passed to your training script, set model_dir=False.

  • instance_count: The number of instances you need to run your training job. Multiple instances are needed for distributed training.

  • output_path: S3 bucket URI to save training output (model artifacts and output files).

  • framework_version: The version of TensorFlow you need to use.

  • py_version: The Python version you need to use.

For more information, see the API reference.

Implement the entry point for training

The entry point for training is a Python script that provides all the code for training a TensorFlow model. It is used by the SageMaker TensorFlow Estimator (the TensorFlow class above) as the entry point for running the training job.

Under the hood, the SageMaker TensorFlow Estimator downloads a Docker image with the runtime environment specified by the parameters you used to initialize the estimator class, and it injects your training script into that image to serve as the entry point for running the container.

In the rest of the notebook, we use training image to refer to the Docker image specified by the TensorFlow Estimator and training container to refer to the container that runs the training image.

This means your training script is very similar to a training script you might run outside Amazon SageMaker, but it can access the useful environment variables provided by the training image. Check out the short list of environment variables provided by the SageMaker service for some common environment variables you might use, and the complete list of environment variables for a full description of all the environment variables your training script can access.
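As a rough illustration of how these environment variables are typically used (code/train.py, shown below, is the authoritative version for this tutorial), an entry point might start like this:

# Minimal sketch of an entry point script; not the actual code/train.py.
# SM_MODEL_DIR and SM_CHANNEL_TRAINING are set by the training container;
# the fallback paths below are the container defaults.
import os

model_dir = os.environ.get("SM_MODEL_DIR", "/opt/ml/model")
train_dir = os.environ.get("SM_CHANNEL_TRAINING", "/opt/ml/input/data/training")

# ... load the data from train_dir, build and train the model, then
# save the trained model under model_dir so SageMaker uploads it to output_path.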

In this example, we use the training script code/train.py as the entry point for our TensorFlow Estimator.

[ ]:
!pygmentize 'code/train.py'

Set hyperparameters

In addition, the TensorFlow estimator allows you to pass command-line arguments to your training script via hyperparameters.

Note: local mode is not supported in SageMaker Studio.

[ ]:
# set local_mode to be True if you want to run the training script
# on the machine that runs this notebook

local_mode = False

if local_mode:
    instance_type = "local"
else:
    instance_type = "ml.c4.xlarge"

est = TensorFlow(
    entry_point="train.py",
    source_dir="code",  # directory of your training script
    role=role,
    framework_version="2.3.1",
    model_dir=False,  # don't pass --model_dir to your training script
    py_version="py37",
    instance_type=instance_type,
    instance_count=1,
    output_path=output_path,
    hyperparameters={
        "batch-size": 512,
        "epochs": 1,
        "learning-rate": 1e-3,
        "beta_1": 0.9,
        "beta_2": 0.999,
    },
)

The training container executes your training script like this:

python train.py --batch-size 512 --epochs 1 --learning-rate 0.001
    --beta_1 0.9 --beta_2 0.999
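Inside the training script, these flags are typically consumed with Python's argparse. The sketch below shows one way this might look; the argument names must match the keys passed to hyperparameters, but the exact parsing code in code/train.py may differ:

# Sketch of how a training script can consume the hyperparameters above.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--batch-size", type=int, default=512)
parser.add_argument("--epochs", type=int, default=1)
parser.add_argument("--learning-rate", type=float, default=1e-3)
parser.add_argument("--beta_1", type=float, default=0.9)
parser.add_argument("--beta_2", type=float, default=0.999)
args, _ = parser.parse_known_args()
# args.batch_size, args.epochs, args.learning_rate, etc. then configure training.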

Set up channels for training and testing data

You need to tell the TensorFlow estimator where to find your training and testing data. This can be an S3 bucket URI, or a path on your local file system if you use local mode. In this example, we download the MNIST data from a public S3 bucket and upload it to your default bucket.

[ ]:
import logging
import boto3
from botocore.exceptions import ClientError

# Download training and testing data from a public S3 bucket


def download_from_s3(data_dir="/tmp/data", train=True):
    """Download MNIST dataset and convert it to numpy array

    Args:
        data_dir (str): directory to save the data
        train (bool): download training set

    Returns:
        None
    """

    if not os.path.exists(data_dir):
        os.makedirs(data_dir)

    if train:
        images_file = "train-images-idx3-ubyte.gz"
        labels_file = "train-labels-idx1-ubyte.gz"
    else:
        images_file = "t10k-images-idx3-ubyte.gz"
        labels_file = "t10k-labels-idx1-ubyte.gz"

    with open("code/config.json", "r") as f:
        config = json.load(f)

    # download objects
    s3 = boto3.client("s3")
    bucket = config["public_bucket"]
    for obj in [images_file, labels_file]:
        key = os.path.join("datasets/image/MNIST", obj)
        dest = os.path.join(data_dir, obj)
        if not os.path.exists(dest):
            s3.download_file(bucket, key, dest)
    return


download_from_s3("/tmp/data", True)
download_from_s3("/tmp/data", False)
[ ]:
# upload to the default bucket

prefix = "mnist"
bucket = sess.default_bucket()
loc = sess.upload_data(path="/tmp/data", bucket=bucket, key_prefix=prefix)

channels = {"training": loc, "testing": loc}

The keys of the dictionary channels are passed to the training image, and for each key the training image creates an environment variable named SM_CHANNEL_<key name>.

In this example, SM_CHANNEL_TRAINING and SM_CHANNEL_TESTING are created in the training image (check out how code/train.py accesses these variables). For more information, see SM_CHANNEL_{channel_name}.

If you want, you can create a channel for validation:

channels = {
    'training': train_data_loc,
    'validation': val_data_loc,
    'test': test_data_loc
    }

You can then access this channel within your training script via SM_CHANNEL_VALIDATION.
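For example, a training script launched with this channel configuration could read:

import os

# local directory where SageMaker copied the data from the 'validation' channel
val_dir = os.environ["SM_CHANNEL_VALIDATION"]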

Run the training script on SageMaker

Now the training container has everything it needs to execute your training script. You can start it by calling the fit method.

[ ]:
est.fit(inputs=channels)

Inspect and store model data

Now that the training is finished, the model artifact has been saved in the output_path. We can retrieve its S3 location from the estimator:

[ ]:
tf_mnist_model_data = est.model_data
print("Model artifact saved at:\n", tf_mnist_model_data)

We will store the variable tf_mnist_model_data so that it can be retrieved in other notebooks. In the next notebook, you will learn how to retrieve the model artifact and deploy it to a SageMaker endpoint.

[ ]:
%store tf_mnist_model_data
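%store persists the value with IPython's storemagic, so another notebook running on the same instance can retrieve it with the -r flag, for example:

%store -r tf_mnist_model_data
print(tf_mnist_model_data)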

Test and debug the entry point before executing the training container

The entry point code/train.py provided here has been tested and can be executed in the training container. When you develop your own training script, it is good practice to simulate the container environment in the local shell and test it before sending it to SageMaker, because debugging in a containerized environment is rather cumbersome. The following script shows how you can test your training script:

[ ]:
!pygmentize code/test_train.py
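As a rough sketch of the idea (the paths and values here are illustrative; code/test_train.py is the tested version), simulating the container environment mostly means setting the SM_* environment variables the container would normally provide and then invoking the training script against local paths:

# Illustrative only: run the entry point locally with container-style env vars.
import os
import subprocess

env = os.environ.copy()
env["SM_MODEL_DIR"] = "/tmp/model"        # where the script should save the model
env["SM_CHANNEL_TRAINING"] = "/tmp/data"  # local copy of the training data
env["SM_CHANNEL_TESTING"] = "/tmp/data"   # local copy of the test data

os.makedirs("/tmp/model", exist_ok=True)

subprocess.run(
    ["python", "code/train.py", "--batch-size", "512", "--epochs", "1",
     "--learning-rate", "0.001", "--beta_1", "0.9", "--beta_2", "0.999"],
    env=env,
    check=True,
)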

In the next notebook you will see how to deploy your trained model artifacts to a SageMaker endpoint.