Building your own TensorFlow container

With Amazon SageMaker, you can package your own algorithms that can then be trained and deployed in the SageMaker environment. This notebook guides you through an example using TensorFlow that shows you how to build a Docker container for SageMaker and use it for training.

By packaging an algorithm in a container, you can bring almost any code to the Amazon SageMaker environment, regardless of programming language, environment, framework, or dependencies.

  1. Building your own TensorFlow container

  2. When should I build my own algorithm container?

  3. Permissions

  4. The example

  5. The presentation

  6. An overview of Docker

    1. Running your container during training

    2. The input

    3. The output

  7. The parts of the sample container

  8. The Dockerfile

  9. Building and registering the container

  10. Set up the environment

  11. Create the session

  12. Download the CIFAR-10 dataset

  13. Upload the data for training

  14. Training on SageMaker

  15. Reference

Or, if you’re impatient: just let me see the code!

When should I build my own algorithm container?

You may not need to create a container to bring your own code to Amazon SageMaker. When you are using a framework such as Apache MXNet or TensorFlow that has direct support in SageMaker, you can simply supply the Python code that implements your algorithm using the SDK entry points for that framework. The set of supported frameworks is regularly expanded, so you should check the current list to determine whether your algorithm is written in one of these common machine learning environments.
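
For example, if your algorithm is a standard TensorFlow training script, you can often use the built-in TensorFlow estimator with an entry point instead of a custom container. The cell below is a minimal sketch assuming the SageMaker Python SDK v2; the entry point file name and the framework/Python versions are placeholders, not part of this example.

[ ]:
from sagemaker import get_execution_role
from sagemaker.tensorflow import TensorFlow

role = get_execution_role()

# Placeholder entry point and versions; replace with your own script and a
# framework version currently supported by SageMaker.
tf_estimator = TensorFlow(
    entry_point="my_training_script.py",
    role=role,
    instance_count=1,
    instance_type="ml.m4.xlarge",
    framework_version="2.11",
    py_version="py39",
)
# tf_estimator.fit("s3://my-bucket/my-training-data")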

Even if there is direct SDK support for your environment or framework, you may find it more effective to build your own container. If the code that implements your algorithm is quite complex, or you need special additions to the framework, building your own container may be the right choice.

Some reasons to build your own container for an already supported framework are:

  1. A specific version of the framework isn’t supported.

  2. You need to configure and install your own dependencies and environment.

  3. You want to use a different training solution than the one provided.

This walkthrough shows that building your own container is quite straightforward, so you can still use SageMaker even if your use case is not covered by the deep learning containers that we’ve built for you.

Permissions

Running this notebook requires permissions in addition to the normal SageMakerFullAccess permissions. This is because it creates new repositories on Amazon ECR. The easiest way to add these permissions is simply to add the managed policy AmazonEC2ContainerRegistryFullAccess to the role that you used to start your notebook instance. There’s no need to restart your notebook instance when you do this; the new permissions will be available immediately.
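
If you prefer to attach the policy programmatically rather than through the IAM console, the cell below is a sketch using boto3. The attach_role_policy call itself requires IAM permissions, so it is typically run from an administrator environment rather than from the notebook role; deriving the role name from the execution role ARN is an assumption that holds for the usual arn:aws:iam::<account>:role/<role-name> format.

[ ]:
import boto3
from sagemaker import get_execution_role

# Derive the role name from the notebook's execution role ARN.
role_name = get_execution_role().split("/")[-1]

iam = boto3.client("iam")
iam.attach_role_policy(
    RoleName=role_name,
    PolicyArn="arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryFullAccess",
)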

The example

In this example we show how to package a custom TensorFlow container with a Python example which works with the CIFAR-10 dataset.

In this example, we use a single image to support training. This simplifies the procedure because we only need to build and manage one image. Sometimes you may want separate images for different stages of your workflow because their requirements differ. In that case, split the parts discussed below into separate Dockerfiles and build multiple images. Choosing whether to use a single image or several is a matter of what is most convenient for you to develop and manage.

The presentation

This presentation shows you two things: building the container and using the container.

Part 1: Packaging and Uploading your Algorithm for use with Amazon SageMaker

An overview of Docker

If you’re familiar with Docker already, you can skip ahead to the next section.

For many data scientists, Docker containers are a new technology. But they are not difficult to use and can significantly simplify the deployment of your software packages.

Docker provides a simple way to package arbitrary code into an image that is totally self-contained. Once you have an image, you can use Docker to run a container based on that image. Running a container is just like running a program on the machine except that the container creates a fully self-contained environment for the program to run. Containers are isolated from each other and from the host environment, so the way your program is set up is the way it runs, no matter where you run it.

Docker is more powerful than environment managers like conda or virtualenv because (a) it is completely language independent and (b) it comprises your whole operating environment, including startup commands and environment variables.

A Docker container is like a virtual machine, but it is much lighter weight. For example, a program running in a container can start in less than a second and many containers can run simultaneously on the same physical or virtual machine instance.

Docker uses a simple file called a Dockerfile to specify how the image is assembled. An example is provided below. You can build your Docker images based on Docker images built by yourself or by others, which can simplify things quite a bit.

Docker has become very popular in programming and devops communities due to its flexibility and its well-defined specification of how code can be run in its containers. It is the underpinning of many services built in the past few years, such as Amazon ECS.

Amazon SageMaker uses Docker to allow users to train and deploy arbitrary algorithms.

In Amazon SageMaker, Docker containers are invoked in one way for training and in another, slightly different, way for hosting. The following sections outline how to build containers for the SageMaker environment.


Running your container during training

When Amazon SageMaker runs training, your train script is run just like a regular Python program. A number of files are laid out for your use under the /opt/ml directory:

/opt/ml
|-- input
|   |-- config
|   |   |-- hyperparameters.json
|   |   `-- resourceConfig.json
|   `-- data
|       `-- <channel_name>
|           `-- <input data>
|-- model
|   `-- <model files>
`-- output
    `-- failure

The input

  • /opt/ml/input/config contains information to control how your program runs. hyperparameters.json is a JSON-formatted dictionary of hyperparameter names to values. These values are always strings, so you may need to convert them. resourceConfig.json is a JSON-formatted file that describes the network layout used for distributed training.

  • /opt/ml/input/data/<channel_name>/ (for File mode) contains the input data for that channel. The channels are created based on the call to CreateTrainingJob, but it’s generally important that channels match algorithm expectations. The files for each channel are copied from S3 to this directory, preserving the tree structure indicated by the S3 key structure.

  • /opt/ml/input/data/<channel_name>_<epoch_number> (for Pipe mode) is the pipe for a given epoch. Epochs start at zero and go up by one each time you read them. There is no limit to the number of epochs that you can run, but you must close each pipe before reading the next epoch.

The output

  • /opt/ml/model/ is the directory where you write the model that your algorithm generates. Your model can be in any format that you want. It can be a single file or a whole directory tree. SageMaker packages any files in this directory into a compressed tar archive file. This file is made available at the S3 location returned in the DescribeTrainingJob result.

  • /opt/ml/output is a directory where the algorithm can write a file named failure that describes why the job failed. The contents of this file are returned in the FailureReason field of the DescribeTrainingJob result. For jobs that succeed, there is no reason to write this file as it is ignored.
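
To make these conventions concrete, here is a minimal sketch of what a training entry point might do with the /opt/ml layout. It is not the train script shipped with this example; the channel name training and the hyperparameter train-steps are assumptions used for illustration.

import json
import os
import sys
import traceback

prefix = "/opt/ml/"
param_path = os.path.join(prefix, "input/config/hyperparameters.json")
data_path = os.path.join(prefix, "input/data/training")  # assumes a channel named "training"
model_path = os.path.join(prefix, "model")
failure_path = os.path.join(prefix, "output/failure")


def train():
    # Hyperparameter values always arrive as strings, so convert them explicitly.
    with open(param_path) as f:
        hyperparameters = json.load(f)
    train_steps = int(hyperparameters.get("train-steps", "100"))

    # File mode: the channel directory holds the files copied down from S3.
    input_files = [os.path.join(data_path, name) for name in os.listdir(data_path)]

    # ... fit the model on input_files here ...

    # Anything written under /opt/ml/model is packaged into the model artifact.
    with open(os.path.join(model_path, "model-info.json"), "w") as f:
        json.dump({"train_steps": train_steps, "num_input_files": len(input_files)}, f)


if __name__ == "__main__":
    try:
        train()
        sys.exit(0)
    except Exception:
        # Written to /opt/ml/output/failure; surfaced as FailureReason in DescribeTrainingJob.
        with open(failure_path, "w") as f:
            f.write(traceback.format_exc())
        sys.exit(255)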

The parts of the sample container

The container directory has all the components you need to package the sample algorithm for Amazon SageMaker:

.
|-- Dockerfile
|-- build_and_push.sh
`-- cifar10
    |-- cifar10.py
    |-- resnet_model.py
    `-- train

Let’s discuss each of these in turn:

  • ``Dockerfile`` describes how to build your Docker container image. More details are provided below.

  • ``build_and_push.sh`` is a script that uses the Dockerfile to build your container image and then pushes it to ECR. We invoke the commands directly later in this notebook, but you can just copy and run the script for your own algorithms.

  • ``cifar10`` is the directory which contains the files that are installed in the container.

In this simple application, we install only three files in the container. You may not need many more than that, but if you have many supporting routines, you may wish to install more. These three files show the standard structure of our Python containers, although you are free to choose a different tool set and therefore could have a different layout. If you’re writing in a different programming language, you will have a different layout depending on the frameworks and tools you choose.

The files that we put in the container are:

  • ``cifar10.py`` is the program that implements our training algorithm.

  • ``resnet_model.py`` is the program that contains our Resnet model.

  • ``train`` is the program that is invoked when the container is run for training. Our implementation of this script invokes cifar10.py with our hyperparameter values retrieved from /opt/ml/input/config/hyperparameters.json. The goal for doing this is to avoid having to modify our training algorithm program.

In summary, the file you probably want to change for your application is train.
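
As an illustration of that pattern, the sketch below shows one way such a wrapper can be written: read the hyperparameters, turn them into command-line flags, and delegate to cifar10.py. This is a simplified sketch, not a verbatim copy of the train file in this repository, and it assumes cifar10.py accepts each hyperparameter as a --name value flag.

import json
import subprocess
import sys

HYPERPARAMETERS = "/opt/ml/input/config/hyperparameters.json"


def main():
    with open(HYPERPARAMETERS) as f:
        hyperparameters = json.load(f)

    # Forward every hyperparameter to cifar10.py as a --name value flag, so the
    # training program never needs to know about the /opt/ml layout.
    args = ["python", "cifar10.py"]
    for name, value in hyperparameters.items():
        args += ["--{}".format(name), str(value)]

    completed = subprocess.run(args)
    sys.exit(completed.returncode)


if __name__ == "__main__":
    main()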

The Dockerfile

The Dockerfile describes the image that we want to build. You can think of it as describing the complete operating system installation of the system that you want to run. A running Docker container is quite a bit lighter than a full operating system, however, because it takes advantage of the Linux kernel on the host machine for the basic operations.

For the Python science stack, we start from an official TensorFlow Docker image and add our modifications to it. Then we add the code that implements our specific algorithm to the container and set up the right environment for it to run under.

Let’s look at the Dockerfile for this example.

[ ]:
!cat container/Dockerfile

Building and registering the container

The following shell code shows how to build the container image using docker build and push the container image to ECR using docker push. This code is also available as the shell script container/build_and_push.sh, which you can run as build_and_push.sh sagemaker-tf-cifar10-example to build the image sagemaker-tf-cifar10-example.

This code looks for an ECR repository in the account you’re using and the current default region (if you’re using a SageMaker notebook instance, this is the region where the notebook instance was created). If the repository doesn’t exist, the script will create it.

[ ]:
%%sh

# The name of our algorithm
algorithm_name=sagemaker-tf-cifar10-example

cd container

chmod +x cifar10/train

account=$(aws sts get-caller-identity --query Account --output text)

# Get the region defined in the current configuration (default to us-west-2 if none defined)
region=$(aws configure get region)
region=${region:-us-west-2}

fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"

# If the repository doesn't exist in ECR, create it.

aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
fi

# Get the login command from ECR and execute it directly
$(aws ecr get-login --region ${region} --no-include-email)
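# Note: `aws ecr get-login` exists only in AWS CLI v1. If you are using AWS CLI v2,
# log in with the following instead:
#   aws ecr get-login-password --region ${region} | docker login --username AWS --password-stdin ${account}.dkr.ecr.${region}.amazonaws.com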

# Build the docker image locally with the image name and then push it to ECR
# with the full name.

docker build  -t ${algorithm_name} .
docker tag ${algorithm_name} ${fullname}

docker push ${fullname}
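
If you prefer to keep everything in Python, the repository check-and-create step from the script above can also be done with boto3. This is a sketch of the same step, not part of the original script; it assumes your default region and credentials are configured.

[ ]:
import boto3

ecr = boto3.client("ecr")
repo_name = "sagemaker-tf-cifar10-example"

try:
    ecr.describe_repositories(repositoryNames=[repo_name])
except ecr.exceptions.RepositoryNotFoundException:
    # Create the repository if it does not exist yet.
    ecr.create_repository(repositoryName=repo_name)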

Part 2: Training your Algorithm in Amazon SageMaker

Once you have your container packaged, you can use it to train models. Let’s do that with the algorithm we made above.

Set up the environment

Here we specify the S3 prefix under which the training data will be stored. The data goes to the SageMaker session’s default bucket, and the execution role used for working with SageMaker is retrieved later, when we launch the training job.

[ ]:
# S3 prefix
prefix = "DEMO-tensorflow-cifar10"

Create the session

The session remembers our connection parameters to SageMaker. We use it to perform all of our SageMaker operations.

[ ]:
import sagemaker as sage

sess = sage.Session()

Download the CIFAR-10 dataset

Our training algorithm expects the training data to be in TFRecord format (https://www.tensorflow.org/guide/datasets), a simple record-oriented binary format that many TensorFlow applications use for training data. Below is a Python script adapted from the official TensorFlow CIFAR-10 example, which downloads the CIFAR-10 dataset and converts it into TFRecords.

[ ]:
! python utils/generate_cifar10_tfrecords.py --data-dir=/tmp/cifar-10-data
[ ]:
# There should be three tfrecords. (eval, train, validation)
! ls /tmp/cifar-10-data
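
As an extra sanity check, the cell below counts the records in each generated file. This is a sketch that assumes TensorFlow is installed in the notebook environment and that the generated files use the .tfrecords extension.

[ ]:
import os

import tensorflow as tf

data_dir = "/tmp/cifar-10-data"
for name in sorted(os.listdir(data_dir)):
    if not name.endswith(".tfrecords"):
        continue
    # TFRecordDataset iterates over the raw serialized records in a file.
    num_records = sum(1 for _ in tf.data.TFRecordDataset(os.path.join(data_dir, name)))
    print(name, num_records)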

Upload the data for training

We will use the tools provided by the SageMaker Python SDK to upload the data to the session’s default S3 bucket.

[ ]:
WORK_DIRECTORY = "/tmp/cifar-10-data"

data_location = sess.upload_data(WORK_DIRECTORY, key_prefix=prefix)
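
upload_data returns the S3 URI where the files were placed; that URI is what we pass to the training job below. If you want a specific bucket rather than the session’s default bucket, upload_data also accepts a bucket argument, as in the commented-out line below (the bucket name is a placeholder).

[ ]:
print(data_location)

# Optional: upload to an explicitly named bucket instead of the session default.
# data_location = sess.upload_data(WORK_DIRECTORY, bucket="my-example-bucket", key_prefix=prefix)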

Training on SageMaker

Training a model on SageMaker with the Python SDK is similar to using one of the built-in framework estimators: we create an Estimator, set instance_type to one of the supported EC2 instance types, and call fit().

In addition, we must now specify image_uri, the URI of the ECR image that we just pushed above.

Finally, the training dataset we generated locally has to be in Amazon S3 (we uploaded it above), and its S3 location is passed into the fit() call.

Let’s first construct the ECR image URI that corresponds to the image we just built and pushed.

[ ]:
import boto3
from sagemaker import get_execution_role

client = boto3.client("sts")
account = client.get_caller_identity()["Account"]

my_session = boto3.session.Session()
region = my_session.region_name
role = get_execution_role()

algorithm_name = "sagemaker-tf-cifar10-example"

ecr_image = "{}.dkr.ecr.{}.amazonaws.com/{}:latest".format(account, region, algorithm_name)

print(ecr_image)
[ ]:
from sagemaker.estimator import Estimator

hyperparameters = {"train-steps": 100}

instance_type = "ml.m4.xlarge"

estimator = Estimator(
    role=role,
    instance_count=1,
    instance_type=instance_type,
    image_uri=ecr_image,
    hyperparameters=hyperparameters,
)

estimator.fit(data_location)

The model artifacts can be found at the following S3 location:

[ ]:
model_dir = sess.sagemaker_client.describe_training_job(
    TrainingJobName=estimator.latest_training_job.name
)["ModelArtifacts"]["S3ModelArtifacts"]
! echo $model_dir && aws s3 ls $model_dir

Reference
