Bring your own pipe-mode algorithm to Amazon SageMaker



Create a Docker container for training SageMaker algorithms using Pipe-mode


Contents

  1. Overview

  2. Preparation

  3. Permissions

  4. Code

     • train.py

     • Dockerfile

  5. Customize

  6. Train

  7. Conclusion

Preparation

This notebook was created and tested on an ml.t2.medium notebook instance.

Let’s start by specifying:

  • The S3 locations to use for training input and model output. These should be in the same region as the notebook instance, training, and hosting. Since the “algorithm” you’re building here doesn’t depend on any particular data format, feel free to point the training input at any S3 dataset you have; the larger the dataset, the better it exercises raw I/O throughput. For this example, the California Housing dataset will be copied to your S3 bucket.

  • The instance type to use for training. More powerful instance types have more CPU and network bandwidth, which results in higher throughput.

  • The ARN of the IAM role used to give training access to your data.

The California Housing dataset was originally published in:

Pace, R. Kelley, and Ronald Barry. “Sparse spatial autoregressions.” Statistics & Probability Letters 33.3 (1997): 291-297.

Permissions

Running this notebook requires permissions in addition to the normal SageMakerFullAccess permissions, because you’ll be creating a new repository in Amazon ECR. The easiest way to add these permissions is to attach the managed policy AmazonEC2ContainerRegistryFullAccess to the role that you used to start your notebook instance. There’s no need to restart your notebook instance when you do this; the new permissions are available immediately.
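
If you prefer to script it, the managed policy can also be attached with a short boto3 call. This is only a sketch: the role name below is a placeholder for your notebook instance’s execution role, and running it requires iam:AttachRolePolicy permissions.

import boto3

iam = boto3.client("iam")
iam.attach_role_policy(
    RoleName="<your-notebook-execution-role-name>",  # placeholder, not defined in this notebook
    PolicyArn="arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryFullAccess",
)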

[ ]:
import boto3
import pandas as pd
import sagemaker

from sklearn.datasets import fetch_california_housing

# Get SageMaker session & default S3 bucket
role = sagemaker.get_execution_role()
sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
s3 = sagemaker_session.boto_session.resource("s3")
bucket = sagemaker_session.default_bucket()  # replace with your own bucket name if you have one
[ ]:
# helper functions to upload data to s3
def write_to_s3(filename, bucket, prefix):
    filename_key = filename.split(".")[0]
    key = "{}/{}/{}".format(prefix, filename_key, filename)
    return s3.Bucket(bucket).upload_file(filename, key)


def upload_to_s3(bucket, prefix, filename):
    url = "s3://{}/{}/{}".format(bucket, prefix, filename)
    print("Writing data to {}".format(url))
    write_to_s3(filename, bucket, prefix)

If you have a larger dataset you want to try, here is the place to swap in your dataset.

[ ]:
filename = "california_housing.csv"
# Fetch the California Housing dataset from sklearn.datasets
tabular_data = fetch_california_housing()
tabular_data_full = pd.DataFrame(tabular_data.data, columns=tabular_data.feature_names)
tabular_data_full["target"] = tabular_data.target
tabular_data_full.to_csv(filename, index=False)

Upload the dataset to your bucket. You’ll find it with the ‘pipe_bring_your_own/training’ prefix.

[ ]:
prefix = "pipe_bring_your_own/training"
training_data = "s3://{}/{}".format(bucket, prefix)
print("Training data in {}".format(training_data))
upload_to_s3(bucket, prefix, filename)

Code

For the purposes of this demo you’re going to write an extremely simple “training” algorithm in Python. It conforms to the specification required by SageMaker Training and reads data in Pipe-mode, but it does nothing with the data: it simply reads it and throws it away. Doing it this way keeps the example focused on exactly what’s needed to support Pipe-mode, without complicating the code with a real training algorithm.

In Pipe-mode, data is pre-fetched from S3 with high concurrency and throughput and streamed into Unix named pipes (aka FIFOs), one FIFO per channel per epoch. The algorithm must open the FIFO for reading, read through to EOF (or optionally abort mid-stream), and close its end of the file descriptor when done. It can then optionally wait for the next epoch’s FIFO to be created and begin reading again, iterating through epochs until it has met its completion criteria.

For this example, you’ll need two supporting files:

train.py

train.py simply iterates through 5 epochs on the training channel. Each epoch involves reading the training data stream from a FIFO named /opt/ml/input/data/training_${epoch}. At the end of an epoch the code moves on to the next epoch, waits for the new epoch’s FIFO to be created, and continues.

A lot of the code in train.py is boilerplate: printing log messages, trapping termination signals, and so on. The main code, which reads each epoch’s data through its corresponding FIFO, is the following:

[ ]:
!pygmentize train.py
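
If you just want the gist without reading the full script, the epoch-reading pattern described above boils down to something like the sketch below. This is an illustration of the Pipe-mode protocol, not the exact contents of train.py; the channel name, epoch count, and buffer size are assumptions that match the description above.

import os
import time

channel = "training"  # channel name passed to estimator.fit()
num_epochs = 5        # train.py is described as running 5 epochs

for epoch in range(num_epochs):
    fifo_path = "/opt/ml/input/data/{}_{}".format(channel, epoch)
    # Wait for SageMaker to create this epoch's FIFO
    while not os.path.exists(fifo_path):
        time.sleep(0.1)
    total_bytes = 0
    with open(fifo_path, "rb") as fifo:
        while True:
            chunk = fifo.read(1024 * 1024)  # stream the data; this demo simply discards it
            if not chunk:
                break  # EOF: this epoch's stream is exhausted
            total_bytes += len(chunk)
    print("epoch {}: read {} bytes".format(epoch, total_bytes))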

Dockerfile

You can use any of the preconfigured Docker containers that SageMaker provides, or build one from scratch. This example uses the PyTorch AWS Deep Learning Container as the base image, adds train.py, and runs train.py when the entrypoint is launched. To learn more about bring-your-own-container training options, see the Amazon SageMaker Training Toolkit.

[ ]:
%cat Dockerfile
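
For orientation, a Dockerfile along these lines would do the job. Treat it as a sketch rather than the exact file shown above: the base-image tag and the paths are assumptions, although 763104351884 is the standard AWS Deep Learning Containers registry account in most regions, and region is the build argument passed in the build step below.

# Build argument for the AWS region, supplied via --build-arg region=...
ARG region

# Start from a PyTorch training Deep Learning Container (tag is illustrative)
FROM 763104351884.dkr.ecr.${region}.amazonaws.com/pytorch-training:1.13.1-cpu-py39-ubuntu20.04-sagemaker

# Add the training script and run it when the container starts
COPY train.py /opt/ml/code/train.py
ENTRYPOINT ["python", "/opt/ml/code/train.py"]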

Customize

To fetch the PyTorch AWS Deep Learning Container (DLC) base image, first log in to Amazon ECR: once to the public DLC registry (to pull the base image) and once to your own registry (so you can push the custom image later).

[ ]:
%%sh
REGION=$(aws configure get region)
account=$(aws sts get-caller-identity --query Account --output text)
# Log in to the AWS Deep Learning Containers registry so Docker can pull the base image
aws ecr get-login-password --region ${REGION} | docker login --username AWS --password-stdin 763104351884.dkr.ecr.${REGION}.amazonaws.com
# Log in to your own ECR registry so you can push the custom image later
aws ecr get-login-password --region ${REGION} | docker login --username AWS --password-stdin ${account}.dkr.ecr.${REGION}.amazonaws.com

Next, build your custom Docker container, tagging it with the name “pipe_bring_your_own”.

[ ]:
%%sh
docker build -t pipe_bring_your_own . --build-arg region=$(aws configure get region)

With the container built, you can now tag it with the full name you will need when calling it for training (ecr_image). Then upload your custom container to ECR.

[ ]:
account = !aws sts get-caller-identity --query Account --output text
algorithm_name = "pipe_bring_your_own"
ecr_image = "{}.dkr.ecr.{}.amazonaws.com/{}:latest".format(account[0], region, algorithm_name)
print("ecr_image: {}".format(ecr_image))

ecr_client = boto3.client("ecr")
try:
    # Check whether the repository already exists
    ecr_client.describe_repositories(repositoryNames=[algorithm_name])
    print("Repo exists...")
except ecr_client.exceptions.RepositoryNotFoundException:
    # Create the repository on first run
    ecr_client.create_repository(repositoryName=algorithm_name)
    print("Created repo...")

!docker tag {algorithm_name} {ecr_image}
!docker push {ecr_image}

Train

Now you will use the SageMaker Python SDK’s Estimator and pass in the information needed to run the training container in SageMaker. Note that input_mode="Pipe" is the parameter that enables Pipe-mode for this training run. Also note that base_job_name doesn’t allow underscores, which is why dashes are used here.

[ ]:
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri=ecr_image,
    role=role,
    base_job_name="pipe-bring-your-own-test",
    instance_count=1,
    instance_type="ml.c4.xlarge",
    input_mode="Pipe",
)

# Start training
estimator.fit(training_data)

Note the throughput reported in the training logs above. By way of comparison, a File-mode algorithm will achieve at most approximately 150 MB/s on a high-end ml.c5.18xlarge and approximately 75 MB/s on an ml.m4.xlarge.


Conclusion

There are a few situations where Pipe-mode may not be the optimal choice for training, in which case you should stick with File-mode:

  • If your algorithm needs to backtrack or skip ahead within an epoch. This is simply not possible in Pipe-mode, since the underlying FIFO does not support lseek() operations.

  • If your training dataset is small enough to fit in memory and you need to run multiple epochs. In this case it may be quicker and easier to load it all into memory once and iterate.

  • If your training dataset is not easily parseable from a streaming source.

In all other scenarios, if you have an IO-bound training algorithm, switching to Pipe-mode may give you a significant throughput boost and will reduce the size of the disk volume required. This should both save you time and reduce your training costs.

You can read more about building your own training algorithms in the SageMaker Training documentation.
