Compile and Train a Vision Transformer Model on the Caltech-256 Dataset using a Single Node

Introduction
Development Environment and Permissions
1. Installation
2. SageMaker environment
Working with the Caltech-256 dataset
SageMaker Training Job
Analysis
1. Savings from Training Compiler
2. Convergence of Training
Clean up

SageMaker Training Compiler Overview

SageMaker Training Compiler is a capability of SageMaker that makes hard-to-implement optimizations to reduce training time on GPU instances. The compiler optimizes Deep Learning (DL) models to accelerate training by more efficiently using SageMaker machine learning (ML) GPU instances. SageMaker Training Compiler is available at no additional charge within SageMaker and can help reduce total billable time as it accelerates training.

SageMaker Training Compiler is integrated into the AWS Deep Learning Containers (DLCs). Using the SageMaker Training Compiler enabled AWS DLCs, you can compile and optimize training jobs on GPU instances with minimal changes to your code. Bring your deep learning models to SageMaker and enable SageMaker Training Compiler to accelerate the speed of your training job on SageMaker ML instances for accelerated computing.

For more information, see SageMaker Training Compiler in the Amazon SageMaker Developer Guide.

Introduction

In this demo, you’ll use Amazon SageMaker Training Compiler to train the Vision Transformer model on the Caltech-256 dataset. To get started, set up the environment with a few prerequisite steps for permissions and configuration.

NOTE: You can run this demo in SageMaker Studio, SageMaker notebook instances, or your local machine with AWS CLI set up. If using SageMaker Studio or SageMaker notebook instances, make sure you choose one of the TensorFlow-based kernels, Python 3 (TensorFlow x.y Python 3.x CPU Optimized) or conda_tensorflow_p39 respectively.

NOTE: This notebook uses an ml.p3.2xlarge instance with a single GPU. However, it can easily be extended to multiple GPUs on a single node. If you don’t have enough quota, see Request a service quota increase for SageMaker resources.

Development Environment

Installation

This example notebook requires SageMaker Python SDK v2.129 or later, which the following cell installs.

[ ]:

!pip install "sagemaker>=2.129" botocore boto3 awscli matplotlib --upgrade

[ ]:

import botocore
import boto3
import sagemaker

print(f"botocore: {botocore.__version__}")
print(f"boto3: {boto3.__version__}")
print(f"sagemaker: {sagemaker.__version__}")

SageMaker environment

[ ]:

import sagemaker

sess = sagemaker.Session()

# SageMaker session bucket -> used for uploading data, models and logs
# SageMaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

role = sagemaker.get_execution_role()
sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sagemaker_session_bucket}")
print(f"sagemaker session region: {sess.boto_region_name}")

Working with the Caltech-256 dataset

We have hosted the Caltech-256 dataset in S3 in us-east-1. We will transfer this dataset to your account and region for use with SageMaker Training.

The dataset consists of JPEG images organized into directories with each directory representing an object category.

[ ]:

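import subprocess

source = "s3://sagemaker-sample-files/datasets/image/caltech-256/256_ObjectCategories"
destn = f"s3://{sagemaker_session_bucket}/caltech-256"
local = "caltech-256"

# Download the public dataset locally, then upload it to your own bucket.
# check=True raises CalledProcessError if either sync fails, instead of failing silently.
subprocess.run(["aws", "s3", "sync", source, local], check=True)
subprocess.run(["aws", "s3", "sync", local, destn], check=True)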

SageMaker Training Job

To create a SageMaker training job, we use a TensorFlow estimator. Using the estimator, you can define which training script SageMaker should use through entry_point, which instance_type to use for training, which hyperparameters to pass, and so on.

When a SageMaker training job starts, SageMaker takes care of starting and managing all the required machine learning instances, picks up the TensorFlow Deep Learning Container, uploads your training script, and downloads the data from sagemaker_session_bucket into the container at /opt/ml/input/data.
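The training script scripts/vit.py is not shown in this notebook. As a rough illustration of how a script can pick up what SageMaker provides, the following hypothetical sketch parses the hyperparameters, which SageMaker passes to the entry point as command-line arguments, and locates the data. It assumes the default channel name training, which is what you get when a single S3 URI is passed to fit; the function name parse_args, the --data_dir option, and the default values are illustrative only.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters the way a SageMaker entry point might (hypothetical sketch)."""
    parser = argparse.ArgumentParser()
    # Hyperparameter names mirror the keys of the estimator's hyperparameters dict.
    parser.add_argument("--EPOCHS", type=int, default=10)
    parser.add_argument("--BATCH_SIZE", type=int, default=32)
    parser.add_argument("--LEARNING_RATE", type=float, default=1e-3)
    parser.add_argument("--WEIGHT_DECAY", type=float, default=1e-4)
    # SM_CHANNEL_TRAINING is set inside the container for a channel named "training";
    # the fallback path is used when the script runs outside SageMaker.
    parser.add_argument(
        "--data_dir",
        default=os.environ.get("SM_CHANNEL_TRAINING", "/opt/ml/input/data/training"),
    )
    # parse_known_args ignores any extra arguments the platform may append.
    args, _unknown = parser.parse_known_args(argv)
    return args


# Simulate the command line SageMaker would build from the hyperparameters dict.
args = parse_args(["--EPOCHS", "10", "--BATCH_SIZE", "77"])
print(args.EPOCHS, args.BATCH_SIZE, args.data_dir)
```

Reading the data location from the environment rather than hard-coding it keeps the same script usable both locally and inside the container.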

In the following section, you learn how to set up two versions of the SageMaker TensorFlow estimator, a native one without the compiler and an optimized one with the compiler.

Training with Native TensorFlow

The BATCH_SIZE in the following code cell is the largest batch size that can fit into the memory of an ml.p3.2xlarge instance while giving the best training speed. If you change the model, instance type, or other parameters, you need to experiment to find the largest batch size that will fit into GPU memory.
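One way to organize such experiments is a binary search over candidate batch sizes. The sketch below is only an illustration: fits is a hypothetical callback that you would implement yourself, for example by running a few training steps and reporting whether they complete without an out-of-memory error. The search assumes that if a batch size fits, every smaller one fits too.

```python
from typing import Callable


def largest_fitting_batch_size(fits: Callable[[int], bool], upper_bound: int = 4096) -> int:
    """Return the largest batch size in [1, upper_bound] for which fits() is True.

    Returns 0 if even a batch size of 1 does not fit.
    """
    low, high = 0, upper_bound  # the answer always lies in [low, high]
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if fits(mid):
            low = mid  # mid fits, so the answer is at least mid
        else:
            high = mid - 1  # mid does not fit, so the answer is below mid
    return low


# Stand-in for a real memory check: pretend anything up to 77 fits on the GPU.
def pretend_fits(batch_size: int) -> bool:
    return batch_size <= 77


print(largest_fitting_batch_size(pretend_fits))
```

The largest batch size that fits is not always the fastest, so it is still worth comparing the training speed of a few values near the limit.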

Set EPOCHS to the number of times you would like to loop over the training data.

[ ]:

from sagemaker.tensorflow import TensorFlow

EPOCHS = 10
BATCH_SIZE = 77
LEARNING_RATE = 1e-3
WEIGHT_DECAY = 1e-4

kwargs = dict(
    source_dir="scripts",
    entry_point="vit.py",
    model_dir=False,
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    framework_version="2.11",
    py_version="py39",
    debugger_hook_config=False,  # False turns SageMaker Debugger off; None falls back to the default hook
    disable_profiler=True,
    max_run=60 * 60,  # 60 minutes
    role=role,
    metric_definitions=[
        {"Name": "training_loss", "Regex": "loss: ([0-9.]*?) "},
        {"Name": "training_accuracy", "Regex": "accuracy: ([0-9.]*?) "},
        {"Name": "training_latency_per_epoch", "Regex": "- ([0-9.]*?)s/epoch"},
        {"Name": "training_avg_latency_per_step", "Regex": "- ([0-9.]*?)ms/step"},
    ],
)

# Configure the training job
native_estimator = TensorFlow(
    hyperparameters={
        "EPOCHS": EPOCHS,
        "BATCH_SIZE": BATCH_SIZE,
        "LEARNING_RATE": LEARNING_RATE,
        "WEIGHT_DECAY": WEIGHT_DECAY,
    },
    base_job_name="native-tf210-vit",
    **kwargs,
)
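SageMaker applies each Regex in metric_definitions to the training job's log output and publishes the captured number under the given Name. The following sketch checks the four patterns against a single log line. The log line is made up and assumes the training script logs one line per epoch in the style Keras prints with verbose=2; extract_metrics is a helper written for this illustration, not part of the SageMaker SDK.

```python
import re

# Same definitions as in the estimator arguments above.
metric_definitions = [
    {"Name": "training_loss", "Regex": "loss: ([0-9.]*?) "},
    {"Name": "training_accuracy", "Regex": "accuracy: ([0-9.]*?) "},
    {"Name": "training_latency_per_epoch", "Regex": "- ([0-9.]*?)s/epoch"},
    {"Name": "training_avg_latency_per_step", "Regex": "- ([0-9.]*?)ms/step"},
]

# Illustrative log line with made-up numbers (assumed format, see the text above).
log_line = "298/298 - 231s - loss: 4.1234 - accuracy: 0.1875 - 231s/epoch - 775ms/step"


def extract_metrics(line, definitions):
    """Return {metric name: value} for every definition whose regex matches the line."""
    found = {}
    for definition in definitions:
        match = re.search(definition["Regex"], line)
        if match:
            found[definition["Name"]] = float(match.group(1))
    return found


print(extract_metrics(log_line, metric_definitions))
```

Testing the patterns locally like this is a quick way to catch a regex that silently matches nothing before launching a job.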

[ ]:

# Start training with our uploaded datasets as input
native_estimator.fit(inputs=destn, wait=False)

# The name of the training job.
native_estimator.latest_training_job.name

Training with Optimized TensorFlow

Compilation through Training Compiler changes the memory footprint of the model. Most commonly, this manifests as a reduction in memory utilization and a consequent increase in the largest batch size that can fit on the GPU. But in some cases the compiler intelligently promotes caching which leads to a decrease in the largest batch size that can fit on the GPU. Note that if you want to change the batch size, you must adjust the learning rate appropriately.

Note: We recommend turning off SageMaker Debugger’s profiling and debugging tools when you use compilation, to avoid additional overhead.

[ ]:

from sagemaker.tensorflow import TensorFlow, TrainingCompilerConfig

OPTIMIZED_BATCH_SIZE = 56
# Store the scaled values under new names so that re-running this cell
# does not compound the scaling and the native values stay intact
OPTIMIZED_LEARNING_RATE = LEARNING_RATE / BATCH_SIZE * OPTIMIZED_BATCH_SIZE
OPTIMIZED_WEIGHT_DECAY = WEIGHT_DECAY * BATCH_SIZE / OPTIMIZED_BATCH_SIZE

# Configure the training job
optimized_estimator = TensorFlow(
    hyperparameters={
        "EPOCHS": EPOCHS,
        "BATCH_SIZE": OPTIMIZED_BATCH_SIZE,
        "LEARNING_RATE": OPTIMIZED_LEARNING_RATE,
        "WEIGHT_DECAY": OPTIMIZED_WEIGHT_DECAY,
    },
    compiler_config=TrainingCompilerConfig(),
    base_job_name="optimized-tf210-vit",
    **kwargs,
)

[ ]:

# Start training with our uploaded datasets as input
optimized_estimator.fit(inputs=destn, wait=False)

# The name of the training job.
optimized_estimator.latest_training_job.name

Wait for training jobs to complete

The training jobs described above typically take around 40 minutes to complete.

Note: If the estimator object is no longer available because the kernel was restarted or refreshed, use the training job name to attach the training job to a new TensorFlow estimator. For example:

native_estimator = TensorFlow.attach("<your_training_job_name>")

[ ]:

waiter = sess.sagemaker_client.get_waiter("training_job_completed_or_stopped")

waiter.wait(TrainingJobName=native_estimator.latest_training_job.name)
waiter.wait(TrainingJobName=optimized_estimator.latest_training_job.name)

[ ]:

native_estimator = TensorFlow.attach(native_estimator.latest_training_job.name)
optimized_estimator = TensorFlow.attach(optimized_estimator.latest_training_job.name)

Analysis

Here we view the training metrics from each training job as a pandas DataFrame.

[ ]:

import pandas as pd

# Extract training metrics from the estimator
native_metrics = native_estimator.training_job_analytics.dataframe()

# Restructure table for viewing
for metric in native_metrics["metric_name"].unique():
    native_metrics[metric] = native_metrics[native_metrics["metric_name"] == metric]["value"]
native_metrics = native_metrics.drop(columns=["metric_name", "value"])
native_metrics = native_metrics.groupby("timestamp").max()
native_metrics["epochs"] = range(1, len(native_metrics) + 1)  # number the rows instead of hard-coding 10 epochs
native_metrics = native_metrics.set_index("epochs")

native_metrics

[ ]:

import pandas as pd

# Extract training metrics from the estimator
optimized_metrics = optimized_estimator.training_job_analytics.dataframe()

# Restructure table for viewing
for metric in optimized_metrics["metric_name"].unique():
    optimized_metrics[metric] = optimized_metrics[optimized_metrics["metric_name"] == metric][
        "value"
    ]
optimized_metrics = optimized_metrics.drop(columns=["metric_name", "value"])
optimized_metrics = optimized_metrics.groupby("timestamp").max()
optimized_metrics["epochs"] = range(1, len(optimized_metrics) + 1)  # number the rows instead of hard-coding 10 epochs
optimized_metrics = optimized_metrics.set_index("epochs")

optimized_metrics

Savings from Training Compiler

Let us calculate the actual savings on the training jobs above and the potential for savings for a longer training job.

Actual Savings

To get the actual savings, we use the describe_training_job API to get the billable seconds for each training job.

[ ]:

# Billable seconds for the Native TensorFlow Training job

details = sess.describe_training_job(job_name=native_estimator.latest_training_job.name)
native_secs = details["BillableTimeInSeconds"]

native_secs

[ ]:

# Billable seconds for the Optimized TensorFlow Training job

details = sess.describe_training_job(job_name=optimized_estimator.latest_training_job.name)
optimized_secs = details["BillableTimeInSeconds"]

optimized_secs

[ ]:

# Calculating percentage Savings from Training Compiler

percentage = (native_secs - optimized_secs) * 100 / native_secs

f"Training Compiler yielded {percentage:.2f}% savings in training cost."

Potential savings

Training Compiler compiles the model graph once per input shape and reuses the cached graph for subsequent steps. As a result, the first few steps of training incur extra latency from compilation, which we refer to as the compilation overhead. This overhead is amortized over time because the subsequent steps are much faster. The following plot demonstrates this.

[ ]:

import matplotlib.pyplot as plt

plt.plot(native_metrics["training_latency_per_epoch"], label="native_epoch_latency")
plt.plot(optimized_metrics["training_latency_per_epoch"], label="optimized_epoch_latency")
plt.legend()

We calculate the potential savings below from the difference in steady-state epoch latency between native TensorFlow and optimized TensorFlow, taking the latency of the last epoch as the steady-state value.

[ ]:

native_steady_state_latency = native_metrics["training_latency_per_epoch"].iloc[-1]

native_steady_state_latency

[ ]:

optimized_steady_state_latency = optimized_metrics["training_latency_per_epoch"].iloc[-1]

optimized_steady_state_latency

[ ]:

# Calculating potential percentage Savings from Training Compiler

percentage = (
    (native_steady_state_latency - optimized_steady_state_latency)
    * 100
    / native_steady_state_latency
)

f"Training Compiler can potentially yield {percentage:.2f}% savings in training cost for a longer training job."
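To see how the compilation overhead is amortized, it can help to project the total training time for longer runs. The sketch below is a simplified model, not a measurement: it assumes every epoch beyond the observed ones costs the last observed latency, and the latencies are made-up numbers rather than results from the jobs above. The function names are chosen for this illustration.

```python
def projected_total_seconds(epoch_latencies, total_epochs):
    """Extrapolate total training time to total_epochs.

    Uses the observed latencies for the epochs that were run and assumes each
    further epoch costs the last observed (steady-state) latency.
    """
    observed = list(epoch_latencies)[:total_epochs]
    if not observed:
        raise ValueError("need at least one observed epoch and total_epochs >= 1")
    remaining = total_epochs - len(observed)
    return sum(observed) + remaining * observed[-1]


def projected_savings_percent(native_latencies, optimized_latencies, total_epochs):
    """Percentage of training time saved by the optimized job over total_epochs."""
    native = projected_total_seconds(native_latencies, total_epochs)
    optimized = projected_total_seconds(optimized_latencies, total_epochs)
    return (native - optimized) * 100 / native


# Made-up latencies in seconds per epoch, for illustration only.
example_native = [100.0, 100.0, 100.0]
example_optimized = [160.0, 80.0, 80.0]  # the first epoch includes compilation overhead

for epochs in (3, 10, 100):
    savings = projected_savings_percent(example_native, example_optimized, epochs)
    print(f"{epochs} epochs: {savings:.2f}% projected savings")
```

With these made-up numbers the optimized job is slower overall after 3 epochs, and the projected savings approach the 20% steady-state figure as the number of epochs grows.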

Convergence of Training

Training Compiler brings down total training time by intelligently choosing between memory utilization and core utilization in the GPU. This does not change the model arithmetic and therefore does not affect the convergence of the model.

However, because we are working with a new batch size, hyperparameters such as the learning rate, learning rate schedule, and weight decay might have to be scaled and tuned for the new batch size.

[ ]:

import matplotlib.pyplot as plt

plt.plot(native_metrics["training_loss"], label="native_loss")
plt.plot(optimized_metrics["training_loss"], label="optimized_loss")
plt.legend()

We can see that the model’s convergence behavior is similar with and without Training Compiler. Here we have scaled the batch-size-specific hyperparameters, learning rate and weight decay, linearly with the batch size.

Learning rate is directly proportional to the batch size:

new_learning_rate = old_learning_rate * new_batch_size/old_batch_size

Weight decay is inversely proportional to the batch size:

new_weight_decay = old_weight_decay * old_batch_size/new_batch_size
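As a worked example of the two rules above, the following sketch applies them to the values used in this notebook (batch size 77 to 56, learning rate 1e-3, weight decay 1e-4). The helper name scale_for_batch_size is chosen for this illustration.

```python
def scale_for_batch_size(learning_rate, weight_decay, old_batch_size, new_batch_size):
    """Linearly rescale learning rate and weight decay for a new batch size."""
    ratio = new_batch_size / old_batch_size
    new_learning_rate = learning_rate * ratio  # directly proportional to batch size
    new_weight_decay = weight_decay / ratio  # inversely proportional to batch size
    return new_learning_rate, new_weight_decay


new_lr, new_wd = scale_for_batch_size(1e-3, 1e-4, old_batch_size=77, new_batch_size=56)
print(f"learning rate: {new_lr:.6g}, weight decay: {new_wd:.6g}")
```

For these inputs the learning rate shrinks to roughly 7.27e-4 and the weight decay grows to 1.375e-4.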

Better results can be achieved with further tuning. To automate the search, see Automatic Model Tuning.

Clean up

Stop all training jobs launched if the jobs are still running.

[ ]:

def stop_training_job(name):
    # Stop the job only if it is still running
    status = sess.describe_training_job(name)["TrainingJobStatus"]
    if status == "InProgress":
        sess.sagemaker_client.stop_training_job(TrainingJobName=name)


stop_training_job(native_estimator.latest_training_job.name)
stop_training_job(optimized_estimator.latest_training_job.name)

For instructions on cleaning up other resources, see Clean Up in the Amazon SageMaker Developer Guide.
