Use SageMaker Distributed Model Parallel with Amazon SageMaker to Launch Training Job with Model Parallelization

SageMaker Distributed Model Parallel is a model parallelism library for training large deep learning models that were previously difficult to train due to GPU memory limitations. SageMaker Distributed Model Parallel automatically and efficiently splits a model across multiple GPUs and instances and coordinates model training, allowing you to increase prediction accuracy by creating larger models with more parameters.

Use this notebook to configure Sagemaker Distributed Model Parallel to train a model using TensorFlow and Amazon SageMaker Python SDK.

Additional Resources

If you are a new user of Amazon SageMaker, you may find the following helpful to understand how SageMaker uses Docker to train custom models. * To learn more about using Amazon SageMaker with your own training image, see Use Your Own Training Algorithms.

Amazon SageMaker Initialization

Run the following cells to initialize the notebook instance. Get the SageMaker execution role used to run this notebook.

[ ]:
pip install sagemaker-experiments
[ ]:
pip install sagemaker --upgrade
[ ]:
import sagemaker
from sagemaker import get_execution_role
from sagemaker.tensorflow import TensorFlow
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
import boto3
from time import gmtime, strftime
from datetime import datetime

role = (
)  # provide a pre-existing role ARN as an alternative to creating a new role
print(f"SageMaker Execution Role:{role}")

session = boto3.session.Session()

Prepare your training script

Run the following cells to view example-training scripts for TensorFlow versions 2.3. The is pure model paralleism and is data/model paralleism using Horovod.

[ ]:
# Run this cell to see an example of a training scripts that you can use to configure -
# SageMaker Distributed Model Parallel with TensorFlow versions 2.3
!cat utils/
[ ]:
# Run this cell to see an example of a training scripts that you can use to configure -
# SageMaker Distributed Model Parallel using Horvod with TensorFlow 2.3
!cat utils/

Define SageMaker Training Job

Next, you will use SageMaker Estimator API to define a SageMaker Training Job. You will use an `Estimator <>`__ to define the number and type of EC2 instances Amazon SageMaker uses for training, as well as the size of the volume attached to those instances.

You must update the following: * processes_per_host * entry_point * instance_count * instance_type * base_job_name

In addition, you can supply and modify configuration parameters for the SageMaker Distributed Model Parallel library. These parameters will be passed in through the distributions argument, as shown below.

Update the Type and Number of EC2 Instances Used

Pick your entry_point from one of the example scripts:,

Specify your processes_per_host, for only use 2, for use at least 4. Note that it must be a multiple of your partitions, which by default is 2.

The instance type and number of instances you specify in instance_type and instance_count respecitvely will determine the number of GPUs Amazon SageMaker uses during training. Explicitly, instance_type will determine the number of GPUs on a single instance and that number will be multiplied by instance_count.

You must specify values for instance_type and instance_count so that the total number of GPUs available for training is equal to partitions in parameters of your model parallel distributions argument in the Estimator API

If you use, the total number of model replicas your training job can support will be equal to the total number of GPUs you specify, divided by partitions. Therefore, if you use Horovod for data parallelization, specify the total number of GPUs to be the desired number of model replicas times partitions: total-model-replicas x partitions.

To look up instances types, see Amazon EC2 Instance Types.

Uploading Checkpoint During Training or Resuming Checkpoint from Previous Training

We also provide a custom way for users to upload checkpoints during training or resume checkpoints from previous training. We have integrated this into our example script. Please see the functions aws_s3_sync, sync_local_checkpoints_to_s3, and sync_s3_checkpoints_to_local. For the purpose of this example, we are only uploading a checkpoint during training, by using sync_local_checkpoints_to_s3.

After you have updated entry_point, instance_count, instance_type and base_job_name, run the following to create an estimator.

[ ]:
sagemaker_session = sagemaker.session.Session(boto_session=session)
mpioptions = "-verbose -x orte_base_help_aggregate=0 "

# choose an experiment name (only need to create it once)
experiment_name = "SM-MP-DEMO-{}".format("%Y%m%d-%H%M%S"))

all_experiment_names = [exp.experiment_name for exp in Experiment.list()]
# Load the experiment if it exists, otherwise create
if experiment_name not in all_experiment_names:
    customer_churn_experiment = Experiment.create(
        experiment_name=experiment_name, sagemaker_boto_client=boto3.client("sagemaker")
    customer_churn_experiment = Experiment.load(
        experiment_name=experiment_name, sagemaker_boto_client=boto3.client("sagemaker")

# Create a trial for the current run
trial = Trial.create(
    trial_name="SMD-MP-demo-{}".format(strftime("%Y-%m-%d-%H-%M-%S", gmtime())),

smd_mp_estimator = TensorFlow(
    entry_point="",  # Pick your train script
        "smdistributed": {
            "modelparallel": {
                "enabled": True,
                "parameters": {
                    "microbatches": 2,
                    "partitions": 2,
                    "pipeline": "interleaved",
                    "optimize": "memory",
                    # "horovod": True, #Set to True if using the horovod script
        "mpi": {
            "enabled": True,
            "processes_per_host": 2,  # Pick your processes_per_host
            "custom_mpi_options": mpioptions,

Finally, you will use the estimator to launch the SageMaker training job.

[ ]:
        "ExperimentName": customer_churn_experiment.experiment_name,
        "TrialName": trial.trial_name,
        "TrialComponentDisplayName": "Training",