Distributed Training

The SageMaker built-in library of algorithms consists of 18 popular machine learning algorithms. Many of them were rewritten from scratch to be scalable and distributed out of the box. If you want to use distributed deep learning training code, we recommend Amazon SageMaker’s distributed training libraries, which make it easier to write highly scalable and cost-effective custom data parallel and model parallel deep learning training jobs.

SageMaker distributed training libraries offer both data-parallel and model-parallel training strategies. They combine software and hardware technologies to improve inter-GPU and inter-node communications, and they extend SageMaker’s training capabilities with built-in options that require only small code changes to your training scripts.

To learn how, try one of the notebooks in the following framework sections.

Frameworks

Apache MXNet
PyTorch
TensorFlow2

SageMaker distributed data parallel

SageMaker distributed data parallel (SDP) extends SageMaker’s training capabilities on deep learning models with near-linear scaling efficiency, achieving fast time-to-train with minimal code changes.

SDP optimizes your training job for AWS network infrastructure and EC2 instance topology.

SDP synchronizes gradient updates between nodes with a custom AllReduce algorithm.
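To make the idea of an AllReduce over gradients concrete, here is a minimal single-process sketch. It is not SDP's custom algorithm, which is not described here; it only simulates the generic result of an averaging AllReduce, where every worker ends up with the same mean gradient. The function name `allreduce_mean` and the sample values are hypothetical.

```python
from typing import List


def allreduce_mean(worker_grads: List[List[float]]) -> List[List[float]]:
    """Average gradients element-wise across workers; every worker gets the result."""
    if not worker_grads:
        raise ValueError("need at least one worker")
    n_workers = len(worker_grads)
    n_params = len(worker_grads[0])
    if any(len(g) != n_params for g in worker_grads):
        raise ValueError("all workers must hold gradients of the same length")
    # Reduce step: sum each gradient component over all workers, then divide.
    averaged = [sum(g[i] for g in worker_grads) / n_workers for i in range(n_params)]
    # Broadcast step: each worker receives its own copy of the averaged gradient.
    return [list(averaged) for _ in range(n_workers)]


# Two simulated workers, each holding gradients from a different mini-batch.
grads = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
synced = allreduce_mean(grads)
print(synced[0])  # [2.0, 3.0, 4.0]
```

Because every worker applies the same averaged gradient, the model replicas stay identical after each step. In a real cluster the reduce and broadcast steps are network operations, which is where the communication cost comes from.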

When training a model on a large amount of data, machine learning practitioners often turn to distributed training to reduce the time to train. In some cases, where time is of the essence, the business requirement is to finish training as quickly as possible, or at least within a constrained time period. Distributed training is then scaled to a cluster of multiple nodes, meaning not just multiple GPUs in a computing instance, but multiple instances with multiple GPUs. As the cluster size increases, however, performance can drop significantly. This drop is primarily caused by the communications overhead between nodes in a cluster.
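One common way to quantify this drop is scaling efficiency: the measured cluster throughput divided by the throughput that perfect linear scaling would give. The sketch below uses that standard definition; the function name and the throughput numbers are hypothetical and are not measurements of any SageMaker configuration.

```python
def scaling_efficiency(single_throughput: float,
                       cluster_throughput: float,
                       n_workers: int) -> float:
    """Return measured throughput as a fraction of ideal linear scaling."""
    if single_throughput <= 0 or n_workers <= 0:
        raise ValueError("throughput and worker count must be positive")
    ideal = single_throughput * n_workers  # what perfect scaling would deliver
    return cluster_throughput / ideal


# Hypothetical example: one GPU processes 100 samples/s, eight GPUs reach 640.
eff = scaling_efficiency(100.0, 640.0, 8)
print(f"{eff:.0%}")  # 80%
```

An efficiency close to 1.0 corresponds to near-linear scaling. The gap below 1.0 is the share of capacity lost to overhead such as inter-node communication.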

SageMaker distributed (SMD) offers two options for distributed training: SageMaker model parallel (SMP) and SageMaker data parallel (SDP). This guide focuses on how to train models using a data parallel strategy. For more information on training with a model parallel strategy, refer to SageMaker distributed model parallel.
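As a rough illustration of how the data parallel option is selected, the sketch below builds the `distribution` argument that the SageMaker Python SDK framework estimators accept. The script name, instance count, and instance type are hypothetical placeholders, and the estimator call is left as a comment because it needs the `sagemaker` package, an IAM role, and framework versions that are not given here.

```python
def data_parallel_distribution(enabled: bool = True) -> dict:
    """Build the estimator `distribution` argument for SageMaker data parallel."""
    return {"smdistributed": {"dataparallel": {"enabled": enabled}}}


# Hypothetical estimator settings; adjust to your own script and cluster.
estimator_kwargs = {
    "entry_point": "train.py",          # assumed name of your training script
    "instance_count": 2,                # assumed cluster size
    "instance_type": "ml.p3.16xlarge",  # assumed multi-GPU instance type
    "distribution": data_parallel_distribution(),
}

# With the SageMaker Python SDK installed, these settings would be passed on:
#   from sagemaker.pytorch import PyTorch
#   estimator = PyTorch(role=..., framework_version=..., py_version=...,
#                       **estimator_kwargs)
print(estimator_kwargs["distribution"])
```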

SageMaker distributed model parallel

Amazon SageMaker distributed model parallel (SMP) is a model parallelism library for training large deep learning models that were previously difficult to train due to GPU memory limitations. SMP automatically and efficiently splits a model across multiple GPUs and instances and coordinates model training, allowing you to increase prediction accuracy by creating larger models with more parameters.
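To give a feel for what splitting a model across devices means, here is a toy sketch that assigns consecutive layers to devices so that parameter counts are roughly balanced. This is an assumed greedy heuristic for illustration only, not the partitioning algorithm SMP uses; the function name and the layer sizes are hypothetical.

```python
from typing import List


def partition_layers(param_counts: List[int], n_devices: int) -> List[List[int]]:
    """Split layer indices into n_devices contiguous, roughly balanced groups."""
    if n_devices <= 0:
        raise ValueError("n_devices must be positive")
    if n_devices > len(param_counts):
        raise ValueError("cannot use more devices than layers")
    target = sum(param_counts) / n_devices  # ideal parameter share per device
    groups: List[List[int]] = [[]]
    running = 0
    for idx, count in enumerate(param_counts):
        layers_left = len(param_counts) - idx
        groups_left = n_devices - len(groups)
        # Open a new group when the current one has reached its share, or when
        # the remaining layers are only just enough to fill the remaining devices.
        if groups[-1] and groups_left > 0 and (running >= target or layers_left <= groups_left):
            groups.append([])
            running = 0
        groups[-1].append(idx)
        running += count
    return groups


# Hypothetical model: six layers with these parameter counts, two devices.
print(partition_layers([4, 4, 2, 2, 2, 2], 2))  # [[0, 1], [2, 3, 4, 5]]
```

Each device then holds only its own group of layers, so the memory needed per GPU is a fraction of the full model.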

You can use SMP to automatically partition your existing TensorFlow and PyTorch workloads across multiple GPUs with minimal code changes. The SMP API can be accessed through the Amazon SageMaker SDK.
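The following sketch shows the general shape of the `distribution` argument used to turn on SMP through the SageMaker Python SDK. The values for `partitions`, `microbatches`, and `processes_per_host` are assumptions chosen for illustration; check the SMP documentation for the parameters that apply to your library version.

```python
def model_parallel_distribution(partitions: int,
                                microbatches: int,
                                processes_per_host: int) -> dict:
    """Build the estimator `distribution` argument for SageMaker model parallel."""
    if partitions < 1 or microbatches < 1 or processes_per_host < 1:
        raise ValueError("all settings must be at least 1")
    return {
        "smdistributed": {
            "modelparallel": {
                "enabled": True,
                "parameters": {
                    "partitions": partitions,      # number of model partitions
                    "microbatches": microbatches,  # splits of each mini-batch
                },
            }
        },
        # SMP jobs are launched through MPI.
        "mpi": {"enabled": True, "processes_per_host": processes_per_host},
    }


# Hypothetical settings: 2 partitions, 4 microbatches, 8 processes per host.
smp_distribution = model_parallel_distribution(2, 4, 8)
print(smp_distribution["smdistributed"]["modelparallel"]["parameters"])
```

The resulting dictionary would be passed as `distribution=smp_distribution` when constructing a framework estimator, in the same way as the data parallel option.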

Use the following sections to learn more about model parallelism and the SMP library.

MPI

Use MPI on SageMaker

Introduction to MPI on Amazon SageMaker

PyTorch

SageMaker distributed data parallel (SDP)

SageMaker distributed model parallel (SMP)

TensorFlow2

SageMaker distributed data parallel (SDP)

SageMaker distributed model parallel (SMP)

Use SageMaker Distributed Model Parallel with Amazon SageMaker to Launch Training Job with Model Parallelization

Horovod

Train and Host a Keras Model with Pipe Mode and Horovod on Amazon SageMaker

Apache MXNet

Horovod

In addition to the notebook, this subject is covered in the workshop topic: Parallelized data distribution (sharding)
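As a minimal sketch of what sharding means here, the example below divides a list of data files among training instances so that each instance reads a disjoint subset. The round-robin rule, the function name, and the file names are assumptions for illustration. When sharding is requested from SageMaker itself (for example with the `ShardedByS3Key` distribution type on an S3 input), the service decides which keys go to which instance.

```python
from typing import List


def shard_keys(keys: List[str], n_instances: int) -> List[List[str]]:
    """Assign data files to instances round-robin so that no file is shared."""
    if n_instances <= 0:
        raise ValueError("n_instances must be positive")
    ordered = sorted(keys)  # fixed order, so every instance computes the same split
    return [ordered[i::n_instances] for i in range(n_instances)]


# Hypothetical file names, split across two instances.
files = ["part-c", "part-a", "part-b", "part-d", "part-e"]
shards = shard_keys(files, 2)
print(shards)  # [['part-a', 'part-c', 'part-e'], ['part-b', 'part-d']]
```

Each instance trains on its own shard, so the full dataset is covered once per epoch without every instance downloading all of the data.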