Heterogeneous Clusters

SageMaker Training Heterogeneous Clusters allows you to run one training job that includes instances of different types, for example, a GPU instance like ml.p4d.24xlarge and a CPU instance like ml.c5.18xlarge.

One primary use case is offloading CPU-intensive tasks like image pre-processing (data augmentation) from the GPU instance to a dedicated CPU instance, so you can fully utilize the expensive GPUs and reduce both training time and cost.
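With the SageMaker Python SDK, a heterogeneous cluster is defined by passing `instance_groups` to an estimator instead of a single `instance_type`/`instance_count`. A minimal sketch, assuming a TensorFlow estimator; the group names, counts, entry point, and role ARN are illustrative placeholders:

```python
# Sketch: launching a heterogeneous cluster training job.
# Requires AWS credentials and a SageMaker execution role to actually run.
from sagemaker.instance_group import InstanceGroup
from sagemaker.tensorflow import TensorFlow

# One CPU group for data pre-processing, one GPU group for training.
data_group = InstanceGroup("data_group", "ml.c5.18xlarge", 2)
dnn_group = InstanceGroup("dnn_group", "ml.p4d.24xlarge", 1)

estimator = TensorFlow(
    entry_point="train.py",        # hypothetical training script
    role="<your-sagemaker-role>",  # replace with your IAM role ARN
    framework_version="2.9",
    py_version="py39",
    instance_groups=[data_group, dnn_group],  # replaces instance_type/instance_count
)
estimator.fit()
```

Inside the job, the same script runs on every instance, so it typically branches on the current instance group to decide whether to serve data or train.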

See the following example notebooks:

Hello World

This minimal example launches a heterogeneous cluster training job, prints environment information, and exits.
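At runtime, each instance can discover which group it belongs to from SageMaker's resource configuration file (`/opt/ml/input/config/resourceconfig.json`). A minimal sketch of that lookup as a pure function over the parsed JSON; the field names shown are illustrative and should be verified against the Hello World notebook:

```python
def current_group_name(resource_config: dict) -> str:
    """Return the instance-group name the current host belongs to.

    `resource_config` mirrors the shape of SageMaker's
    /opt/ml/input/config/resourceconfig.json; field names here are
    assumptions and should be checked against the actual file.
    """
    current_host = resource_config["current_host"]
    for group in resource_config["instance_groups"]:
        if current_host in group["hosts"]:
            return group["instance_group_name"]
    raise ValueError(f"{current_host} not found in any instance group")

# Example resource configuration for a two-group cluster.
config = {
    "current_host": "algo-1",
    "instance_groups": [
        {"instance_group_name": "data_group", "hosts": ["algo-2", "algo-3"]},
        {"instance_group_name": "dnn_group", "hosts": ["algo-1"]},
    ],
}
print(current_group_name(config))  # → dnn_group
```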

TensorFlow

This example is a reusable implementation of a heterogeneous cluster with TensorFlow's tf.data.service.
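The runtime pattern is: the CPU group runs tf.data.service dispatcher and worker processes, and the GPU group routes its input pipeline through them. A minimal sketch of the two sides, assuming a hypothetical dispatcher address of `data-host:31000` (a fragment that only runs inside a cluster where that service is reachable; consult the notebook for the full wiring):

```python
import tensorflow as tf

# --- On the CPU (data) group: start the tf.data.service processes. ---
dispatcher = tf.data.experimental.service.DispatchServer(
    tf.data.experimental.service.DispatcherConfig(port=31000)
)
worker = tf.data.experimental.service.WorkerServer(
    tf.data.experimental.service.WorkerConfig(
        dispatcher_address=dispatcher.target.split("://")[1]
    )
)

# --- On the GPU (training) group: distribute the pipeline to the service. ---
dataset = tf.data.Dataset.range(1000).map(lambda x: x * 2)  # placeholder pre-processing
dataset = dataset.apply(
    tf.data.experimental.service.distribute(
        processing_mode="distributed_epoch",
        service="grpc://data-host:31000",  # hypothetical dispatcher address
    )
)
```

With this split, the `map` pre-processing executes on the CPU group's workers, and the training group only consumes ready batches.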

PyTorch

This example is a reusable implementation of a heterogeneous cluster with a gRPC-based data loader.
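The gRPC plumbing lives in the notebook, but the loader pattern itself is simple: the training group iterates over batches pulled from a remote service through a client call. A pure-Python sketch with the transport abstracted behind a `fetch_batch` callable (hypothetical; the notebook uses generated gRPC stubs instead):

```python
from typing import Callable, Iterator, List, Optional

Batch = List[int]  # stand-in for a tensor batch

class RemoteBatchLoader:
    """Iterates over batches served by a remote data service.

    `fetch_batch` stands in for a gRPC client call: it returns the
    next batch, or None once the epoch is exhausted.
    """

    def __init__(self, fetch_batch: Callable[[], Optional[Batch]]):
        self._fetch_batch = fetch_batch

    def __iter__(self) -> Iterator[Batch]:
        while (batch := self._fetch_batch()) is not None:
            yield batch

# Usage with a local stub in place of a real gRPC channel.
queue = [[1, 2], [3, 4], None]
loader = RemoteBatchLoader(lambda: queue.pop(0))
print(list(loader))  # → [[1, 2], [3, 4]]
```

In the real implementation, the stub would call the CPU group's gRPC service, so pre-processing cost stays off the GPU instance.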