SageMaker Training Heterogeneous Clusters allows you to run a single training job that includes instances of different types, for example a GPU instance like ml.p4d.24xlarge and a CPU instance like ml.c5.18xlarge.
One primary use case is offloading CPU-intensive tasks like image pre-processing (data augmentation) from the GPU instance to a dedicated CPU instance, so you can fully utilize the expensive GPUs and reduce both the time and the cost to train.
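As a rough sketch of what requesting such a cluster might look like, the snippet below assembles the `ResourceConfig` portion of a `CreateTrainingJob` request with two instance groups. The group names (`data_group`, `dnn_group`), the instance counts, and the volume size are illustrative assumptions, and `build_resource_config` is a hypothetical helper that only builds a plain dictionary; it is not part of any SDK.

```python
import json


def build_resource_config(groups, volume_size_gb=50):
    """Assemble a ResourceConfig dict with one entry per instance group.

    `groups` is a list of (name, instance_type, instance_count) tuples.
    Hypothetical helper: it only builds a dict and calls no AWS API.
    """
    names = [name for name, _, _ in groups]
    if len(set(names)) != len(names):
        raise ValueError("instance group names must be unique")
    return {
        "InstanceGroups": [
            {
                "InstanceGroupName": name,
                "InstanceType": instance_type,
                "InstanceCount": count,
            }
            for name, instance_type, count in groups
        ],
        "VolumeSizeInGB": volume_size_gb,  # assumed value, adjust as needed
    }


# Illustrative layout: two CPU instances feed data to one GPU instance.
resource_config = build_resource_config(
    [
        ("data_group", "ml.c5.18xlarge", 2),
        ("dnn_group", "ml.p4d.24xlarge", 1),
    ]
)
print(json.dumps(resource_config, indent=2))
```

The ratio of CPU to GPU instances is workload-dependent; the 2:1 split here is only a placeholder to tune against observed GPU utilization.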
See the following example notebooks:
This minimal example launches a heterogeneous cluster training job, prints environment information, and exits.
This example is a reusable implementation of a heterogeneous cluster with TensorFlow’s tf.data.service.
This example is a reusable implementation of a heterogeneous cluster with a gRPC-based data loader.
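To give a sense of the kind of environment information a training script might inspect in a heterogeneous cluster, here is a minimal sketch that reads the container's resource configuration and reports which instance group the current host belongs to. It assumes the file is at `/opt/ml/input/config/resourceconfig.json` and contains `current_host`, `current_instance_type`, `current_group_name`, and `instance_groups` fields; verify these names against the SageMaker documentation. `load_resource_config`, `describe_host`, and the sample values are hypothetical and are not taken from the notebooks.

```python
import json
from pathlib import Path

# Assumed location of the resource configuration inside the training container.
RESOURCE_CONFIG_PATH = Path("/opt/ml/input/config/resourceconfig.json")

# Hypothetical sample, used when the file is absent (for example, a local run).
SAMPLE_CONFIG = {
    "current_host": "algo-1",
    "current_instance_type": "ml.c5.18xlarge",
    "current_group_name": "data_group",
    "hosts": ["algo-1", "algo-2", "algo-3"],
    "instance_groups": [
        {
            "instance_group_name": "data_group",
            "instance_type": "ml.c5.18xlarge",
            "hosts": ["algo-1", "algo-2"],
        },
        {
            "instance_group_name": "dnn_group",
            "instance_type": "ml.p4d.24xlarge",
            "hosts": ["algo-3"],
        },
    ],
}


def load_resource_config(path=RESOURCE_CONFIG_PATH):
    """Return the parsed resource config, or the sample if the file is missing."""
    if path.exists():
        return json.loads(path.read_text())
    return SAMPLE_CONFIG


def describe_host(config):
    """Summarize the current host's instance group and the hosts sharing it."""
    group = config["current_group_name"]
    peers = next(
        g["hosts"]
        for g in config["instance_groups"]
        if g["instance_group_name"] == group
    )
    return {
        "host": config["current_host"],
        "instance_type": config["current_instance_type"],
        "group": group,
        "group_hosts": peers,
    }


if __name__ == "__main__":
    print(json.dumps(describe_host(load_resource_config()), indent=2))
```

A script could branch on the returned `group` value to start a data-serving process on CPU hosts and the training loop on GPU hosts.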