Heterogeneous Cluster - a hello world training job




This basic example shows how to run a Heterogeneous Clusters training job consisting of two instance groups, each with a different instance type. Each instance prints its environment information, including its instance group, and exits.

You can retrieve environment information in either of the following ways:

  • Option 1: Read instance group information using the convenient sagemaker_training.environment.Environment class.

  • Option 2: Read instance group information from /opt/ml/input/config/resourceconfig.json.
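As a minimal sketch of Option 2, the following reads the resource configuration file using only the standard library. The helper names `load_resource_config` and `summarize_current_instance` are hypothetical and not part of any SageMaker library; the field names are the ones that appear in the sample dump shown in Step 6 of section B.

```python
import json
from pathlib import Path

# Path given in this notebook; it only exists inside a SageMaker training container.
RESOURCE_CONFIG_PATH = "/opt/ml/input/config/resourceconfig.json"


def load_resource_config(path=RESOURCE_CONFIG_PATH):
    """Return the parsed resource config as a dict (hypothetical helper)."""
    return json.loads(Path(path).read_text())


def summarize_current_instance(config):
    """Pick out the fields that identify this instance and its group."""
    return {
        "host": config["current_host"],
        "instance_type": config["current_instance_type"],
        "instance_group": config["current_group_name"],
    }


if __name__ == "__main__":
    if Path(RESOURCE_CONFIG_PATH).exists():
        print(summarize_current_instance(load_resource_config()))
    else:
        print(f"{RESOURCE_CONFIG_PATH} not found; run this inside a training job.")
```

Outside a training container the file is absent, so the script only prints a notice instead of failing.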

Note: This notebook does not demonstrate offloading data preprocessing to data_group and deep neural network training to dnn_group. Those examples are covered in the notebooks “TensorFlow’s tf.data.service based Amazon SageMaker Heterogeneous Clusters for training” and “PyTorch and gRPC distributed dataloader based Amazon SageMaker Heterogeneous Clusters for training”.

A. Setting up SageMaker Studio notebook

Before you start

Ensure you have selected the Python 3 (TensorFlow 2.6 Python 3.8 CPU Optimized) image for your SageMaker Studio notebook, and that it is running on the ml.t3.medium instance type.

Step 1 - Upgrade SageMaker SDK and dependent packages

Heterogeneous Clusters for Amazon SageMaker model training was announced on 07/08/2022. Using this feature requires up-to-date versions of the SageMaker Python SDK and the boto3 client library.

[ ]:
%%bash
python3 -m pip install --upgrade boto3 botocore awscli sagemaker

Step 2 - Restart the notebook kernel

[ ]:
# Uncomment the two lines below to restart the kernel so the upgraded packages are loaded.
# import IPython
# IPython.Application.instance().kernel.do_shutdown(True)

Step 3 - Validate SageMaker Python SDK and TensorFlow versions

Ensure the output of the cell below reflects:

  • SageMaker Python SDK 2.98.0 or above

  • boto3 1.24 or above

  • botocore 1.27 or above

  • TensorFlow 2.6 or above

[ ]:
!pip show sagemaker boto3 botocore tensorflow protobuf | grep -E 'Name|Version|---'
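If you prefer a programmatic check over reading the `pip show` output, the comparison can be sketched as below. This is an illustrative helper, not part of the SageMaker SDK: the function names are hypothetical, the minimum versions are the ones listed in this step, and the version parsing is deliberately simple (numeric release segments only). It also assumes TensorFlow is installed under the distribution name `tensorflow`.

```python
from importlib import metadata

# Minimum versions listed in this step.
MINIMUM_VERSIONS = {
    "sagemaker": "2.98.0",
    "boto3": "1.24",
    "botocore": "1.27",
    "tensorflow": "2.6",
}


def parse_version(text):
    """Turn '2.98.0' into (2, 98, 0); stop at the first non-numeric segment."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare numeric release segments, padding the shorter tuple with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_installed(minimums=MINIMUM_VERSIONS):
    """Return {package: (installed_version_or_None, ok)} for each package."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = (None, False)
            continue
        report[name] = (installed, meets_minimum(installed, minimum))
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check_installed().items():
        print(f"{name}: {installed} -> {'OK' if ok else 'needs upgrade'}")
```

Comparing tuples of integers avoids the string-comparison trap where "2.100.0" would sort before "2.98.0".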

B. Run a heterogeneous cluster training job

Step 1: Set up training environment

Import the libraries required to use Heterogeneous Clusters for training. This step also picks up this notebook’s IAM role and creates a SageMaker session.

[ ]:
import os
import json
import datetime

import sagemaker
from sagemaker import get_execution_role
from sagemaker.tensorflow import TensorFlow
from sagemaker.instance_group import InstanceGroup

sess = sagemaker.Session()
role = get_execution_role()

Step 2: Define instance groups

Here we define instance groups. Each instance group includes a different instance type.

[ ]:
data_group = InstanceGroup("data_group", "ml.c5.xlarge", 1)
dnn_group = InstanceGroup("dnn_group", "ml.m4.xlarge", 1)

Step 3: Review the “hello world” training code

[ ]:
!pygmentize source_dir/train.py
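The contents of source_dir/train.py are not reproduced on this page. As a rough idea of what the Option 1 part of such a script could look like, here is a minimal sketch, not the actual file. It assumes the `sagemaker_training` package is available in the training container and that its `Environment` object exposes the attributes that appear in the sample log output in Step 6; `format_env_report` and `main` are hypothetical names.

```python
import os

# Used here only as a signal that the code is running inside a training container.
RESOURCE_CONFIG_PATH = "/opt/ml/input/config/resourceconfig.json"


def format_env_report(env):
    """Build the Option 1 printout from an Environment-like object."""
    return [
        f"env.is_hetero: {env.is_hetero}",
        f"env.current_host: {env.current_host}",
        f"env.current_instance_type: {env.current_instance_type}",
        f"env.current_instance_group: {env.current_instance_group}",
        f"env.current_instance_group_hosts: {env.current_instance_group_hosts}",
        f"env.instance_groups: {env.instance_groups}",
    ]


def main():
    if not os.path.exists(RESOURCE_CONFIG_PATH):
        print("Not inside a SageMaker training container; nothing to report.")
        return
    # Imported lazily because the package is normally only present in the container.
    from sagemaker_training import environment

    for line in format_env_report(environment.Environment()):
        print(line)


if __name__ == "__main__":
    main()
```

Keeping the formatting in a separate function makes it possible to exercise the printout locally with any object that carries the same attributes.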

Step 4: Configure the Estimator

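To run the training code on SageMaker, we’ll create an Estimator that defines how to use the container to train. This includes the configuration we need to invoke SageMaker training. Note that the instance_groups argument takes the place of instance_type and instance_count, which are left commented out below.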

[ ]:
estimator = TensorFlow(
    entry_point="train.py",
    source_dir="./source_dir",
    # instance_type='ml.m4.xlarge',
    # instance_count=1,
    instance_groups=[
        data_group,
        dnn_group,
    ],
    framework_version="2.9",
    py_version="py39",
    role=role,
    volume_size=10,
    max_run=3600,
    disable_profiler=True,
)

Step 5: Submit the training job

Here you are submitting the heterogeneous cluster training job.

[ ]:
estimator.fit(
    job_name="hello-world-heterogeneous"
    + "-"
    + datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ"),
)
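The timestamp suffix above keeps each job name unique. A small helper that builds and validates such a name before submission could look like the sketch below. The helper `make_job_name` is hypothetical, and the naming rule it checks (at most 63 characters, alphanumerics and hyphens, starting and ending with an alphanumeric) is my reading of the SageMaker training job name constraint, so verify it against the API reference before relying on it.

```python
import re
from datetime import datetime, timezone

# Assumed naming rule: alphanumerics and hyphens, alphanumeric at both ends.
JOB_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9])*$")
MAX_JOB_NAME_LENGTH = 63


def make_job_name(prefix, now=None):
    """Append a UTC timestamp to prefix and validate the result."""
    now = now or datetime.now(timezone.utc)
    name = f"{prefix}-{now.strftime('%Y%m%dT%H%M%SZ')}"
    if len(name) > MAX_JOB_NAME_LENGTH:
        raise ValueError(f"job name is {len(name)} characters, limit is {MAX_JOB_NAME_LENGTH}")
    if not JOB_NAME_PATTERN.match(name):
        raise ValueError(f"job name contains unsupported characters: {name}")
    return name


if __name__ == "__main__":
    print(make_job_name("hello-world"))
```

The result can be passed as `job_name` to `estimator.fit`. Using `datetime.now(timezone.utc)` yields the same timestamp format as the cell above while avoiding `utcnow()`, which is deprecated in recent Python versions.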

Step 6: Review the logs for environment information

Wait for the training job to finish, then review its logs in the AWS Console (click View logs from the Training Jobs node in the Amazon SageMaker Console). You’ll find two logs, one per instance: algo-1 and algo-2. Examine the printouts from each node to see how the instance group environment information is retrieved. An example is shown here:

Option-1: Read instance group information from the sagemaker_training.environment.Environment class
env.is_hetero: True
env.current_host: algo-1
env.current_instance_type: ml.c5.xlarge
env.current_instance_group: data_group
env.current_instance_group_hosts: ['algo-1']
env.instance_groups: ['data_group', 'dnn_group']

Option-2: Read instance group information from {file_path}.            You'll need to parse the json yourself. This doesn't require an additional library.
/opt/ml/input/config/resourceconfig.json dump = {
    "current_group_name": "data_group",
    "current_host": "algo-1",
    "current_instance_type": "ml.c5.xlarge",
    "hosts": [
        "algo-1",
        "algo-2"
    ],
    "instance_groups": [
        {
            "hosts": [
                "algo-1"
            ],
            "instance_group_name": "data_group",
            "instance_type": "ml.c5.xlarge"
        },
        {
            "hosts": [
                "algo-2"
            ],
            "instance_group_name": "dnn_group",
            "instance_type": "ml.m4.xlarge"
        }
    ],
    "network_interface_name": "eth0"
}
env.is_hetero: True
current_host=algo-1
current_instance_type=ml.c5.xlarge
env.current_instance_group: data_group
env.current_instance_group_hosts: TODO
env.instance_groups: TODO
env.instance_groups_dict: [{'instance_group_name': 'data_group', 'instance_type': 'ml.c5.xlarge', 'hosts': ['algo-1']}, {'instance_group_name': 'dnn_group', 'instance_type': 'ml.m4.xlarge', 'hosts': ['algo-2']}]
env.distribution_hosts: TODO
env.distribution_instance_groups: TODO

C. Next steps

In this notebook, we demonstrated how to retrieve the environment information and determine which instance group an instance belongs to. Based on this, you can build logic to offload data processing tasks in your training job to a dedicated instance group. To see how that can be done in a real-world example, we suggest going through the following notebook examples:

  • TensorFlow’s tf.data.service based Amazon SageMaker Heterogeneous Clusters for training

  • PyTorch and gRPC distributed dataloader based Amazon SageMaker Heterogeneous Clusters for training
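The branching logic mentioned in this section can be sketched as follows. This is only an illustration of the idea: the handler functions are hypothetical placeholders that return a description instead of doing real work, and the group names are the ones defined in Step 2 of section B. The input dictionary has the same shape as the resourceconfig.json dump shown in Step 6.

```python
def run_data_service(hosts):
    """Placeholder for data preprocessing work (hypothetical)."""
    return f"data preprocessing on {hosts}"


def run_dnn_training(hosts):
    """Placeholder for neural network training work (hypothetical)."""
    return f"model training on {hosts}"


# Maps each instance group name to the work its instances should perform.
HANDLERS = {"data_group": run_data_service, "dnn_group": run_dnn_training}


def hosts_in_group(resource_config, group_name):
    """Return the host names that belong to the named instance group."""
    for group in resource_config["instance_groups"]:
        if group["instance_group_name"] == group_name:
            return group["hosts"]
    raise KeyError(f"unknown instance group: {group_name}")


def dispatch(resource_config, handlers=HANDLERS):
    """Run the handler registered for the current instance's group."""
    group_name = resource_config["current_group_name"]
    if group_name not in handlers:
        raise KeyError(f"no handler registered for group: {group_name}")
    return handlers[group_name](hosts_in_group(resource_config, group_name))
```

Each instance runs the same entry point, so a lookup table keyed by group name keeps the per-group behavior in one place and fails loudly when a group has no handler.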
