Profiling TensorFlow Multi GPU Multi Node Training Job with Amazon SageMaker Debugger (SageMaker API)

This notebook walks you through creating a TensorFlow training job with the SageMaker Debugger profiling feature enabled. It creates a multi-GPU, multi-node training job using Horovod.

(Optional) Install SageMaker and SMDebug

To use the new Debugger profiling features released in December 2020, ensure that you have the latest versions of the Boto3, SageMaker, and SMDebug libraries installed. Use the following cell (switch install_needed to True) to update the libraries and restart the Jupyter kernel to apply the updates.

[ ]:
import sys
import IPython

install_needed = False  # should only be True once
if install_needed:
    print("installing deps and restarting kernel")
    !{sys.executable} -m pip install -U boto3 sagemaker smdebug
    # Restart the kernel so the updated libraries are picked up
    IPython.Application.instance().kernel.do_shutdown(True)

1. Create a Training Job with Debugger Enabled

You will learn how to use the Boto3 SageMaker client’s create_training_job() function to start a training job.

Start a SageMaker session and retrieve the current region and the default Amazon S3 bucket URI

[ ]:
import sagemaker

session = sagemaker.Session()
region = session.boto_region_name
bucket = session.default_bucket()
print(region, bucket)

Upload a training script to the S3 bucket

[ ]:
import boto3, tarfile

source = "source.tar.gz"
project = "debugger-boto3-profiling-test"

# Package the training script directory. The entry_point/ script is
# what the sagemaker_program hyperparameter below points to.
tar =, "w:gz")

s3 = boto3.client("s3")
s3.upload_file(source, bucket, project + "/" + source)
[ ]:
upload_file_path = f"s3://{bucket}/{project}/{source}"

Create a Boto3 SageMaker client object

[ ]:
sm = boto3.Session(region_name=region).client("sagemaker")

Configure the request body of the create_training_job() function

The following parameters are required in the request body of the create_training_job() function.

  • TrainingJobName - Specify a job name; the cell below builds one from a prefix and a timestamp so that names stay unique.

  • HyperParameters - Set up the following items:

      • sagemaker_program and sagemaker_submit_directory - The S3 bucket URI of the training script. This enables SageMaker to read the training script from the URI and start a training job.

      • sagemaker_mpi options - Configure these key-value pairs to set up distributed training.

      • You can also add other hyperparameters for your model.

  • AlgorithmSpecification - Specify TrainingImage. In this example, an official TensorFlow DLC image is used. You can also use your own training container images here.

  • RoleArn - To run the following cell, you must specify the SageMaker execution role ARN that you want to use for training.

  • DebugHookConfig and DebugRuleConfigurations - Preconfigured to watch loss values and detect a loss-not-decreasing issue.

  • ProfilerConfig and ProfilerRuleConfigurations - Preconfigured to collect system and framework metrics, initiate all profiling rules, and create a Debugger profiling report.

Important: For DebugRuleConfigurations and ProfilerRuleConfigurations, to run the following cell you must specify the right Debugger rule image URI from Amazon SageMaker Debugger Registry URLs for Built-in Rule Evaluators. The sagemaker.debugger.get_rule_container_image_uri(region) function retrieves the correct Debugger rule image for your region automatically.

[ ]:
import datetime

training_job_name = "profiler-boto3-" +"%Y-%m-%d-%H-%M-%S")

        "sagemaker_program": "entry_point/",
        "sagemaker_submit_directory": "s3://" + bucket + "/" + project + "/" + source,
        "sagemaker_mpi_custom_mpi_options": "-verbose -x HOROVOD_TIMELINE=./hvd_timeline.json -x NCCL_DEBUG=INFO -x OMPI_MCA_btl_vader_single_copy_mechanism=none",
        "sagemaker_mpi_enabled": "true",
        "sagemaker_mpi_num_of_processes_per_host": "4",
        # Replace <repository>:<tag> with a TensorFlow GPU training image from
        # the AWS Deep Learning Containers list for your region
        "TrainingImage": "763104351884.dkr.ecr."
        + region
        + "<repository>:<tag>",
        "TrainingInputMode": "File",
        "EnableSageMakerMetricsTimeSeries": False,
    # You must specify your SageMaker execution role ARN here
        "S3OutputPath": "s3://" + bucket + "/" + project + "/debug-output",
        "CollectionConfigurations": [
            {"CollectionName": "losses", "CollectionParameters": {"train.save_interval": "50"}}
            "RuleConfigurationName": "LossNotDecreasing",
            # get_rule_container_image_uri retrieves the correct Debugger rule
            # image URI for the region
            "RuleEvaluatorImage": sagemaker.debugger.get_rule_container_image_uri(region),
            "RuleParameters": {"rule_to_invoke": "LossNotDecreasing"},
        "S3OutputPath": "s3://" + bucket + "/" + project + "/profiler-output",
        "ProfilingIntervalInMilliseconds": 500,
        "ProfilingParameters": {
            "DataloaderProfilingConfig": '{"StartStep": 5, "NumSteps": 3, "MetricsRegex": ".*"}',
            "DetailedProfilingConfig": '{"StartStep": 5, "NumSteps": 3}',
            "PythonProfilingConfig": '{"StartStep": 5, "NumSteps": 3, "ProfilerName": "cprofile", "cProfileTimer": "total_time"}',
            "LocalPath": "/opt/ml/output/profiler/",  # Optional. Local path for profiling outputs
            "RuleConfigurationName": "ProfilerReport",
            # get_rule_container_image_uri retrieves the correct Debugger rule
            # image URI for the region
            "RuleEvaluatorImage": sagemaker.debugger.get_rule_container_image_uri(region),
            "RuleParameters": {"rule_to_invoke": "ProfilerReport"},

2. Analyze Profiling Data

Install the SMDebug client library to use Debugger analysis tools

[ ]:
import pip

def import_or_install(package):
    except ImportError:
        pip.main(["install", package])

Use SMDebug to retrieve saved output data and use analysis tools

While the training is still in progress, you can visualize the performance data in SageMaker Studio or in the notebook. Debugger provides utilities to plot system metrics in the form of timeline charts or heatmaps. Check out the notebook interactive_analysis_profiling_data.ipynb for more details. In the following code cell, we plot the total CPU and GPU utilization as time series charts. To visualize other metrics such as I/O, memory, and network, you simply need to extend the lists passed to select_dimensions and select_events.

[ ]:
from smdebug.profiler.analysis.notebook_utils.training_job import TrainingJob

tj = TrainingJob(training_job_name, region)
[ ]:
from smdebug.profiler.analysis.notebook_utils.timeline_charts import TimelineCharts

system_metrics_reader = tj.get_systems_metrics_reader()
framework_metrics_reader = tj.get_framework_metrics_reader()

view_timeline_charts = TimelineCharts(
    select_dimensions=["CPU", "GPU"],

3. Download Debugger Profiling Report

The ProfilerReport rule creates an HTML report, profiler-report.html, with a summary of the built-in rule results and recommendations for next steps. You can find this report in your S3 bucket.

[ ]:
rule_output_path = (
    + bucket
    + "/"
    + project
    + "/output/"
    + training_job_name
    + "/rule-output/ProfilerReport/profiler-output/"
[ ]:
! aws s3 ls {rule_output_path} --recursive
[ ]:
! aws s3 cp {rule_output_path} . --recursive
[ ]:
from IPython.display import FileLink

display("Click link below to view the profiler report", FileLink("profiler-report.html"))

Note: If you are using JupyterLab, make sure you click ``Trust HTML`` at the top left corner after you open the report.
