[ ]:
!pip install -qU horovod

Identify a CPU bottleneck caused by a callback process with Amazon SageMaker Debugger

In this notebook we demonstrate how to identify a training bottleneck that is caused by a TensorFlow Keras callback. To simulate this type of bottleneck, we will program the callback associated with the tensor monitoring feature of Amazon SageMaker Debugger to collect an excessive number of tensors at a high frequency.

Install sagemaker

To use the new Debugger profiling features, ensure that you have the latest version of the SageMaker Python SDK installed. The following cell updates the library and restarts the Jupyter kernel to apply the updates.

[ ]:
import sys
import IPython
install_needed = False  # should only be True once
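if install_needed:
    print("installing deps and restarting kernel")
    !{sys.executable} -m pip install -U sagemaker
    # restart the kernel so that the upgraded package is picked up
    IPython.Application.instance().kernel.do_shutdown(True)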

1. Prepare training dataset

TensorFlow Datasets package

First of all, set the notebook kernel to TensorFlow 2.x.

We will use the CIFAR-10 dataset for this experiment. To download the CIFAR-10 dataset and convert it into TFRecord format, install the tensorflow-datasets package, run demo/generate_cifar10_tfrecords.py, and upload the TFRecord files to your S3 bucket.

[ ]:
!python demo/generate_cifar10_tfrecords.py --data-dir=./data
[ ]:
import sagemaker

s3_bucket = sagemaker.Session().default_bucket()

dataset_prefix = "data/cifar10-tfrecords"
desired_s3_uri = f"s3://{s3_bucket}/{dataset_prefix}"

dataset_location = sagemaker.s3.S3Uploader.upload(local_path="data", desired_s3_uri=desired_s3_uri)
print(f"Dataset uploaded to {dataset_location}")

2. Create a Training Job with Profiling Enabled

We will use the standard SageMaker Estimator API for TensorFlow to create a training job. To enable profiling, we create a ProfilerConfig object and pass it to the profiler_config parameter of the TensorFlow estimator. For this demo, we set the profiler to probe the system once every 500 milliseconds.

Set a profiler configuration

[ ]:
from sagemaker.debugger import ProfilerConfig, FrameworkProfile

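profiler_config = ProfilerConfig(
    # probe the system once every 500 milliseconds
    system_monitor_interval_millis=500,
    framework_profile_params=FrameworkProfile(
        local_path="/opt/ml/output/profiler/", start_step=5, num_steps=2
    ),
)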

Configure Debugger hook

We configure the debugger hook to collect an excessive number of tensors, every 50 steps.

[ ]:
import os
from sagemaker.debugger import DebuggerHookConfig, CollectionConfig

debugger_hook_config = DebuggerHookConfig(
    hook_parameters={"save_interval": "50"},
    # collection_configs=[CollectionConfig(name=...), ...] selects the tensor collections to save
)
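
To make the effect of save_interval concrete, here is a minimal sketch of which steps would trigger tensor collection. It assumes that a step is saved when its number is a multiple of the interval, which matches the `step_num % 50 == 0` marking used in the plotting cell later in this notebook. The helper name `saved_steps` is hypothetical and not part of the SageMaker SDK.

```python
def saved_steps(total_steps, save_interval):
    """Return the step numbers at which tensors would be saved.

    Assumption: a step is saved when its number is a multiple of save_interval.
    """
    if save_interval <= 0:
        raise ValueError("save_interval must be positive")
    return [step for step in range(total_steps) if step % save_interval == 0]


# hook_parameters values are passed as strings, so convert before doing arithmetic
interval = int("50")
print(saved_steps(120, interval))  # prints [0, 50, 100]
```

Lowering the interval increases the number of saved steps, and with it the time the callback spends on the CPU.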

Define hyperparameters

The entry point script is set to train_tf_bottleneck.py. Define hyperparameters such as the number of epochs and the batch size.

[ ]:
hyperparameters = {"epoch": 2, "batch_size": 128}
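
In script mode, SageMaker forwards the hyperparameters to the entry point script as command-line flags. The contents of train_tf_bottleneck.py are not shown in this notebook, so the following is only a sketch of how such a script might read them. The function name `parse_hyperparameters` and the default values are assumptions.

```python
import argparse


def parse_hyperparameters(argv=None):
    """Parse the hyperparameters that the estimator forwards as CLI flags."""
    parser = argparse.ArgumentParser()
    # Flag names mirror the keys of the hyperparameters dict defined above.
    parser.add_argument("--epoch", type=int, default=1)
    parser.add_argument("--batch_size", type=int, default=128)
    # parse_known_args ignores any additional flags the container may pass.
    args, _unknown = parser.parse_known_args(argv)
    return args


# Simulate the flags for {"epoch": 2, "batch_size": 128} plus one extra flag.
args = parse_hyperparameters(
    ["--epoch", "2", "--batch_size", "128", "--model_dir", "s3://example"]
)
print(args.epoch, args.batch_size)  # prints 2 128
```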

Get the image URI

The image that we will use depends on the region in which you are running this notebook.

[ ]:
import boto3

session = boto3.session.Session()
region = session.region_name

image_uri = f"763104351884.dkr.ecr.{region}.amazonaws.com/tensorflow-training:2.3.1-gpu-py37-cu110-ubuntu18.04"
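
Note that `session.region_name` is None when no default region is configured, in which case the f-string above silently produces an invalid URI. A small guard, sketched below, surfaces this early. The helper `build_image_uri` is hypothetical; the account ID and tag are copied from the cell above, and whether they apply to every region is not verified here.

```python
def build_image_uri(
    region,
    account="763104351884",
    tag="2.3.1-gpu-py37-cu110-ubuntu18.04",
):
    """Build the TensorFlow training image URI for a region, failing fast if unset."""
    if not region:
        raise ValueError("No AWS region is configured for this session")
    return f"{account}.dkr.ecr.{region}.amazonaws.com/tensorflow-training:{tag}"


print(build_image_uri("us-west-2"))
```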

Define SageMaker TensorFlow Estimator

To enable profiling, pass the Debugger profiling configuration (profiler_config) and the image URI (image_uri) to the estimator; a list of Debugger rules (rules) can be passed as well. Debugger enables monitoring and profiling while the SageMaker estimator requests a training job.

[ ]:
import sagemaker
from sagemaker.tensorflow import TensorFlow

job_name = "network-bottleneck"
instance_count = 1
instance_type = "ml.p2.xlarge"
entry_script = "train_tf_bottleneck.py"

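estimator = TensorFlow(
    role=sagemaker.get_execution_role(),  # assumption: the notebook runs with a SageMaker execution role
    image_uri=image_uri,
    base_job_name=job_name,
    instance_count=instance_count,
    instance_type=instance_type,
    entry_point=entry_script,
    hyperparameters=hyperparameters,
    profiler_config=profiler_config,
    debugger_hook_config=debugger_hook_config,
)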

If you see the error TypeError: __init__() got an unexpected keyword argument 'instance_type', your SageMaker Python SDK is outdated. Update the SageMaker Python SDK to 2.x by running the command below, then restart this notebook.

pip install --upgrade sagemaker

Start training job

The following estimator.fit() call with the wait=True argument starts the training job and blocks until it completes. To run the dashboard or analysis notebooks while the job is still training, pass wait=False instead, which starts the job in the background.

[ ]:
remote_inputs = {"train": dataset_location + "/train"}

estimator.fit(remote_inputs, wait=True)

3. Monitor the system resource utilization using SageMaker Studio

SageMaker Studio provides a visualization tool for SageMaker Debugger, where you can find the analysis report and the system and framework resource utilization history.

To access this information in SageMaker Studio, click the last icon on the left to open SageMaker Components and registries, then choose Experiments and trials. You will see the list of training jobs. Right-click the job you want to investigate to open a pop-up menu, then click Open Debugger for insights, which opens a new tab for SageMaker Debugger.

There are two tabs, Overview and Nodes. Overview gives profiling summaries for quick review, and Nodes gives detailed utilization information for all nodes.

4. SageMaker Debugger profiling analysis utilities

We can use the profiling analysis utilities to gain deeper insights into the source of the issue. For this step, we will rely on the bokeh and smdebug packages.

[ ]:
! pip install bokeh==2.1.1
! pip install smdebug==1.0.3

Use smdebug to extract GPU and framework metrics

[ ]:
import boto3
from smdebug.profiler.analysis.notebook_utils.training_job import TrainingJob
from smdebug.profiler.analysis.utils.profiler_data_to_pandas import PandasFrame

training_job_name = estimator.latest_training_job.name
region = boto3.Session().region_name

tj = TrainingJob(training_job_name, region)

pf = PandasFrame(tj.profiler_s3_output_path)

# extract gpu metrics
system_metrics_df = pf.get_all_system_metrics()
gpus = system_metrics_df[system_metrics_df["dimension"] == "GPUUtilization"]
timestamps = gpus["timestamp_us"].to_numpy()
values = gpus["value"].to_numpy()

# extract framework metrics
framework_metrics_df = pf.get_all_framework_metrics(
    selected_framework_metrics=["Step:ModeKeys.TRAIN", "Step:ModeKeys.GLOBAL"]
)
train_steps = framework_metrics_df[
    framework_metrics_df["framework_metric"].isin(["Step:ModeKeys.TRAIN", "Step:ModeKeys.GLOBAL"])
]
start_step = train_steps["start_time_us"].to_numpy()
end_step = train_steps["end_time_us"].to_numpy()
step_num = train_steps["step"].to_numpy()
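
Before plotting, the same correlation can be checked numerically by averaging the GPU utilization samples that fall inside each step's time window. The sketch below uses plain Python sequences and made-up numbers. The helper `mean_utilization_per_step` is hypothetical and not part of smdebug, and it assumes that the system and framework timestamps share the same clock and unit.

```python
def mean_utilization_per_step(timestamps, values, start_step, end_step):
    """Average the utilization samples that fall inside each step's time window.

    Returns one entry per step; None when no sample falls inside the window.
    """
    means = []
    for start, end in zip(start_step, end_step):
        inside = [v for t, v in zip(timestamps, values) if start <= t <= end]
        means.append(sum(inside) / len(inside) if inside else None)
    return means


# Synthetic data (timestamps in microseconds, utilization in percent).
ts = [0, 10, 20, 30, 40, 50]
util = [90, 92, 10, 12, 91, 93]
starts = [0, 20, 40]
ends = [15, 35, 55]
print(mean_utilization_per_step(ts, util, starts, ends))  # prints [91.0, 11.0, 92.0]
```

With a 500 millisecond sampling interval, steps shorter than the interval may contain no sample at all, which is why the helper returns None for those steps rather than zero.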

Use bokeh to plot the GPU metrics and the training progression on the same graph. This enables us to correlate the two. We can see that the drops in GPU utilization coincide with every 50th step; these steps are marked in yellow. These are precisely the steps in which we have chosen to capture all of the graph tensors.

[ ]:
import numpy as np
from bokeh.models import ColumnDataSource, CustomJS, Div, HoverTool, HBar
from bokeh.models.glyphs import Circle, Line
from bokeh.plotting import figure, show

plot = figure(
    x_range=(timestamps[0], timestamps[-1]),
    y_range=(-1, 110),
)
x_range = plot.x_range

plot.xgrid.visible = False
plot.ygrid.visible = False

colors = np.where(step_num % 50 == 0, "yellow", "purple")

# pad framework metrics to match length of system metrics
pad = values.size - step_num.size
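source = ColumnDataSource(
    data=dict(
        # x and y are the columns referenced by the Line and Circle glyphs below
        x=timestamps,
        y=values,
        left=np.pad(start_step, (0, pad)),
        right=np.pad(end_step, (0, pad)),
        color=np.pad(colors, (0, pad)),
    )
)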

callback = CustomJS(
    args=dict(s1=source, div=Div(width=250, height=100, height_policy="fixed")),
    code="""
        console.log('Running CustomJS callback now.');
        var inds = s1.selected.indices;
        var line = "<span style=float:left;clear:left;font_size=13px><b> Selected index range: [" + Math.min.apply(Math,inds) + "," + Math.max.apply(Math,inds) + "]</b></span>\\n";
        var text = div.text.concat(line);
        var lines = text.split("\\n")
        // drop the oldest entry so that at most 35 lines are kept
        if (lines.length > 35)
            lines.shift();
        div.text = lines.join("\\n");""",
)

plot.js_on_event("selectiongeometry", callback)

line = Line(x="x", y="y", line_color="white")
circle = Circle(x="x", y="y", fill_alpha=0, line_width=0)
hbar = HBar(
    y=105, height=5, right="right", left="left", fill_color="color", line_cap="round", line_width=0
)

p = plot.add_glyph(source, line)
p = plot.add_glyph(source, circle)
p = plot.add_glyph(source, hbar)

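# create tooltip for hover tool and attach it to the plot
hover = HoverTool(renderers=[p], tooltips=[("index", "$index"), ("(x,y)", "($x, $y)")])
plot.add_tools(hover)

# the x values come from the timestamp_us column, so the unit is microseconds
plot.xaxis.axis_label = "Time in us"
plot.yaxis.axis_label = "GPU Utilization"
show(plot, notebook_handle=True)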
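
As a complement to the visual inspection, the steps that take unusually long can be listed directly from the step start and end times. The sketch below flags steps whose duration exceeds a multiple of the median duration. The helper `slow_steps` and the threshold `factor` are hypothetical choices for illustration, and the numbers are synthetic.

```python
from statistics import median


def slow_steps(step_num, start_step, end_step, factor=2.0):
    """Return (step, duration) pairs whose duration exceeds factor * median."""
    durations = [end - start for start, end in zip(start_step, end_step)]
    threshold = factor * median(durations)
    return [(s, d) for s, d in zip(step_num, durations) if d > threshold]


# Synthetic example: step 50 takes seven times longer than its neighbors.
steps = [48, 49, 50, 51, 52]
starts = [0, 100, 200, 900, 1000]
ends = [100, 200, 900, 1000, 1100]
print(slow_steps(steps, starts, ends))  # prints [(50, 700)]
```

If the flagged steps are all multiples of the save interval, that points to the tensor collection callback as the source of the bottleneck.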