[ ]:
!pip install -qU horovod
Identify a CPU bottleneck caused by a callback process with Amazon SageMaker Debugger
In this notebook we demonstrate how to identify a training bottleneck that is caused by a TensorFlow Keras callback. To simulate this type of bottleneck, we will program the callback associated with the tensor-monitoring feature of Amazon SageMaker Debugger to collect an excessive number of tensors at a high frequency.
Install sagemaker
To use the new Debugger profiling features, ensure that you have the latest version of SageMaker SDK installed. The following cell updates the library and restarts the Jupyter kernel to apply the updates.
[ ]:
import sys
import IPython
install_needed = False  # should only be True once
if install_needed:
    print("installing deps and restarting kernel")
    !{sys.executable} -m pip install -U sagemaker
    IPython.Application.instance().kernel.do_shutdown(True)
1. Prepare training dataset
TensorFlow Datasets package
First of all, set the notebook kernel to TensorFlow 2.x.
We will use the CIFAR-10 dataset for this experiment. To download the CIFAR-10 dataset and convert it into TFRecord format, install the tensorflow-datasets package, run demo/generate_cifar10_tfrecords.py, and upload the TFRecord files to your S3 bucket.
[ ]:
!python demo/generate_cifar10_tfrecords.py --data-dir=./data
[ ]:
import sagemaker
s3_bucket = sagemaker.Session().default_bucket()
dataset_prefix = "data/cifar10-tfrecords"
desired_s3_uri = f"s3://{s3_bucket}/{dataset_prefix}"
dataset_location = sagemaker.s3.S3Uploader.upload(local_path="data", desired_s3_uri=desired_s3_uri)
print(f"Dataset uploaded to {dataset_location}")
2. Create a Training Job with Profiling Enabled
We will use the standard SageMaker Estimator API for TensorFlow to create a training job. To enable profiling, we create a ProfilerConfig object and pass it to the profiler_config parameter of the TensorFlow estimator. For this demo, we set the profiler to probe the system once every 500 milliseconds.
Set a profiler configuration
[ ]:
from sagemaker.debugger import ProfilerConfig, FrameworkProfile
profiler_config = ProfilerConfig(
    system_monitor_interval_millis=500,
    framework_profile_params=FrameworkProfile(
        local_path="/opt/ml/output/profiler/", start_step=5, num_steps=2
    ),
)
Configure Debugger hook
We configure the Debugger hook to collect an excessive number of tensors every 50 steps.
[ ]:
from sagemaker.debugger import DebuggerHookConfig, CollectionConfig

debugger_hook_config = DebuggerHookConfig(
    hook_parameters={"save_interval": "50"},
    collection_configs=[
        CollectionConfig(name="outputs"),
        CollectionConfig(name="gradients"),
        CollectionConfig(name="weights"),
        CollectionConfig(name="layers"),
    ],
)
Define hyperparameters
The start-up script is set to train_tf_bottleneck.py. Define hyperparameters such as the number of epochs and the batch size.
[ ]:
hyperparameters = {"epoch": 2, "batch_size": 128}
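SageMaker script mode passes these hyperparameters to the entry script as command-line flags (for example, --epoch 2 --batch_size 128). The following is only a rough sketch of how train_tf_bottleneck.py might consume them; the script's actual argument handling may differ.
[ ]:
# Hypothetical sketch of script-mode hyperparameter parsing;
# the real train_tf_bottleneck.py may handle arguments differently.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--epoch", type=int, default=2)
parser.add_argument("--batch_size", type=int, default=128)
args, _ = parser.parse_known_args()  # tolerate unrelated flags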
Get the image URI
The image that we will use depends on the region in which you are running this notebook.
[ ]:
import boto3
session = boto3.session.Session()
region = session.region_name
image_uri = f"763104351884.dkr.ecr.{region}.amazonaws.com/tensorflow-training:2.3.1-gpu-py37-cu110-ubuntu18.04"
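Alternatively, you can let the SageMaker SDK resolve the image URI for you. A minimal sketch using sagemaker.image_uris.retrieve, assuming a TensorFlow 2.3.1 / py37 GPU training image is available in your region:
[ ]:
# Sketch: resolve the training image URI through the SDK instead of
# hard-coding the account and tag (the version/py_version combination
# must match an available Deep Learning Container image).
import sagemaker

image_uri = sagemaker.image_uris.retrieve(
    framework="tensorflow",
    region=region,
    version="2.3.1",
    py_version="py37",
    image_scope="training",
    instance_type="ml.p2.xlarge",
)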
Define SageMaker TensorFlow Estimator
To enable profiling, you need to pass the Debugger profiling configuration (profiler_config), the Debugger hook configuration (debugger_hook_config), and the image URI (image_uri) to the estimator. Debugger starts monitoring and profiling when the SageMaker estimator requests the training job.
[ ]:
import sagemaker
from sagemaker.tensorflow import TensorFlow
job_name = "network-bottleneck"
instance_count = 1
instance_type = "ml.p2.xlarge"
entry_script = "train_tf_bottleneck.py"
estimator = TensorFlow(
    role=sagemaker.get_execution_role(),
    image_uri=image_uri,
    base_job_name=job_name,
    instance_type=instance_type,
    instance_count=instance_count,
    entry_point=entry_script,
    source_dir="demo",
    profiler_config=profiler_config,
    debugger_hook_config=debugger_hook_config,
    script_mode=True,
    hyperparameters=hyperparameters,
    input_mode="Pipe",
)
If you see the error TypeError: __init__() got an unexpected keyword argument 'instance_type', your SageMaker Python SDK is outdated. Update it to 2.x by executing the command below, then restart this notebook.
pip install --upgrade sagemaker
Start training job
The following estimator.fit() starts the training job. Passing wait=False would return immediately and run the job in the background, letting you proceed to the dashboard or analysis notebooks while it runs; here we wait for the job to complete.
[ ]:
remote_inputs = {"train": dataset_location + "/train"}
estimator.fit(remote_inputs, wait=True)
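Once the job has saved its first tensors, you can verify what the Debugger hook is persisting. A minimal sketch using smdebug's create_trial, assuming the job has already written debug artifacts to S3:
[ ]:
# Sketch: inspect the tensors saved by the Debugger hook.
from smdebug.trials import create_trial

trial = create_trial(estimator.latest_job_debugger_artifacts_path())
print(f"Number of saved tensors: {len(trial.tensor_names())}")
print(f"Steps with saved tensors: {trial.steps()}")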
3. Monitor the system resource utilization using SageMaker Studio
SageMaker Studio provides a visualization tool for SageMaker Debugger, where you can find the analysis report and the system and framework resource utilization history.
To access this information in SageMaker Studio, click on the last icon on the left to open SageMaker Components and registries and choose Experiments and trials. You will see the list of training jobs. Right-click on the job you want to investigate to open a pop-up menu, then click Open Debugger for insights, which opens a new tab for SageMaker Debugger.
There are two tabs, Overview and Nodes. Overview gives profiling summaries for a quick review, and Nodes gives detailed utilization information on all nodes.
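You can also poll the job and profiler status programmatically instead of through the Studio UI. A minimal sketch using the standard DescribeTrainingJob API via boto3:
[ ]:
# Sketch: check training job and profiling status via the SageMaker API.
import boto3

client = boto3.client("sagemaker")
description = client.describe_training_job(
    TrainingJobName=estimator.latest_training_job.name
)
print(description["TrainingJobStatus"])
print(description.get("ProfilingStatus"))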
4. SageMaker Debugger profiling analysis utilities
We can use the profiling analysis utilities to gain deeper insight into the source of the issue. For this step, we will rely on the bokeh and smdebug packages.
[ ]:
! pip install bokeh==2.1.1
! pip install smdebug==1.0.3
Use smdebug to extract GPU and framework metrics
[ ]:
import boto3
from smdebug.profiler.analysis.notebook_utils.training_job import TrainingJob
from smdebug.profiler.analysis.utils.profiler_data_to_pandas import PandasFrame

training_job_name = estimator.latest_training_job.name
region = boto3.Session().region_name

tj = TrainingJob(training_job_name, region)
pf = PandasFrame(tj.profiler_s3_output_path)

# extract gpu metrics
system_metrics_df = pf.get_all_system_metrics()
gpus = system_metrics_df[system_metrics_df["dimension"] == "GPUUtilization"]
timestamps = gpus["timestamp_us"].to_numpy()
values = gpus["value"].to_numpy()

# extract framework metrics
framework_metrics_df = pf.get_all_framework_metrics(
    selected_framework_metrics=["Step:ModeKeys.TRAIN", "Step:ModeKeys.GLOBAL"]
)
train_steps = framework_metrics_df[
    framework_metrics_df["framework_metric"].isin(["Step:ModeKeys.TRAIN", "Step:ModeKeys.GLOBAL"])
]
start_step = train_steps["start_time_us"].to_numpy()
end_step = train_steps["end_time_us"].to_numpy()
step_num = train_steps["step"].to_numpy()
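Before plotting, it is worth sanity-checking the extracted arrays; the following cell prints their sizes, the mean GPU utilization, and a sample of the captured training steps.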
[ ]:
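# Quick sanity check on the metrics extracted above
print(f"GPU utilization samples: {values.size}, training steps: {step_num.size}")
print(f"Mean GPU utilization: {values.mean():.1f}%")
print(train_steps[["framework_metric", "start_time_us", "end_time_us", "step"]].head())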
Use bokeh to plot the GPU metrics and the training progression on the same graph. This enables us to correlate the two. We can see that the drops in GPU utilization coincide with every 50th step, which are marked in yellow. These are precisely the steps in which we have chosen to capture all of the graph tensors.
[ ]:
import numpy as np
from bokeh.io import output_notebook
from bokeh.models import ColumnDataSource, CustomJS, Div, HoverTool, HBar
from bokeh.models.glyphs import Circle, Line
from bokeh.plotting import figure, show

# render bokeh output inline in the notebook
output_notebook()

plot = figure(
    plot_height=400,
    plot_width=1400,
    x_range=(timestamps[0], timestamps[-1]),
    y_range=(-1, 110),
    tools="crosshair,xbox_select,pan,reset,save,xwheel_zoom",
)
x_range = plot.x_range
plot.xgrid.visible = False
plot.ygrid.visible = False

# mark every 50th step (where all graph tensors are saved) in yellow
colors = np.where(step_num % 50 == 0, "yellow", "purple")

# pad framework metrics to match length of system metrics
pad = values.size - step_num.size
source = ColumnDataSource(
    data=dict(
        x=timestamps,
        y=values,
        left=np.pad(start_step, (0, pad)),
        right=np.pad(end_step, (0, pad)),
        color=np.pad(colors, (0, pad)),
    )
)

# display the selected index range whenever a box selection is made
callback = CustomJS(
    args=dict(s1=source, div=Div(width=250, height=100, height_policy="fixed")),
    code="""
        console.log('Running CustomJS callback now.');
        var inds = s1.selected.indices;
        console.log(inds);
        var line = "<span style=float:left;clear:left;font_size=13px><b> Selected index range: [" + Math.min.apply(Math,inds) + "," + Math.max.apply(Math,inds) + "]</b></span>\\n";
        console.log(line)
        var text = div.text.concat(line);
        var lines = text.split("\\n")
        if (lines.length > 35)
            lines.shift();
        div.text = lines.join("\\n");""",
)
plot.js_on_event("selectiongeometry", callback)

line = Line(x="x", y="y", line_color="white")
circle = Circle(x="x", y="y", fill_alpha=0, line_width=0)
hbar = HBar(
    y=105, height=5, right="right", left="left", fill_color="color", line_cap="round", line_width=0
)

p = plot.add_glyph(source, line)
p = plot.add_glyph(source, circle)
p = plot.add_glyph(source, hbar)

# create tooltip for hover tool
hover = HoverTool(renderers=[p], tooltips=[("index", "$index"), ("(x,y)", "($x, $y)")])
plot.xaxis.axis_label = "Time in microseconds"
plot.yaxis.axis_label = "GPU Utilization"
plot.add_tools(hover)

show(plot, notebook_handle=True)