# install dependencies
!pip install -Uq sagemaker

Using Amazon SageMaker Debugger for PyTorch Training Jobs

Amazon SageMaker is a managed platform to build, train, and host machine learning models. Amazon SageMaker Debugger is a feature that lets you debug machine learning and deep learning models during training by detecting problems with the models in real time.

Amazon SageMaker also gives you the option of bringing your own algorithms packaged in a custom container, which can then be trained and deployed in the Amazon SageMaker environment.

This notebook guides you through an example of using your own container with PyTorch for training, together with Amazon SageMaker Debugger.

How does Amazon SageMaker Debugger work?

Amazon SageMaker Debugger lets you go beyond looking at scalars like losses and accuracies during training and gives you full visibility into all tensors ‘flowing through the graph’. It also helps you monitor your training in real time using rules and CloudWatch events, so that you can react to common training issues such as vanishing gradients or poor weight initialization.


  • Output Tensor: These are the artifacts that define the state of the training job at any particular instant in its lifecycle.

  • Debug Hook: Captures the tensors flowing through the training computational graph every N steps.

  • Debugging Rule: Logic to analyze the tensors captured by the hook and report anomalies.
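The three concepts above can be illustrated with a toy sketch in plain Python. It does not use the SMDebug API: `ToyHook`, `vanishing_gradient_rule`, the threshold, and the synthetic gradients are all hypothetical names and values chosen for illustration.

```python
from statistics import mean


class ToyHook:
    """Hypothetical stand-in for a debug hook: keeps tensors every N steps."""

    def __init__(self, save_interval):
        self.save_interval = save_interval
        self.saved = {}  # step -> {tensor name: list of floats}

    def capture(self, step, tensors):
        # Only every save_interval-th step is stored as an output tensor.
        if step % self.save_interval == 0:
            self.saved[step] = tensors


def vanishing_gradient_rule(saved, threshold=1e-7):
    """Toy rule: return saved steps whose mean absolute gradient is below threshold."""
    flagged = []
    for step, tensors in sorted(saved.items()):
        if mean(abs(g) for g in tensors["gradient"]) < threshold:
            flagged.append(step)
    return flagged


hook = ToyHook(save_interval=10)
for step in range(50):
    # Synthetic gradients that shrink by a factor of 100 every 10 steps.
    scale = 10.0 ** (-2 * (step // 10))
    hook.capture(step, {"gradient": [scale, -scale, 0.5 * scale]})

flagged_steps = vanishing_gradient_rule(hook.saved)
print(sorted(hook.saved))  # steps the hook kept
print(flagged_steps)       # steps the rule reports as anomalous
```

The hook decides what is saved and how often, while the rule only reads what was saved. This separation is why rules can run in parallel with, or after, the training job.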

With these concepts in mind, let’s look at the overall flow that Amazon SageMaker Debugger uses to orchestrate debugging.

It operates in two steps: saving tensors and analyzing them.

Saving tensors

The tensors that the debug hook captures are stored in an S3 location that you specify. There are two ways to configure Amazon SageMaker Debugger for storage:

  1. Zero code change (DEPRECATED for PyTorch versions >= 1.12): If you use any of the SageMaker-provided Deep Learning Containers, you don’t need to make any changes to your training script for tensors to be stored. Amazon SageMaker Debugger uses the configuration you provide in the framework Estimator to save tensors in the fashion you specify.

    Note: For PyTorch training, Debugger collects output tensors in GLOBAL mode by default. In other words, this option does not distinguish output tensors from different phases within an epoch, such as the training phase and the validation phase.

  2. Script change: Use the SageMaker Debugger client library, SMDebug, and customize training scripts to save the specific tensors you want at different frequencies and configurations. Refer to the Developer Guide for details on how to use SageMaker Debugger with your choice of framework in your training script.

In this notebook, we choose the second option to properly save the output tensors from different training phases, because we’re using PyTorch 1.12.

Analysis of tensors

Once tensors are saved, Amazon SageMaker Debugger can be configured to run debugging *Rules* on them. At a high level, a rule is Python code that detects certain conditions during training. Conditions that a data scientist training an algorithm might want to monitor include gradients getting too large or too small, overfitting, and so on. Amazon SageMaker Debugger comes pre-packaged with certain built-in rules, and you can write your own rules using the Amazon SageMaker Debugger APIs. You can also analyze raw tensor data outside the Rules construct in a notebook, using Amazon SageMaker Debugger’s full set of APIs.

Import SageMaker Python SDK and install required packages

import sagemaker


This notebook works with the SageMaker Python SDK version 2.39.1 or later.

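import sys

def import_or_install(package):
    # Try to import the package; install it with pip only if the import fails.
    try:
        __import__(package)
    except ImportError:
        !{sys.executable} -m pip install {package}

required_packages = ["smdebug", "pytest"]

for package in required_packages:
    import_or_install(package)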

Modify a PyTorch training script

We will focus on how to modify a training script to save tensors by registering debug hooks and specifying which tensors to save.

The model used for this notebook is trained with the MNIST dataset. The example is based on https://github.com/pytorch/examples/blob/master/mnist/main.py (the version as of October 2020).

Modifying the training script

Before we define a PyTorch estimator and start training, we will explore parts of the training script in detail. (The entire training script can be found at ./scripts/pytorch_mnist.py).

  • Step 1: Import Amazon SageMaker Debugger client library, SMDebug.

    import smdebug.pytorch as smd
  • Step 2: In the train() function, set the SMDebug hook for PyTorch to TRAIN mode.

    hook.set_mode(smd.modes.TRAIN)
  • Step 3: In the test() function, set the SMDebug hook for PyTorch to EVAL mode.

    hook.set_mode(smd.modes.EVAL)
  • Step 4: In the main() function, create the SMDebug hook and register it with the model and loss function.

    hook = smd.Hook.create_from_json_file()
    hook.register_hook(model)
    hook.register_loss(loss_fn)
  • Step 5: In the main() function, pass the SMDebug hook to the train() and test() functions in the epoch loop.

    train(args, model, loss_fn, device, train_loader, optimizer, epoch, hook)
    test(model, device, loss_fn, test_loader, hook)
!pygmentize ./scripts/pytorch_mnist.py

Set up a PyTorch estimator and run a training job

Once these changes are made in the training script, Amazon SageMaker Debugger will start saving tensors during training into a specified output S3 bucket.

Now, we will set up the estimator and start training using the modified training script.

from __future__ import absolute_import

import boto3
import pytest
from sagemaker.pytorch import PyTorch
from sagemaker import get_execution_role
from sagemaker.debugger import (
    Rule,
    DebuggerHookConfig,
    rule_configs,
)

Define the configuration of the training job to run. ecr_image is where you can provide a link to your own container image. The hyperparameters are passed to the training script; the data directory (where the training dataset is stored) and the smdebug directory (where the tensors will be saved) are mandatory fields.

hyperparameters = {"epochs": "5", "batch-size": "32", "test-batch-size": "100", "lr": "0.001"}
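As a rough illustration of how these values reach the script, the sketch below assumes the hyperparameters arrive as `--name value` command-line arguments and that the script reads them with `argparse`, as the upstream MNIST example does. The `parse_args` helper and its default values are placeholders, not the notebook's actual script.

```python
import argparse


def parse_args(argv):
    """Parse the hyperparameters; the defaults here are illustrative placeholders."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--batch-size", type=int, default=64)
    parser.add_argument("--test-batch-size", type=int, default=1000)
    parser.add_argument("--lr", type=float, default=1.0)
    return parser.parse_args(argv)


hyperparameters = {"epochs": "5", "batch-size": "32", "test-batch-size": "100", "lr": "0.001"}

# Flatten the dictionary into the argument list the script would receive.
argv = []
for name, value in hyperparameters.items():
    argv += [f"--{name}", value]

args = parse_args(argv)
print(args.epochs, args.batch_size, args.test_batch_size, args.lr)
```

The values are given as strings in the dictionary, and the `type=` converters in the parser turn them back into numbers.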

Configure a Debugger rule object

The rules parameter accepts a list of rules that you want to evaluate against the output tensors.

In this example, we use Debugger rules that evaluate whether the training job has overfitting, overtraining, or vanishing gradient problems.

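rules = [
    Rule.sagemaker(rule_configs.vanishing_gradient()),
    Rule.sagemaker(rule_configs.overfit()),
    Rule.sagemaker(rule_configs.overtraining()),
]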

For more information about the rules, see the list of Debugger built-in rules in the Amazon SageMaker documentation.

In addition to the model debugging rules above, SageMaker Debugger runs the ProfilerReport rule by default. This rule detects system bottlenecks and autogenerates a profiling report. For more information, see the SageMaker Debugger documentation on the profiling report.

Configure Debugger hook parameters

The following code shows how to adjust save intervals of the output tensors in the different training phases.

hook_config = DebuggerHookConfig(
    hook_parameters={"train.save_interval": "100", "eval.save_interval": "10"}
)
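To get a feel for what these intervals mean, the sketch below estimates how many steps would be saved in each phase. It assumes that step counters run continuously across epochs within each mode, that saving starts at step 0, and that the job uses the standard MNIST split (60,000 training and 10,000 test images) with the batch sizes and epoch count from the hyperparameters above. `saved_steps` is a hypothetical helper, not part of the SageMaker SDK.

```python
def saved_steps(total_steps, save_interval):
    """Steps a hook would keep if it saves every `save_interval` steps from step 0."""
    return [step for step in range(total_steps) if step % save_interval == 0]


hook_parameters = {"train.save_interval": "100", "eval.save_interval": "10"}

epochs = 5
train_steps_per_epoch = 60000 // 32  # 1875 batches of 32 training images
eval_steps_per_epoch = 10000 // 100  # 100 batches of 100 test images

train_saved = saved_steps(
    epochs * train_steps_per_epoch, int(hook_parameters["train.save_interval"])
)
eval_saved = saved_steps(
    epochs * eval_steps_per_epoch, int(hook_parameters["eval.save_interval"])
)

print(len(train_saved), len(eval_saved))
```

Evaluation has far fewer steps per epoch than training, so a shorter interval for EVAL keeps the two phases from being sampled too unevenly.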

Construct a PyTorch estimator with the Debugger parameters

estimator = PyTorch(
    ## Debugger parameters
    rules=rules, debugger_hook_config=hook_config,
)

Start the training job

estimator.fit()

Check SageMaker Debugger rule summaries

As a result of calling the fit() method, Amazon SageMaker Debugger starts a rule evaluation job to monitor vanishing_gradient(), overfit(), and overtraining() issues in parallel with the training job.

The ProfilerReport rule runs for all SageMaker training jobs by default. You will be able to receive a comprehensive training report regarding system bottlenecks and framework profiling.

SageMaker Debugger reports and analysis

Another aspect of the Amazon SageMaker Debugger is analysis. It allows us to perform interactive exploration of the tensors saved in real time or after the job. Here we focus on after-the-fact analysis of the above job. We import the smdebug library, which defines a concept of Trial that represents a single training run. Note how we fetch the path to debugger artifacts for the above job.

Create an SMDebug trial object and retrieve saved output tensors

from smdebug.trials import create_trial
from smdebug.core.modes import ModeKeys

trial = create_trial(estimator.latest_job_debugger_artifacts_path())

Check the number of steps saved in the different training phases

len(trial.steps(mode=ModeKeys.TRAIN))
len(trial.steps(mode=ModeKeys.EVAL))

Set up functions to log and plot the output tensors

def get_data(trial, tname, mode):
    tensor = trial.tensor(tname)
    steps = tensor.steps(mode=mode)
    vals = []
    for s in steps:
        vals.append(tensor.value(s, mode=mode))
    return steps, vals
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import host_subplot

def plot_tensor(trial, tensor_name):

    steps_train, vals_train = get_data(trial, tensor_name, mode=ModeKeys.TRAIN)
    print("loaded TRAIN data")
    steps_eval, vals_eval = get_data(trial, tensor_name, mode=ModeKeys.EVAL)
    print("loaded EVAL data")

    fig = plt.figure(figsize=(10, 7))
    host = host_subplot(111)

    par = host.twiny()

    host.set_xlabel("Steps (TRAIN)")
    par.set_xlabel("Steps (EVAL)")

    (p1,) = host.plot(steps_train, vals_train, label=tensor_name)
    print("completed TRAIN plot")
    (p2,) = par.plot(steps_eval, vals_eval, label="val_" + tensor_name)
    print("completed EVAL plot")
    leg = plt.legend()




plot_tensor(trial, "NLLLoss_output_0")

Reflect on the rule summary report

Recall what the rule summary reported:

Overfit :  IssuesFound
RuleEvaluationConditionMet: Evaluation of the rule Overfit at step 4000 resulted in the condition being met

Based on this rule evaluation and the plot above, we can conclude that the training job has an overfit issue. While the NLLLoss_output_0 line is decreasing, the val_NLLLoss_output_0 line is fluctuating and not decreasing.
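The pattern described here (training loss still falling while validation loss stops improving) can be expressed as a small check. This is only a toy heuristic, not the logic of the built-in Overfit rule. The function name `looks_overfit`, the `patience` parameter, and the loss values are made up for illustration.

```python
def looks_overfit(train_losses, val_losses, patience=3):
    """Toy check: training loss fell at each of the last `patience` points
    while validation loss did not beat its earlier best in that window."""
    if len(train_losses) <= patience or len(val_losses) <= patience:
        return False
    # Compare each of the last `patience` training losses with its predecessor.
    train_falling = all(
        later < earlier
        for earlier, later in zip(train_losses[-patience - 1:], train_losses[-patience:])
    )
    best_before = min(val_losses[:-patience])
    val_stalled = min(val_losses[-patience:]) >= best_before
    return train_falling and val_stalled


train = [0.9, 0.6, 0.4, 0.3, 0.2, 0.1]
val_fluctuating = [0.8, 0.5, 0.55, 0.6, 0.52, 0.58]
val_improving = [0.8, 0.6, 0.5, 0.4, 0.3, 0.25]

overfit_case = looks_overfit(train, val_fluctuating)
healthy_case = looks_overfit(train, val_improving)
print(overfit_case, healthy_case)
```

A check like this could be applied to the values returned by get_data() for the TRAIN and EVAL modes, keeping in mind that the two modes are saved at different intervals.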

To resolve the overfit problem, consider applying or double-checking the following techniques:

  • Regularization

  • Weight initialization

  • Dropout regularization

  • Weight constraints

Download, open, and display the ProfilerReport HTML file

rule_output_path = estimator.output_path + estimator.latest_training_job.job_name + "/rule-output"
! aws s3 ls {rule_output_path} --recursive
! aws s3 cp {rule_output_path} ./ --recursive
import os

# get the autogenerated folder name of profiler report
profiler_report_name = [
    rule["RuleConfigurationName"]
    for rule in estimator.latest_training_job.rule_job_summary()
    if "Profiler" in rule["RuleConfigurationName"]
][0]
import IPython

IPython.display.HTML(filename=profiler_report_name + "/profiler-output/profiler-report.html")