Debugger
Examples of how to use SageMaker Debugger.
Get started with SageMaker Debugger
Debugging
Profiling
Debugging Model Parameters
You can track and debug the model parameters of your training job, such as weights, gradients, biases, and scalar values. Supported frameworks are Apache MXNet, TensorFlow, PyTorch, and XGBoost.
Real-time analysis of deep learning models
Apache MXNet
TensorFlow 2.x
TensorFlow 1.x
PyTorch
XGBoost
Bring your own container
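As a rough illustration of tracking model parameters, the sketch below builds the `DebugHookConfig` portion of a low-level `CreateTrainingJob` request using plain Python dictionaries. The helper name `build_debug_hook_config`, the bucket path, and the save interval are assumptions made for this example. The collection names (`weights`, `gradients`, `biases`) are Debugger's built-in collections. The linked notebooks show the complete, framework-specific setup.

```python
def build_debug_hook_config(s3_output_path, collections):
    """Build a DebugHookConfig dict for a CreateTrainingJob request.

    `collections` maps a collection name (for example "weights") to a dict
    of parameters. Values are converted to strings because the API expects
    string-valued collection parameters.
    """
    return {
        "S3OutputPath": s3_output_path,
        "CollectionConfigurations": [
            {
                "CollectionName": name,
                "CollectionParameters": {k: str(v) for k, v in params.items()},
            }
            for name, params in collections.items()
        ],
    }


# Hypothetical bucket and interval; replace with values for your own job.
debug_hook_config = build_debug_hook_config(
    "s3://example-bucket/debugger-output",
    {
        "weights": {"save_interval": 500},
        "gradients": {"save_interval": 500},
        "biases": {"save_interval": 500},
    },
)
```

The resulting dictionary would be passed as the `DebugHookConfig` field of the training job request. The SageMaker Python SDK offers higher-level classes for the same purpose.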
Profiling System Bottlenecks and Framework Operators
Debugger provides the following profiling features:
Monitoring system bottlenecks – Monitor system resource utilization, such as CPU, GPU, memory, network, and data I/O metrics. This feature is framework and model agnostic and is available for any training job in SageMaker.
Profiling deep learning framework operations – Profile deep learning operations of the TensorFlow and PyTorch frameworks, such as step durations, data loaders, forward and backward operations, Python profiling metrics, and framework-specific metrics.
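To make the system-monitoring feature above more concrete, here is a minimal sketch that builds the `ProfilerConfig` portion of a low-level `CreateTrainingJob` request. The helper name `build_profiler_config` and the bucket path are assumptions made for this example. The set of allowed intervals reflects the values the API documents for `ProfilingIntervalInMillis`, so check the current API reference before relying on it. Framework profiling parameters are omitted here.

```python
# Monitoring intervals (in milliseconds) that the API is documented to accept.
ALLOWED_INTERVALS_MS = (100, 200, 500, 1000, 5000, 60000)


def build_profiler_config(s3_output_path, interval_ms=500):
    """Build a ProfilerConfig dict that enables system resource monitoring."""
    if interval_ms not in ALLOWED_INTERVALS_MS:
        raise ValueError(
            f"interval_ms must be one of {ALLOWED_INTERVALS_MS}, got {interval_ms}"
        )
    return {
        "S3OutputPath": s3_output_path,
        "ProfilingIntervalInMillis": interval_ms,
    }


# Hypothetical bucket; sample system metrics once per second.
profiler_config = build_profiler_config(
    "s3://example-bucket/profiler-output", interval_ms=1000
)
```

Validating the interval locally surfaces a typo before the request is sent rather than as an API error after submission.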
TensorFlow
- Profiling TensorFlow Single GPU Single Node Training Job with Amazon SageMaker Debugger
- Profiling TensorFlow Multi GPU Multi Node Training Job with Amazon SageMaker Debugger (SageMaker SDK)
- Profiling TensorFlow Multi GPU Multi Node Training Job with Amazon SageMaker Debugger (SageMaker API)
- How to identify low GPU utilization due to small batch size
- Identify a CPU bottleneck caused by a callback process with Amazon SageMaker Debugger