Profiling TensorFlow Single GPU Single Node Training Job with Amazon SageMaker Debugger



This notebook walks you through creating a TensorFlow training job with the SageMaker Debugger profiling feature enabled. The job runs single-GPU training on a single node.

Install sagemaker and smdebug

To use the new Debugger profiling features, ensure that you have the latest versions of SageMaker and SMDebug SDKs installed. The following cell updates the libraries and restarts the Jupyter kernel to apply the updates.

[ ]:
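import sys
import IPython

install_needed = True  # should only be True once
if install_needed:
    print("installing deps and restarting kernel")
    !{sys.executable} -m pip install -U sagemaker
    !{sys.executable} -m pip install -U smdebug
    # Restart the kernel so that the upgraded packages are loaded
    IPython.Application.instance().kernel.do_shutdown(True)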

1. Create a Training Job with Profiling Enabled

You will use the standard SageMaker Estimator API for TensorFlow to create training jobs. To enable profiling, create a ProfilerConfig object and pass it to the profiler_config parameter of the TensorFlow estimator.

Define hyperparameters

Define hyperparameters such as the number of epochs, the batch size, and whether to apply data augmentation. Increasing the batch size can increase system utilization, but it may also cause CPU bottlenecks, because preprocessing a large batch with augmentation is computationally heavy. You can disable data_augmentation to see its impact on system utilization.

For demonstration purposes, the following hyperparameters are chosen to increase CPU usage, which leads to GPU starvation.

[ ]:
hyperparameters = {"epoch": 5, "batch_size": 64, "data_augmentation": True}
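
SageMaker passes these hyperparameters to the entry point script as command-line arguments. The following is a minimal sketch of how a script such as train_tf.py might read them, assuming it uses argparse. The helper names str_to_bool and parse_hyperparameters are hypothetical, and this is not the actual content of train_tf.py.

```python
import argparse


def str_to_bool(value):
    """Interpret strings such as "True" or "false" as booleans."""
    text = str(value).strip().lower()
    if text in ("true", "1", "yes"):
        return True
    if text in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def parse_hyperparameters(argv=None):
    """Parse the three hyperparameters defined in the notebook (hypothetical helper)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epoch", type=int, default=5)
    parser.add_argument("--batch_size", type=int, default=64)
    parser.add_argument("--data_augmentation", type=str_to_bool, default=True)
    # Ignore any additional arguments that the training platform may append.
    args, _unknown = parser.parse_known_args(argv)
    return args


# Simulate the arguments that correspond to the hyperparameters dictionary above.
args = parse_hyperparameters(
    ["--epoch", "5", "--batch_size", "64", "--data_augmentation", "True"]
)
print(args.epoch, args.batch_size, args.data_augmentation)
```

Values arrive as strings, so the boolean flag needs explicit conversion; a bare type=bool would treat the string "False" as true.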

Configure rules

We specify the following rules:

- loss_not_decreasing: checks whether the loss is decreasing, and triggers if the loss has not decreased by a certain percentage in the last few iterations.
- LowGPUUtilization: checks whether the GPU is underutilized.
- ProfilerReport: runs the entire set of performance rules and creates a final output report with further insights and recommendations.

[ ]:
from sagemaker.debugger import Rule, ProfilerRule, rule_configs

rules = [
    Rule.sagemaker(rule_configs.loss_not_decreasing()),
    ProfilerRule.sagemaker(rule_configs.LowGPUUtilization()),
    ProfilerRule.sagemaker(rule_configs.ProfilerReport()),
]
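
To make the idea behind loss_not_decreasing concrete, the following plain-Python sketch checks whether a loss series has dropped by a minimum percentage over the last few recorded values. It only illustrates the concept and is not the implementation of the built-in Debugger rule. The function name, the parameters num_steps and min_decrease_percent, and the sample loss values are assumptions made for this example.

```python
def loss_not_decreasing(losses, num_steps=3, min_decrease_percent=1.0):
    """Return True if the loss failed to drop by at least
    min_decrease_percent over the last num_steps recorded values."""
    if len(losses) <= num_steps:
        return False  # not enough history to decide
    previous = losses[-num_steps - 1]
    latest = losses[-1]
    if previous == 0:
        # Avoid dividing by zero; any non-negative latest loss is no improvement.
        return latest >= previous
    decrease_percent = (previous - latest) / abs(previous) * 100.0
    return decrease_percent < min_decrease_percent


healthy = [2.0, 1.5, 1.1, 0.8, 0.6]          # loss keeps dropping
stalled = [2.0, 1.0, 0.995, 0.994, 0.993]    # loss has plateaued

print(loss_not_decreasing(healthy))
print(loss_not_decreasing(stalled))
```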

Specify a profiler configuration

The following configuration captures system metrics every 500 milliseconds. The system metrics include CPU and GPU utilization, CPU and GPU memory utilization, as well as I/O and network usage.

Debugger will also capture detailed framework profiling information for 10 steps, starting at step 5. This information includes Horovod metrics, data loading, preprocessing, and operators running on CPU and GPU.

[ ]:
from sagemaker.debugger import ProfilerConfig, FrameworkProfile

profiler_config = ProfilerConfig(
    system_monitor_interval_millis=500,
    framework_profile_params=FrameworkProfile(
        local_path="/opt/ml/output/profiler/", start_step=5, num_steps=10
    ),
)
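
As a quick check of what these settings imply, the sketch below works out the profiled step window and the number of system metric samples per minute. It assumes that profiling covers num_steps consecutive steps beginning at start_step; samples_for is a hypothetical helper introduced for this example.

```python
start_step = 5
num_steps = 10
system_monitor_interval_millis = 500

# Assumption: the window is num_steps consecutive steps beginning at start_step.
profiled_steps = list(range(start_step, start_step + num_steps))
print(f"Profiled steps: {profiled_steps[0]} through {profiled_steps[-1]}")


def samples_for(duration_seconds, interval_millis):
    """Number of system metric samples collected over a duration (hypothetical helper)."""
    return int(duration_seconds * 1000 // interval_millis)


print(f"Samples per minute: {samples_for(60, system_monitor_interval_millis)}")
```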

Get the image URI

The image that we will use depends on the region in which you are running this notebook.

[ ]:
import boto3

session = boto3.session.Session()
region = session.region_name

image_uri = f"763104351884.dkr.ecr.{region}.amazonaws.com/tensorflow-training:2.3.1-gpu-py37-cu110-ubuntu18.04"
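
Note that session.region_name is None when no default region is configured, which would silently produce an invalid URI. A minimal sketch of a guard, using a hypothetical helper named build_image_uri that wraps the same string formatting as the cell above:

```python
ACCOUNT_ID = "763104351884"
IMAGE_TAG = "2.3.1-gpu-py37-cu110-ubuntu18.04"


def build_image_uri(region, repository="tensorflow-training", tag=IMAGE_TAG):
    """Build the container image URI for a region, failing early if none is set."""
    if not region:
        raise ValueError("No AWS region is configured; set one before building the image URI.")
    return f"{ACCOUNT_ID}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


print(build_image_uri("us-west-2"))
```

In the notebook, you would pass the region variable from the previous cell instead of the literal "us-west-2".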

Define estimator

To enable profiling, pass the Debugger profiling configuration (profiler_config), a list of Debugger rules (rules), and the image URI (image_uri) to the estimator. Debugger then enables monitoring and profiling when the SageMaker estimator requests the training job.

[ ]:
import sagemaker
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    role=sagemaker.get_execution_role(),
    image_uri=image_uri,
    instance_count=1,
    instance_type="ml.p3.8xlarge",
    entry_point="train_tf.py",
    source_dir="entry_point",
    profiler_config=profiler_config,
    hyperparameters=hyperparameters,
    rules=rules,
)

Start training job

Calling estimator.fit() with the wait=False argument starts the training job in the background, so you can proceed to run the dashboard or analysis notebooks while the job is running.

[ ]:
estimator.fit(wait=False)
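
Because fit() returns immediately, you may want to check on the job later. The following sketch polls a job status until it reaches a terminal state. The function wait_for_training_job and its describe argument are hypothetical; a fake describe function is used here so that the example runs without AWS access.

```python
import time

# Statuses after which the job will not change state again.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(describe, job_name, poll_seconds=30, max_polls=120):
    """Poll describe(job_name) until a terminal status is seen or max_polls is reached.

    describe must return a dict containing a "TrainingJobStatus" key.
    Returns the last status observed.
    """
    status = None
    for _ in range(max_polls):
        status = describe(job_name)["TrainingJobStatus"]
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    return status


# Stand-in for a real API call, so the sketch is runnable offline.
responses = iter(["InProgress", "InProgress", "Completed"])


def fake_describe(job_name):
    return {"TrainingJobStatus": next(responses)}


final_status = wait_for_training_job(fake_describe, "example-job", poll_seconds=0)
print(final_status)
```

In a real session, describe could wrap the describe_training_job call of a boto3 SageMaker client, passing the job name as TrainingJobName.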

2. Analyze Profiling Data

Copy outputs of the following cell (training_job_name and region) to run the analysis notebooks profiling_generic_dashboard.ipynb, analyze_performance_bottlenecks.ipynb, and profiling_interactive_analysis.ipynb.

[ ]:
training_job_name = estimator.latest_training_job.name
print(f"Training job name: {training_job_name}")
print(f"Region: {region}")

While the training is still in progress, you can visualize the performance data in SageMaker Studio or in the notebook. Debugger provides utilities to plot system metrics as timeline charts or heatmaps. Check out the notebook profiling_interactive_analysis.ipynb for more details. In the following code cells, we plot the total CPU and GPU utilization as time series charts. To visualize other metrics such as I/O, memory, or network, extend the lists passed to select_dimensions and select_events.

[ ]:
from smdebug.profiler.analysis.notebook_utils.training_job import TrainingJob

tj = TrainingJob(training_job_name, region)
tj.wait_for_sys_profiling_data_to_be_available()

[ ]:
from smdebug.profiler.analysis.notebook_utils.timeline_charts import TimelineCharts

system_metrics_reader = tj.get_systems_metrics_reader()
system_metrics_reader.refresh_event_file_list()

view_timeline_charts = TimelineCharts(
    system_metrics_reader,
    framework_metrics_reader=None,
    select_dimensions=["CPU", "GPU"],
    select_events=["total"],
)

3. Download Debugger Profiling Report

The ProfilerReport rule creates an HTML report, profiler-report.html, with a summary of the built-in rules and recommendations for next steps. You can find this report in your S3 bucket.

[ ]:
rule_output_path = estimator.output_path + estimator.latest_training_job.job_name + "/rule-output"
print(f"You will find the profiler report in {rule_output_path}")
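
Plain string concatenation works when output_path ends with a slash. The sketch below shows a slightly more defensive way to build the same path and to split it into a bucket and key prefix for use with an S3 client. The helpers rule_output_uri and split_s3_uri, and the bucket and job names, are hypothetical.

```python
def rule_output_uri(output_path, job_name):
    """Join the output path and job name, tolerating a missing or extra trailing slash."""
    return f"{output_path.rstrip('/')}/{job_name}/rule-output"


def split_s3_uri(uri):
    """Split an s3:// URI into (bucket, key_prefix)."""
    if not uri.startswith("s3://"):
        raise ValueError(f"not an S3 URI: {uri!r}")
    bucket, _, prefix = uri[len("s3://"):].partition("/")
    return bucket, prefix


example_uri = rule_output_uri("s3://example-bucket/", "example-job")
print(example_uri)
print(split_s3_uri(example_uri))
```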

For more information about how to download and open the Debugger profiling report, see SageMaker Debugger Profiling Report in the SageMaker developer guide.
