Profiling TensorFlow Multi GPU Multi Node Training Job with Amazon SageMaker Debugger (SageMaker API)
This notebook walks you through creating a TensorFlow training job with the SageMaker Debugger profiling feature enabled. It creates a multi-GPU, multi-node training job using Horovod.
(Optional) Install SageMaker and SMDebug
To use the new Debugger profiling features released in December 2020, ensure that you have the latest versions of the Boto3, SageMaker, and SMDebug libraries installed. Use the following cell (switch ``install_needed`` to ``True``) to update the libraries and restart the Jupyter kernel to apply the updates.
[ ]:
import sys
import IPython
install_needed = False # should only be True once
if install_needed:
    print("installing deps and restarting kernel")
    !{sys.executable} -m pip install -U boto3 sagemaker smdebug
    IPython.Application.instance().kernel.do_shutdown(True)
1. Create a Training Job with Debugger Enabled
You will learn how to use the Boto3 SageMaker client's ``create_training_job()`` function to start a training job.
Start a SageMaker session and retrieve the current region and the default Amazon S3 bucket URI
[ ]:
import sagemaker
session = sagemaker.Session()
region = session.boto_region_name
bucket = session.default_bucket()
print(region, bucket)
Upload a training script to the S3 bucket
[ ]:
import boto3, tarfile
source = "source.tar.gz"
project = "debugger-boto3-profiling-test"
tar = tarfile.open(source, "w:gz")
tar.add("entry_point/tf-hvd-train.py")
tar.close()
s3 = boto3.client("s3")
s3.upload_file(source, bucket, project + "/" + source)
[ ]:
upload_file_path = f"s3://{bucket}/{project}/{source}"
print(upload_file_path)
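(Optional) As a quick sanity check, you can confirm the tarball exists in the bucket before launching the job. This is a minimal sketch using the standard Boto3 ``head_object`` call on the object uploaded above.
[ ]:
# Optional: confirm the tarball landed in the bucket before launching the job
response = s3.head_object(Bucket=bucket, Key=project + "/" + source)
print("Uploaded object size:", response["ContentLength"], "bytes")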
Create a Boto3 SageMaker client object
[ ]:
sm = boto3.Session(region_name=region).client("sagemaker")
Configure the request body of the ``create_training_job()`` function
The following parameters are required in the request body of the ``create_training_job()`` function.
- ``TrainingJobName`` - Specify a prefix or a full name if you want to modify it.
- ``HyperParameters`` - Set up the following items:
  - ``sagemaker_program`` and ``sagemaker_submit_directory`` - The S3 bucket URI of the training script. This enables SageMaker to read the training script from the URI and start a training job.
  - ``sagemaker_mpi`` options - Configure these key-value pairs to set up distributed training.
  - You can also add other hyperparameters for your model.
- ``AlgorithmSpecification`` - Specify ``TrainingImage``. In this example, an official TensorFlow DLC image is used. You can also use your own training container images here.
- ``RoleArn`` - To run the following cell, you must specify the right SageMaker execution role ARN that you want to use for training.
- The ``DebugHookConfig`` and ``DebugRuleConfigurations`` are preconfigured for watching loss values and detecting a loss-not-decreasing issue.
- The ``ProfilerConfig`` and ``ProfilerRuleConfigurations`` are preconfigured to collect system and framework metrics, initiate all profiling rules, and create a Debugger profiling report.
Important: For ``DebugRuleConfigurations`` and ``ProfilerRuleConfigurations``, to run the following cell, you must specify the right Debugger rule image URI from `Amazon SageMaker Debugger Registry URLs for Built-in Rule Evaluators <https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-docker-images-rules.html>`__. The ``sagemaker.debugger.get_rule_container_image_uri(region)`` function retrieves the Debugger rule image automatically. For example:

- If you are in us-east-1, the right image URI is 503895931360.dkr.ecr.us-east-1.amazonaws.com/sagemaker-debugger-rules:latest.
- If you are in us-west-2, the right image URI is 895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest.
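Before launching the job, you can optionally verify which rule image URI this helper resolves for your region:
[ ]:
from sagemaker.debugger import get_rule_container_image_uri

# Resolves the Debugger rules image for the current region,
# e.g. 503895931360.dkr.ecr.us-east-1.amazonaws.com/sagemaker-debugger-rules:latest in us-east-1
print(get_rule_container_image_uri(region))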
[ ]:
import datetime
training_job_name = "profiler-boto3-" + datetime.datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
sm.create_training_job(
    TrainingJobName=training_job_name,
    HyperParameters={
        "sagemaker_program": "entry_point/tf-hvd-train.py",
        "sagemaker_submit_directory": "s3://" + bucket + "/" + project + "/" + source,
        "sagemaker_mpi_custom_mpi_options": "-verbose -x HOROVOD_TIMELINE=./hvd_timeline.json -x NCCL_DEBUG=INFO -x OMPI_MCA_btl_vader_single_copy_mechanism=none",
        "sagemaker_mpi_enabled": "true",
        "sagemaker_mpi_num_of_processes_per_host": "4",
    },
    AlgorithmSpecification={
        "TrainingImage": "763104351884.dkr.ecr."
        + region
        + ".amazonaws.com/tensorflow-training:2.4.1-gpu-py37-cu110-ubuntu18.04",
        "TrainingInputMode": "File",
        "EnableSageMakerMetricsTimeSeries": False,
    },
    # You must specify your SageMaker execution role ARN here
    RoleArn=sagemaker.get_execution_role(),
    OutputDataConfig={"S3OutputPath": "s3://" + bucket + "/" + project + "/output"},
    ResourceConfig={"InstanceType": "ml.p3.8xlarge", "InstanceCount": 2, "VolumeSizeInGB": 30},
    StoppingCondition={"MaxRuntimeInSeconds": 86400},
    DebugHookConfig={
        "S3OutputPath": "s3://" + bucket + "/" + project + "/debug-output",
        "CollectionConfigurations": [
            {"CollectionName": "losses", "CollectionParameters": {"train.save_interval": "50"}}
        ],
    },
    DebugRuleConfigurations=[
        {
            "RuleConfigurationName": "LossNotDecreasing",
            # You must specify the correct image URI from https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-docker-images-rules.html
            "RuleEvaluatorImage": sagemaker.debugger.get_rule_container_image_uri(region),
            "RuleParameters": {"rule_to_invoke": "LossNotDecreasing"},
        }
    ],
    ProfilerConfig={
        "S3OutputPath": "s3://" + bucket + "/" + project + "/profiler-output",
        "ProfilingIntervalInMilliseconds": 500,
        "ProfilingParameters": {
            "DataloaderProfilingConfig": '{"StartStep": 5, "NumSteps": 3, "MetricsRegex": ".*"}',
            "DetailedProfilingConfig": '{"StartStep": 5, "NumSteps": 3}',
            "PythonProfilingConfig": '{"StartStep": 5, "NumSteps": 3, "ProfilerName": "cprofile", "cProfileTimer": "total_time"}',
            "LocalPath": "/opt/ml/output/profiler/",  # Optional. Local path for profiling outputs
        },
    },
    ProfilerRuleConfigurations=[
        {
            "RuleConfigurationName": "ProfilerReport",
            # You must specify the correct image URI from https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-docker-images-rules.html
            "RuleEvaluatorImage": sagemaker.debugger.get_rule_container_image_uri(region),
            "RuleParameters": {"rule_to_invoke": "ProfilerReport"},
        }
    ],
)
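The ``create_training_job()`` call returns as soon as the job is created. To confirm that the job launched, you can poll its status with the standard ``describe_training_job`` API:
[ ]:
# Poll the status of the job just created; it progresses from Starting
# through Downloading and Training while instances are provisioned.
description = sm.describe_training_job(TrainingJobName=training_job_name)
print(description["TrainingJobStatus"], "-", description.get("SecondaryStatus"))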
2. Analyze Profiling Data
Install the SMDebug client library to use Debugger analysis tools
[ ]:
import importlib
import subprocess
import sys

def import_or_install(package):
    try:
        importlib.import_module(package)
    except ImportError:
        # pip.main was removed in pip 10; invoke pip through the interpreter instead
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])

import_or_install("smdebug")
Use SMDebug to retrieve saved output data and use analysis tools
While the training is still in progress, you can visualize the performance data in SageMaker Studio or in the notebook. Debugger provides utilities to plot system metrics in the form of timeline charts or heatmaps. Check out the notebook interactive_analysis_profiling_data.ipynb for more details. In the following code cell we plot the total CPU and GPU utilization as time series charts. To visualize other metrics, such as I/O, memory, and network, you simply need to extend the lists passed to ``select_dimensions`` and ``select_events``.
[ ]:
from smdebug.profiler.analysis.notebook_utils.training_job import TrainingJob
tj = TrainingJob(training_job_name, region)
tj.wait_for_sys_profiling_data_to_be_available()
[ ]:
from smdebug.profiler.analysis.notebook_utils.timeline_charts import TimelineCharts
system_metrics_reader = tj.get_systems_metrics_reader()
system_metrics_reader.refresh_event_file_list()
view_timeline_charts = TimelineCharts(
    system_metrics_reader,
    framework_metrics_reader=None,
    select_dimensions=["CPU", "GPU"],
    select_events=["total"],
)
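As mentioned above, you can extend the lists to visualize additional metrics. The following is a minimal sketch that adds the I/O dimension; the exact dimension names depend on the system metrics your job emits.
[ ]:
# Same chart with the I/O dimension added; extend select_dimensions and
# select_events further to cover memory and network metrics.
view_timeline_charts_io = TimelineCharts(
    system_metrics_reader,
    framework_metrics_reader=None,
    select_dimensions=["CPU", "GPU", "I/O"],
    select_events=["total"],
)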
3. Download Debugger Profiling Report
The profiling report rule creates an HTML report ``profiler-report.html`` with a summary of the built-in rule analysis and recommendations for next steps. You can find this report in your S3 bucket.
[ ]:
rule_output_path = (
    f"s3://{bucket}/{project}/output/{training_job_name}/rule-output/ProfilerReport/profiler-output/"
)
[ ]:
! aws s3 ls {rule_output_path} --recursive
[ ]:
! aws s3 cp {rule_output_path} . --recursive
[ ]:
from IPython.display import FileLink
display("Click link below to view the profiler report", FileLink("profiler-report.html"))
Note: If you are using JupyterLab, make sure you click ``Trust HTML`` at the top left corner after you open the report.