Training SageMaker Models using the Apache MXNet Module API on SageMaker Managed Spot Training
The example here is almost the same as Training and hosting SageMaker Models using the Apache MXNet Module API.
This notebook tackles the same problem with the same solution, but it has been modified to run on SageMaker Managed Spot infrastructure. SageMaker Managed Spot uses EC2 Spot Instances to run training at a lower cost.
Please read the original notebook and try it out to understand the ML use case and how it is solved. We will not cover that here.
First, set up variables and define functions
Again, we won’t explain the code below in detail. It has been lifted verbatim from Training and hosting SageMaker Models using the Apache MXNet Module API.
[ ]:
from sagemaker import get_execution_role
from sagemaker.session import Session
# S3 bucket for saving code and model artifacts.
# Feel free to specify a different bucket here if you wish.
bucket = Session().default_bucket()
# Location to save your custom code in tar.gz format.
custom_prefix = "DEMO-customcode/mxnet"
custom_code_upload_location = "s3://{}/{}".format(bucket, custom_prefix)
# Location where results of model training are saved.
model_artifacts_location = "s3://{}/artifacts".format(bucket)
# IAM execution role that gives SageMaker access to resources in your AWS account.
# We can use the SageMaker Python SDK to get the role from our notebook environment.
role = get_execution_role()
import boto3
s3 = boto3.client("s3")
s3.download_file(
f"sagemaker-example-files-prod-{boto3.session.Session().region_name}",
"datasets/image/MNIST/train/train-images-idx3-ubyte.gz",
"train-images.gz",
)
s3.download_file(
f"sagemaker-example-files-prod-{boto3.session.Session().region_name}",
"datasets/image/MNIST/train/train-labels-idx1-ubyte.gz",
"train-labels.gz",
)
s3.download_file(
f"sagemaker-example-files-prod-{boto3.session.Session().region_name}",
"datasets/image/MNIST/test/t10k-images-idx3-ubyte.gz",
"test-images.gz",
)
s3.download_file(
f"sagemaker-example-files-prod-{boto3.session.Session().region_name}",
"datasets/image/MNIST/test/t10k-labels-idx1-ubyte.gz",
"test-labels.gz",
)
s3.upload_file("train-images.gz", bucket, custom_prefix + "/train/images.gz")
s3.upload_file("train-labels.gz", bucket, custom_prefix + "/train/labels.gz")
s3.upload_file("test-images.gz", bucket, custom_prefix + "/test/images.gz")
s3.upload_file("test-labels.gz", bucket, custom_prefix + "/test/labels.gz")
train_data_location = "s3://{}/{}/train".format(bucket, custom_prefix)
test_data_location = "s3://{}/{}/test".format(bucket, custom_prefix)
Managed Spot Training with MXNet
For Managed Spot Training using MXNet we need to configure three things:
1. Enable the use_spot_instances constructor arg (set below from the train_use_spot_instances variable), a simple self-explanatory boolean.
2. Set the max_wait constructor arg (set below from train_max_wait). This is an int giving the time, in seconds, that you are willing to wait for the job, including time spent waiting for Spot infrastructure to become available. Some instance types are harder to get at Spot prices and you may have to wait longer. You are not charged for time spent waiting for Spot infrastructure to become available; you are only charged for actual compute time once Spot instances have been successfully procured.
3. Set up the checkpoint_s3_uri constructor arg. This arg tells SageMaker the S3 location in which to save checkpoints (assuming your algorithm has been modified to save checkpoints periodically). While not strictly necessary, checkpointing is highly recommended for Managed Spot Training jobs because Spot instances can be interrupted at short notice, and resuming from the latest checkpoint means you lose only the progress made since that checkpoint.
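To make the checkpointing point more concrete, here is a minimal sketch of how a training script might find the newest checkpoint to resume from. It assumes the script writes MXNet Module-style checkpoint files named `<prefix>-NNNN.params` into `/opt/ml/checkpoints`, the default local directory that SageMaker syncs with `checkpoint_s3_uri`. The helper name `latest_checkpoint_epoch` and the prefix `mnist` are hypothetical and are not taken from the original `mnist.py`.

```python
import os
import re

# Default local directory that SageMaker syncs with checkpoint_s3_uri.
CHECKPOINT_DIR = "/opt/ml/checkpoints"


def latest_checkpoint_epoch(checkpoint_dir=CHECKPOINT_DIR, prefix="mnist"):
    """Return the highest epoch among '<prefix>-NNNN.params' files, or None."""
    if not os.path.isdir(checkpoint_dir):
        return None
    pattern = re.compile(r"^" + re.escape(prefix) + r"-(\d{4})\.params$")
    epochs = []
    for name in os.listdir(checkpoint_dir):
        match = pattern.match(name)
        if match:
            epochs.append(int(match.group(1)))
    return max(epochs) if epochs else None
```

If the function returns an epoch, a Module-based script could plausibly reload that state with `mx.mod.Module.load(prefix, epoch)` and continue with `fit(..., begin_epoch=epoch)`; if it returns `None`, the job starts from scratch.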
Feel free to toggle the train_use_spot_instances variable to see the effect of running the same job using regular (a.k.a. “On Demand”) infrastructure.
Note that train_max_wait can be set if and only if train_use_spot_instances is enabled, and it must be greater than or equal to train_max_run.
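The constraint above can be checked before an estimator is constructed. The following is a small sketch and not part of the SageMaker SDK; the function name `check_spot_settings` is hypothetical.

```python
def check_spot_settings(use_spot_instances, max_run, max_wait):
    """Raise ValueError if the Spot settings break the rule described above."""
    if not use_spot_instances:
        # Without Spot, max_wait must be left unset.
        if max_wait is not None:
            raise ValueError("max_wait can only be set when Spot is enabled")
        return
    if max_wait is None:
        raise ValueError("max_wait must be set when Spot is enabled")
    if max_wait < max_run:
        raise ValueError("max_wait must be greater than or equal to max_run")


# Values matching the cell below: 3600 s of run time, 7200 s of total wait.
check_spot_settings(use_spot_instances=True, max_run=3600, max_wait=7200)
check_spot_settings(use_spot_instances=False, max_run=3600, max_wait=None)
```

Failing early in the notebook is cheaper than waiting for the training job request to be rejected.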
[ ]:
train_use_spot_instances = True
train_max_run = 3600
train_max_wait = 7200 if train_use_spot_instances else None
import uuid
checkpoint_suffix = str(uuid.uuid4())[:8]
checkpoint_s3_uri = (
"s3://{}/artifacts/mxnet-checkpoint-{}/".format(bucket, checkpoint_suffix)
if train_use_spot_instances
else None
)
[ ]:
from sagemaker.mxnet import MXNet
mnist_estimator = MXNet(
entry_point="mnist.py",
role=role,
output_path=model_artifacts_location,
code_location=custom_code_upload_location,
instance_count=1,
instance_type="ml.m4.xlarge",
framework_version="1.6.0",
py_version="py3",
distribution={"parameter_server": {"enabled": True}},
hyperparameters={"learning-rate": 0.1},
use_spot_instances=train_use_spot_instances,
max_run=train_max_run,
max_wait=train_max_wait,
checkpoint_s3_uri=checkpoint_s3_uri,
)
mnist_estimator.fit({"train": train_data_location, "test": test_data_location})
Savings
Towards the end of the job you should see two lines of output printed:
- Training seconds: X : This is the actual compute time your training job spent.
- Billable seconds: Y : This is the time you will be billed for after Spot discounting is applied.
If you enabled the train_use_spot_instances variable, you should see a notable difference between X and Y, signifying the cost savings you get from choosing Managed Spot Training. This should be reflected in an additional line:
- Managed Spot Training savings: (1-Y/X)*100 %
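The savings formula can also be computed directly. The sketch below assumes a response dict shaped like the output of the SageMaker `DescribeTrainingJob` API, which reports `TrainingTimeInSeconds` and `BillableTimeInSeconds`; the function names are hypothetical and the numbers are illustrative only, not results from a real job.

```python
def spot_savings_percent(training_seconds, billable_seconds):
    """Return the savings percentage, (1 - Y/X) * 100."""
    if training_seconds <= 0:
        raise ValueError("training_seconds must be positive")
    return (1 - billable_seconds / training_seconds) * 100


def savings_from_description(description):
    """Compute savings from a DescribeTrainingJob-style response dict."""
    return spot_savings_percent(
        description["TrainingTimeInSeconds"],
        description["BillableTimeInSeconds"],
    )


# Illustrative numbers only, not results from a real job.
example = {"TrainingTimeInSeconds": 200, "BillableTimeInSeconds": 60}
print(round(savings_from_description(example), 1))
```

For a real job, the dict could come from `boto3.client("sagemaker").describe_training_job(TrainingJobName=...)`, using the name of the job that `fit` started.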
Cleanup
[ ]:
# Only needed if an endpoint was deployed from this estimator; this notebook only runs training.
mnist_estimator.delete_endpoint()