Managed Spot Training for XGBoost
This notebook shows usage of SageMaker Managed Spot infrastructure for XGBoost training. Below we show how Spot instances can be used for the ‘algorithm mode’ and ‘script mode’ training methods with the XGBoost container.
Managed Spot Training uses Amazon EC2 Spot Instances to run training jobs instead of On-Demand Instances. You can specify which training jobs use Spot Instances and a stopping condition that specifies how long Amazon SageMaker waits for a job to run using Amazon EC2 Spot Instances.
This notebook was tested in Amazon SageMaker Studio on an ml.t3.medium instance with the Python 3 (Data Science) kernel.
In this notebook we perform the same XGBoost training as in the original notebook. See the original notebook for more details on the data.
Set up variables and define functions
[ ]:
!pip install --upgrade sagemaker
[ ]:
%%time
import io
import os
import boto3
import sagemaker
role = sagemaker.get_execution_role()
region = boto3.Session().region_name
# S3 bucket for saving code and model artifacts.
# Feel free to specify a different bucket here if you wish.
bucket = sagemaker.Session().default_bucket()
prefix = "sagemaker/DEMO-xgboost-spot"
# Customize the bucket and prefix above to choose where the data is stored.
Fetching the dataset
[ ]:
%%time
s3 = boto3.client("s3")
# Load the dataset
FILE_DATA = "abalone"
s3.download_file(
    f"sagemaker-example-files-prod-{region}",
    "datasets/tabular/uci_abalone/abalone.libsvm",
    FILE_DATA,
)
sagemaker.Session().upload_data(FILE_DATA, bucket=bucket, key_prefix=prefix + "/train")
sagemaker.Session().upload_data(FILE_DATA, bucket=bucket, key_prefix=prefix + "/validation")
Obtaining the latest XGBoost container
We obtain the new container by specifying the framework version (1.7-1). This version specifies the upstream XGBoost framework version (1.7) and an additional SageMaker version (1). If you have an existing XGBoost workflow based on a previous container (1.0-1, 1.2-2, 1.3-1 or 1.5-1), this would be the only change necessary to get the same workflow working with the new container.
[ ]:
container = sagemaker.image_uris.retrieve("xgboost", region, "1.7-1")
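To make the version naming concrete, here is a minimal sketch that splits a tag such as 1.7-1 into the upstream XGBoost version and the SageMaker version. The helper name `split_framework_version` is our own illustration and is not part of the SageMaker SDK.

```python
from typing import Tuple


def split_framework_version(version: str) -> Tuple[str, str]:
    """Split a tag like '1.7-1' into (upstream XGBoost version, SageMaker version).

    Hypothetical helper for illustration only; not a SageMaker SDK function.
    """
    upstream, _, sagemaker_version = version.partition("-")
    if not upstream or not sagemaker_version:
        raise ValueError(f"expected '<xgboost>-<sagemaker>' but got {version!r}")
    return upstream, sagemaker_version


# The container versions mentioned in the text above.
for tag in ["1.0-1", "1.2-2", "1.3-1", "1.5-1", "1.7-1"]:
    upstream, sagemaker_version = split_framework_version(tag)
    print(f"{tag}: upstream XGBoost {upstream}, SageMaker version {sagemaker_version}")
```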
Training the XGBoost model
After setting the training parameters, we kick off training and poll for status until training is completed, which in this example takes a few minutes.
To run training in algorithm mode, we construct a sagemaker.estimator.Estimator with the container retrieved above. It accepts several constructor arguments:
image_uri: The XGBoost container image to run (passed as the first positional argument below).
role: Role ARN
hyperparameters: A dictionary of hyperparameters passed to the training job.
instance_type (optional): The type of SageMaker instances for training. Note: This particular mode does not currently support training on GPU instance types.
sagemaker_session (optional): The session used to train on SageMaker.
The script-mode XGBoost estimator used later in this notebook additionally takes entry_point, the path to the Python script SageMaker runs for training and prediction.
[ ]:
hyperparameters = {
    "max_depth": "5",
    "eta": "0.2",
    "gamma": "4",
    "min_child_weight": "6",
    "subsample": "0.7",
    "objective": "reg:squarederror",
    "num_round": "50",
    "verbosity": "2",
}
instance_type = "ml.m5.4xlarge"
output_path = "s3://{}/{}/{}/output".format(bucket, prefix, "abalone-xgb")
content_type = "libsvm"
If Spot instances are used, the training job can be interrupted, causing it to take longer to start or finish. If a training job is interrupted, a checkpointed snapshot can be used to resume from a previously saved point and can save training time (and cost).
To enable checkpointing for Managed Spot Training using SageMaker XGBoost we need to configure three things:
use_spot_instances: Enable this constructor arg, a simple self-explanatory boolean.
max_wait: Set this constructor arg, an int (in seconds) representing the amount of time you are willing to wait for Spot infrastructure to become available. Some instance types are harder to get at Spot prices and you may have to wait longer. You are not charged for time spent waiting for Spot infrastructure to become available; you’re only charged for actual compute time spent once Spot instances have been successfully procured.
checkpoint_s3_uri: Set up this constructor arg to tell SageMaker the S3 location where checkpoints are saved. While not strictly necessary, checkpointing is highly recommended for Managed Spot Training jobs because Spot instances can be interrupted with short notice, and resuming from the last checkpoint ensures you don’t lose the progress made before the interruption.
Feel free to toggle the use_spot_instances variable to see the effect of running the same job using regular (a.k.a. “On Demand”) infrastructure.
Note that max_wait can be set if and only if use_spot_instances is enabled, and it must be greater than or equal to max_run.
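The two rules in the note above can be checked locally before submitting a job. The following is a minimal sketch under our own assumptions: `check_spot_settings` is a hypothetical helper, not an SDK function, and it only encodes the constraints stated in this notebook (the service may enforce additional ones).

```python
from typing import Optional


def check_spot_settings(
    use_spot_instances: bool, max_run: int, max_wait: Optional[int]
) -> None:
    """Raise ValueError if the settings violate the two rules described above.

    Hypothetical helper for illustration only; not part of the SageMaker SDK.
    """
    # Rule 1: max_wait may only be set when Spot training is enabled.
    if not use_spot_instances and max_wait is not None:
        raise ValueError("max_wait can only be set when use_spot_instances is True")
    # Rule 2: when set, max_wait must be at least as large as max_run.
    if max_wait is not None and max_wait < max_run:
        raise ValueError(f"max_wait ({max_wait}) must be >= max_run ({max_run})")


# The same values that the training cell in this notebook uses.
use_spot_instances = True
max_run = 3600
max_wait = 7200 if use_spot_instances else None
check_spot_settings(use_spot_instances, max_run, max_wait)
print("Spot settings are consistent")
```

Deriving max_wait from use_spot_instances, as the training cell does, keeps the first rule satisfied automatically when you toggle the flag.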
[ ]:
import time
from sagemaker.inputs import TrainingInput
job_name = "DEMO-xgboost-spot-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
print("Training job", job_name)
use_spot_instances = True
max_run = 3600
max_wait = 7200 if use_spot_instances else None
checkpoint_s3_uri = (
    "s3://{}/{}/checkpoints/{}".format(bucket, prefix, job_name) if use_spot_instances else None
)
print("Checkpoint path:", checkpoint_s3_uri)
estimator = sagemaker.estimator.Estimator(
    container,
    role,
    hyperparameters=hyperparameters,
    instance_count=1,
    instance_type=instance_type,
    volume_size=5,  # 5 GB
    output_path=output_path,
    sagemaker_session=sagemaker.Session(),
    use_spot_instances=use_spot_instances,
    max_run=max_run,
    max_wait=max_wait,
    checkpoint_s3_uri=checkpoint_s3_uri,
)
train_input = TrainingInput(
    s3_data="s3://{}/{}/{}".format(bucket, prefix, "train"), content_type=content_type
)
estimator.fit({"train": train_input}, job_name=job_name)
Savings
Towards the end of the job you should see two lines of output printed:
Training seconds: X (the actual compute time your training job spent)
Billable seconds: Y (the time you will be billed for after Spot discounting is applied)
If you enabled use_spot_instances, you should see a notable difference between X and Y, signifying the cost savings you get from choosing Managed Spot Training. This should be reflected in an additional line:
Managed Spot Training savings: (1-Y/X)*100 %
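As a quick worked example of the savings formula, the sketch below computes (1-Y/X)*100 for a pair of values. The function name `spot_savings_percent` and the sample numbers are our own illustration, not measured results from this job.

```python
def spot_savings_percent(training_seconds: float, billable_seconds: float) -> float:
    """Return the Managed Spot Training savings, (1 - Y/X) * 100, as a percentage."""
    if training_seconds <= 0:
        raise ValueError("training_seconds must be positive")
    return (1 - billable_seconds / training_seconds) * 100


# Illustrative numbers only: X = training seconds, Y = billable seconds.
x, y = 120, 36
print(f"Managed Spot Training savings: {spot_savings_percent(x, y):.1f}%")
```

If you prefer to read X and Y programmatically rather than from the log, the DescribeTrainingJob response exposes them as TrainingTimeInSeconds and BillableTimeInSeconds.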
Train with Automatic Model Tuning (HPO) and Spot Training enabled
You could also train with Amazon SageMaker Automatic Model Tuning. AMT, also known as hyperparameter tuning, finds the best version of a model by running many training jobs on your dataset using the algorithm and ranges of hyperparameters that you specify. It then chooses the hyperparameter values that result in a model that performs the best, as measured by a metric that you choose. We will use a HyperparameterTuner object to interact with Amazon SageMaker hyperparameter tuning APIs.
The code sample below shows how to use the HyperparameterTuner and Spot Training together.
[ ]:
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter
# You can select from the hyperparameters supported by the model, and configure
# ranges of values to be searched for training the optimal model. See
# https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-define-ranges.html
hyperparameter_ranges = {
    "max_depth": IntegerParameter(0, 10, scaling_type="Auto"),
    "num_round": IntegerParameter(1, 4000, scaling_type="Auto"),
    "alpha": ContinuousParameter(0, 2, scaling_type="Auto"),
    "subsample": ContinuousParameter(0.5, 1, scaling_type="Auto"),
    "min_child_weight": ContinuousParameter(0, 120, scaling_type="Auto"),
    "gamma": ContinuousParameter(0, 5, scaling_type="Auto"),
    "eta": ContinuousParameter(0.1, 0.5, scaling_type="Auto"),
}
# Increase the total number of training jobs run by AMT, for increased accuracy (and training time).
max_jobs = 6
# Change parallel training jobs run by AMT to reduce total training time, constrained by your account limits.
# if max_jobs=max_parallel_jobs then Bayesian search turns to Random.
max_parallel_jobs = 2
hp_tuner = HyperparameterTuner(
    estimator,
    "validation:rmse",
    hyperparameter_ranges,
    max_jobs=max_jobs,
    max_parallel_jobs=max_parallel_jobs,
    objective_type="Minimize",
    base_tuning_job_name=job_name,
)
# Launch a SageMaker Tuning job to search for the best hyperparameters
# In this case, the tuner requires a `validation` channel to emit the validation:rmse metric.
# Since we only created a `train` channel, we re-use it for validation.
hp_tuner.fit({"train": train_input, "validation": train_input})
Enabling checkpointing for script mode
An additional mode of operation is to run customizable scripts as part of the training and inference jobs. See this notebook for details on how to set up script mode.
Here we highlight the specific changes that would enable checkpointing and use Spot instances.
Checkpointing in the framework mode for SageMaker XGBoost can be performed using two convenient functions:
save_checkpoint: This returns a callback function that performs checkpointing of the model for each round. It is passed to XGBoost as part of the callbacks argument of xgb.train (https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.train).
load_checkpoint: This is used to load existing checkpoints to ensure training resumes from where it previously stopped.
Both functions take the checkpoint directory as input, which in the below example is set to /opt/ml/checkpoints. The primary arguments that change for the xgb.train call are:
xgb_model: This refers to the previous checkpoint (saved from a previously run partial job) obtained by load_checkpoint. This would be None if no previous checkpoint is available.
callbacks: This contains a function that performs the checkpointing.
The updated script looks like the following.
CHECKPOINTS_DIR = '/opt/ml/checkpoints' # default location for Checkpoints
callbacks = [save_checkpoint(CHECKPOINTS_DIR)]
prev_checkpoint, n_iterations_prev_run = load_checkpoint(CHECKPOINTS_DIR)
bst = xgb.train(
    params=train_hp,
    dtrain=dtrain,
    evals=watchlist,
    num_boost_round=(args.num_round - n_iterations_prev_run),
    xgb_model=prev_checkpoint,
    callbacks=callbacks,
)
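To illustrate the resume logic in the script above, here is a simplified, self-contained sketch of what a load step conceptually does: find the newest checkpoint in a directory and work out how many boosting rounds remain. The file naming scheme (`checkpoint.<iteration>`) and the helper names are assumptions made for this illustration; this is not the implementation of the container's load_checkpoint.

```python
import os
import re
import tempfile
from typing import Optional, Tuple

# Hypothetical naming scheme for this sketch: one file per zero-based round.
CHECKPOINT_PATTERN = re.compile(r"^checkpoint\.(\d+)$")


def find_latest_checkpoint(checkpoint_dir: str) -> Tuple[Optional[str], int]:
    """Return (path, rounds_completed) for the newest checkpoint, or (None, 0)."""
    if not os.path.isdir(checkpoint_dir):
        return None, 0
    best_path, best_iteration = None, -1
    for name in os.listdir(checkpoint_dir):
        match = CHECKPOINT_PATTERN.match(name)
        if match and int(match.group(1)) > best_iteration:
            best_iteration = int(match.group(1))
            best_path = os.path.join(checkpoint_dir, name)
    # Iterations are zero-based, so checkpoint N means N + 1 rounds are done.
    return best_path, best_iteration + 1


def remaining_rounds(num_round: int, rounds_completed: int) -> int:
    """Number of boosting rounds still to run after resuming."""
    return max(num_round - rounds_completed, 0)


with tempfile.TemporaryDirectory() as checkpoint_dir:
    # Fresh job: no checkpoint yet, so all rounds remain.
    print(find_latest_checkpoint(checkpoint_dir))
    # Simulate a job interrupted after 20 of 50 rounds.
    for iteration in range(20):
        open(os.path.join(checkpoint_dir, f"checkpoint.{iteration}"), "w").close()
    path, done = find_latest_checkpoint(checkpoint_dir)
    print(os.path.basename(path), done, remaining_rounds(50, done))
```

The same arithmetic appears in the script as num_boost_round=(args.num_round - n_iterations_prev_run): a resumed job trains only the rounds that the interrupted job did not finish.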
Using the SageMaker XGBoost Estimator
The XGBoost estimator class in the SageMaker Python SDK allows us to run that script as a training job on the Amazon SageMaker managed training infrastructure. We’ll also pass the estimator our IAM role, the type of instance we want to use, and a dictionary of the hyperparameters that we want to pass to our script.
[ ]:
from sagemaker.xgboost.estimator import XGBoost
job_name = "DEMO-xgboost-regression-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
print("Training job", job_name)
checkpoint_s3_uri = (
    "s3://{}/{}/checkpoints/{}".format(bucket, prefix, job_name) if use_spot_instances else None
)
print("Checkpoint path:", checkpoint_s3_uri)
xgb_script_mode_estimator = XGBoost(
    entry_point="abalone.py",
    hyperparameters=hyperparameters,
    role=role,
    instance_count=1,
    instance_type=instance_type,
    framework_version="1.7-1",
    output_path="s3://{}/{}/{}/output".format(bucket, prefix, "xgboost-script-mode"),
    use_spot_instances=use_spot_instances,
    max_run=max_run,
    max_wait=max_wait,
    checkpoint_s3_uri=checkpoint_s3_uri,
)
Training is as simple as calling fit on the Estimator. This will start a SageMaker Training job that will download the data, invoke the entry point code (in the provided script file), and save any model artifacts that the script creates. In this case, the script requires a train and a validation channel. Since we only created a train channel, we re-use it for validation.
[ ]:
xgb_script_mode_estimator.fit({"train": train_input, "validation": train_input}, job_name=job_name)
As previously stated, the estimator can also be passed to the HyperparameterTuner object to interact with the Amazon SageMaker hyperparameter tuning APIs and create a hyperparameter tuning job. The hyperparameters are then tuned automatically, which in most cases results in a more accurate model.
[ ]:
hp_tuner = HyperparameterTuner(
    xgb_script_mode_estimator,
    "validation:rmse",
    hyperparameter_ranges,
    max_jobs=max_jobs,
    max_parallel_jobs=max_parallel_jobs,
    objective_type="Minimize",
    base_tuning_job_name=job_name,
)
# Launch a SageMaker Tuning job to search for the best hyperparameters
# In this case, the tuner requires a `validation` channel to emit the validation:rmse metric.
# Since we only created a `train` channel, we re-use it for validation.
hp_tuner.fit({"train": train_input, "validation": train_input})