Managed Spot Training for XGBoost

This notebook shows how to use SageMaker Managed Spot infrastructure for XGBoost training. It covers both the ‘algorithm mode’ and ‘script mode’ training methods with the XGBoost container.

Managed Spot Training uses Amazon EC2 Spot instances to run training jobs instead of On-Demand instances. You can specify which training jobs use Spot instances and a stopping condition that specifies how long Amazon SageMaker waits for a job to run using Amazon EC2 Spot instances.

This notebook was tested in Amazon SageMaker Studio on an ml.t3.medium instance with the Python 3 (Data Science) kernel.

In this notebook we perform the same XGBoost training as the original notebook this one is based on. See that notebook for more details on the data.

Set up variables and define functions

[ ]:
!pip3 install -U sagemaker
[ ]:

import io
import os
import boto3
import sagemaker

role = sagemaker.get_execution_role()
region = boto3.Session().region_name

# S3 bucket for saving code and model artifacts.
# Feel free to specify a different bucket here if you wish.
bucket = sagemaker.Session().default_bucket()
prefix = "sagemaker/DEMO-xgboost-spot"
# Customize this to the bucket where you would like to store the data.

Fetching the dataset

[ ]:
s3 = boto3.client("s3")
# Load the dataset
FILE_DATA = "abalone"
s3.download_file(
    "sagemaker-sample-files", "datasets/tabular/uci_abalone/abalone.libsvm", FILE_DATA
)
sagemaker.Session().upload_data(FILE_DATA, bucket=bucket, key_prefix=prefix + "/train")

Obtaining the latest XGBoost container

We obtain the new container by specifying the framework version (1.5-1). This version specifies the upstream XGBoost framework version (1.5) and an additional SageMaker version (1). If you have an existing XGBoost workflow based on the previous (1.0-1, 1.2-2 or 1.3-1) container, this would be the only change necessary to get the same workflow working with the new container.
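
To make the version format concrete, the short sketch below splits a framework version string into its two parts. The helper name split_framework_version is hypothetical and not part of the SageMaker SDK; it only illustrates how the string is read.

```python
def split_framework_version(version):
    """Split a version string such as '1.5-1' into its two parts.

    Hypothetical helper for illustration only; not part of the SageMaker SDK.
    Returns (upstream XGBoost version, SageMaker version).
    """
    upstream, _, sagemaker_part = version.partition("-")
    if not upstream or not sagemaker_part.isdigit():
        raise ValueError(f"unexpected version format: {version!r}")
    return upstream, int(sagemaker_part)


upstream_version, sagemaker_version = split_framework_version("1.5-1")
print(upstream_version, sagemaker_version)
```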

[ ]:
container = sagemaker.image_uris.retrieve("xgboost", region, "1.5-1")

Training the XGBoost model

After setting the training parameters, we kick off training and poll for status until training is completed, which in this example takes a few minutes.
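
By default, the estimator's fit call waits for the job and streams its logs, so explicit polling is usually not needed. For illustration only, here is a minimal sketch of what polling for a terminal status could look like. The function wait_for_training_job and its parameters are hypothetical; the status source is injected so the example runs without any AWS calls.

```python
import time


def wait_for_training_job(get_status, poll_seconds=30, sleep=time.sleep):
    """Call get_status() repeatedly until it returns a terminal status."""
    terminal_statuses = {"Completed", "Failed", "Stopped"}
    while True:
        status = get_status()
        if status in terminal_statuses:
            return status
        sleep(poll_seconds)


# Offline demonstration with a canned sequence of statuses (no AWS calls).
canned = iter(["InProgress", "InProgress", "Completed"])
final_status = wait_for_training_job(lambda: next(canned), sleep=lambda _: None)
print(final_status)
```

Against a real job, get_status could wrap the boto3 SageMaker client's describe_training_job call and return its TrainingJobStatus field.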

To run our training script on SageMaker, we construct a sagemaker.xgboost.estimator.XGBoost estimator, which accepts several constructor arguments:

  • entry_point: The path to the Python script SageMaker runs for training and prediction.

  • role: The ARN of the IAM role that SageMaker assumes for the training job.

  • hyperparameters: A dictionary passed to the train function as hyperparameters.

  • instance_type (optional; named train_instance_type in version 1 of the SageMaker Python SDK): The type of SageMaker instances for training. Note: This particular mode does not currently support training on GPU instance types.

  • sagemaker_session (optional): The session used to train on SageMaker.

[ ]:
hyperparameters = {
    "max_depth": "5",
    "eta": "0.2",
    "gamma": "4",
    "min_child_weight": "6",
    "subsample": "0.7",
    "objective": "reg:squarederror",
    "num_round": "50",
    "verbosity": "2",
}

instance_type = "ml.m5.2xlarge"
output_path = "s3://{}/{}/{}/output".format(bucket, prefix, "abalone-xgb")
content_type = "libsvm"

If Spot instances are used, the training job can be interrupted, causing it to take longer to start or finish. If a training job is interrupted, a checkpointed snapshot can be used to resume from a previously saved point and can save training time (and cost).

To enable checkpointing for Managed Spot Training using SageMaker XGBoost we need to configure three things. The argument names below are those of version 2 of the SageMaker Python SDK, which the code in this notebook uses; version 1 prefixed them with train_ (for example, train_use_spot_instances).

  1. Enable the use_spot_instances constructor arg - a self-explanatory boolean.

  2. Set the max_wait constructor arg - this is an int arg giving the amount of time, in seconds, you are willing to wait for the job to complete, including time spent waiting for Spot infrastructure to become available. Some instance types are harder to get at Spot prices and you may have to wait longer. You are not charged for time spent waiting for Spot infrastructure to become available; you’re only charged for actual compute time spent once Spot instances have been successfully procured.

  3. Set up a checkpoint_s3_uri constructor arg - this tells SageMaker the S3 location in which to save checkpoints. While not strictly necessary, checkpointing is highly recommended for Managed Spot Training jobs because Spot instances can be interrupted on short notice, and resuming from the last checkpoint ensures you don’t lose progress made before the interruption.

Feel free to toggle the use_spot_instances variable to see the effect of running the same job using regular (a.k.a. “On Demand”) infrastructure.

Note that max_wait can be set if and only if use_spot_instances is enabled, and it must be greater than or equal to max_run.
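
The constraint on the wait time can be checked locally before a job is submitted. The sketch below uses the variable names from the code cells (use_spot_instances, max_run, max_wait). The helper check_spot_settings is hypothetical and only mirrors the rule stated above; SageMaker performs its own validation when the job is created.

```python
def check_spot_settings(use_spot_instances, max_run, max_wait):
    """Raise ValueError if the settings violate the rule stated above.

    Hypothetical local check; not part of the SageMaker SDK.
    """
    if use_spot_instances:
        if max_wait is None:
            raise ValueError("max_wait must be set when Spot instances are enabled")
        if max_wait < max_run:
            raise ValueError("max_wait must be greater than or equal to max_run")
    elif max_wait is not None:
        raise ValueError("max_wait may only be set when Spot instances are enabled")


# The values used in this notebook pass the check.
check_spot_settings(use_spot_instances=True, max_run=3600, max_wait=7200)
# On-demand training leaves max_wait unset.
check_spot_settings(use_spot_instances=False, max_run=3600, max_wait=None)
print("settings are consistent")
```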

[ ]:
import time
from sagemaker.inputs import TrainingInput

job_name = "DEMO-xgboost-spot-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
print("Training job", job_name)

use_spot_instances = True
max_run = 3600
max_wait = 7200 if use_spot_instances else None
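checkpoint_s3_uri = (
    "s3://{}/{}/checkpoints/{}".format(bucket, prefix, job_name) if use_spot_instances else None
)
print("Checkpoint path:", checkpoint_s3_uri)

estimator = sagemaker.estimator.Estimator(
    image_uri=container,
    role=role,
    hyperparameters=hyperparameters,
    instance_count=1,  # assumed value: a single training instance
    instance_type=instance_type,
    volume_size=5,  # 5 GB
    output_path=output_path,
    use_spot_instances=use_spot_instances,
    max_run=max_run,
    max_wait=max_wait,
    checkpoint_s3_uri=checkpoint_s3_uri,
)
train_input = TrainingInput(
    s3_data="s3://{}/{}/{}".format(bucket, prefix, "train"), content_type="libsvm"
)
estimator.fit({"train": train_input}, job_name=job_name)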


Towards the end of the job you should see two lines of output printed:

  • Training seconds: X : This is the actual compute-time your training job spent

  • Billable seconds: Y : This is the time you will be billed for after Spot discounting is applied.

If you enabled use_spot_instances, you should see a notable difference between X and Y, which reflects the cost savings you get from choosing Managed Spot Training. This is reported in an additional line:

  • Managed Spot Training savings: (1-Y/X)*100 %
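
As a worked example of the savings formula, the sketch below computes the percentage from X and Y. The function name and the sample numbers are made up for illustration and are not taken from a real job.

```python
def spot_savings_percent(training_seconds, billable_seconds):
    """Return (1 - Y/X) * 100, where X is training and Y is billable seconds."""
    if training_seconds <= 0:
        raise ValueError("training_seconds must be positive")
    return (1 - billable_seconds / training_seconds) * 100


# Hypothetical job: 200 training seconds, 50 billable seconds.
print(spot_savings_percent(200, 50))  # 75.0
```

With use_spot_instances disabled, X and Y are equal and the formula gives 0.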

Enabling checkpointing for script mode

An additional mode of operation is to run customizable scripts as part of the training and inference jobs. See this notebook for details on how to set up script mode.

Here we highlight the specific changes that would enable checkpointing and use Spot instances.

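Checkpointing in the framework mode for SageMaker XGBoost can be performed using two convenient functions:

  • save_checkpoint: Returns a callback that performs the checkpointing during training. It is passed to xgb.train in the callbacks argument.

  • load_checkpoint: Loads the most recent checkpoint, if any, so that training resumes from where it previously stopped.

Both functions take the checkpoint directory as input, which in the example below is set to /opt/ml/checkpoints. The primary arguments that change for the xgb.train call are: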

  1. xgb_model: This refers to the previous checkpoint (saved from a previously run partial job) obtained by load_checkpoint. This would be None if no previous checkpoint is available.

  2. callbacks: This contains a function that performs the checkpointing

The updated script looks like the following.

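CHECKPOINTS_DIR = '/opt/ml/checkpoints'   # default location for Checkpoints
callbacks = [save_checkpoint(CHECKPOINTS_DIR)]
prev_checkpoint, n_iterations_prev_run = load_checkpoint(CHECKPOINTS_DIR)
bst = xgb.train(
        params=train_hp,  # assumed name: hyperparameter dict built earlier in the script
        dtrain=dtrain,  # assumed name: training DMatrix built earlier in the script
        num_boost_round=(args.num_round - n_iterations_prev_run),
        xgb_model=prev_checkpoint,  # None if no previous checkpoint is available
        callbacks=callbacks,
    )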
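
To illustrate the resume logic, here is a minimal, self-contained sketch of how the number of remaining rounds can be derived from the checkpoints found in a directory. It is not the SageMaker XGBoost container's implementation: the helper names and the checkpoint.<round> file naming are assumptions made for this example, and the files are empty placeholders rather than real models.

```python
import os
import re
import tempfile

# Hypothetical file naming used only in this sketch: checkpoint.<round index>
CHECKPOINT_PATTERN = re.compile(r"^checkpoint\.(\d+)$")


def find_latest_checkpoint(checkpoint_dir):
    """Return (path, completed_rounds) for the newest checkpoint, or (None, 0)."""
    if not os.path.isdir(checkpoint_dir):
        return None, 0
    latest_round, latest_path = -1, None
    for name in os.listdir(checkpoint_dir):
        match = CHECKPOINT_PATTERN.match(name)
        if match and int(match.group(1)) > latest_round:
            latest_round = int(match.group(1))
            latest_path = os.path.join(checkpoint_dir, name)
    if latest_path is None:
        return None, 0
    # Rounds are indexed from 0 here, so index k means k + 1 rounds are done.
    return latest_path, latest_round + 1


def remaining_rounds(num_round, completed_rounds):
    """Rounds still to train after resuming, never negative."""
    return max(num_round - completed_rounds, 0)


with tempfile.TemporaryDirectory() as tmp_dir:
    # Fresh job: no checkpoint yet, so all rounds remain.
    print(find_latest_checkpoint(tmp_dir))
    # Simulate an interruption after 20 rounds of a 50-round job.
    for i in range(20):
        open(os.path.join(tmp_dir, f"checkpoint.{i}"), "w").close()
    path, done = find_latest_checkpoint(tmp_dir)
    print(os.path.basename(path), done, remaining_rounds(50, done))
```

The subtraction in remaining_rounds plays the same role as num_boost_round=(args.num_round - n_iterations_prev_run) in the script: a resumed job trains only the rounds that were not completed before the interruption.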

Using the SageMaker XGBoost Estimator

The XGBoost estimator class in the SageMaker Python SDK allows us to run that script as a training job on the Amazon SageMaker managed training infrastructure. We’ll also pass the estimator our IAM role, the type of instance we want to use, and a dictionary of the hyperparameters that we want to pass to our script.

[ ]:
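from sagemaker.xgboost.estimator import XGBoost

job_name = "DEMO-xgboost-regression-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
print("Training job", job_name)
checkpoint_s3_uri = f"s3://{bucket}/{prefix}/checkpoints/{job_name}" if use_spot_instances else None
print("Checkpoint path:", checkpoint_s3_uri)

xgb_script_mode_estimator = XGBoost(
    entry_point="abalone.py",  # placeholder name: replace with the path to your training script
    framework_version="1.5-1",
    hyperparameters=hyperparameters,
    role=role,
    instance_count=1,  # assumed value: a single training instance
    instance_type=instance_type,
    output_path="s3://{}/{}/{}/output".format(bucket, prefix, "xgboost-script-mode"),
    use_spot_instances=use_spot_instances,
    max_run=max_run,
    max_wait=max_wait,
    checkpoint_s3_uri=checkpoint_s3_uri,
)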

Training is as simple as calling fit on the Estimator. This will start a SageMaker Training job that will download the data, invoke the entry point code (in the provided script file), and save any model artifacts that the script creates. In this case, the script requires a train and a validation channel. Since we only created a train channel, we re-use it for validation.

[ ]:
xgb_script_mode_estimator.fit({"train": train_input, "validation": train_input}, job_name=job_name)