Ensemble Predictions From Multiple Models

Combining a Linear-Learner with XGBoost for superior predictive performance


  1. Background

  2. Preparation

  3. Data

    1. Exploration and Transformation

  4. Training XGBoost model using SageMaker

  5. Hosting the model

  6. Evaluating the model on test samples

  7. Training a second Logistic Regression model using SageMaker

  8. Hosting the Second model

  9. Evaluating the model on test samples

  10. Combining the model results

  11. Evaluating the combined model on test samples

  12. Extensions


In practical machine-learning applications, a single model is often not enough. Winning entries in prediction competitions typically combine forecasts from multiple models, because averaging predictions from several sources usually yields a better forecast than any single one of them. The reason is that the choice of model is itself uncertain, and in many practical applications there is no one true model. In the Bayesian literature this idea is referred to as Bayesian Model Averaging (http://www.stat.colostate.edu/~jah/papers/statsci.pdf) and has been shown to work much better than just picking one model.
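
To make the benefit of averaging concrete, here is a minimal simulation, not taken from the notebook. It assumes two hypothetical models whose errors are independent and equally noisy, which is an idealization; real models are usually correlated, so the gain is smaller in practice.

```python
import random


def mean_squared_error(estimates, truth):
    """Average squared difference between estimates and the true values."""
    return sum((e - t) ** 2 for e, t in zip(estimates, truth)) / len(truth)


rng = random.Random(0)  # fixed seed so the run is repeatable
truth = [rng.uniform(0.0, 1.0) for _ in range(10_000)]

# Two hypothetical models: each sees the truth plus its own independent noise.
model_a = [t + rng.gauss(0.0, 0.2) for t in truth]
model_b = [t + rng.gauss(0.0, 0.2) for t in truth]

# Simple average of the two predictions.
averaged = [(a + b) / 2 for a, b in zip(model_a, model_b)]

mse_a = mean_squared_error(model_a, truth)
mse_b = mean_squared_error(model_b, truth)
mse_avg = mean_squared_error(averaged, truth)
print(f"model A: {mse_a:.4f}  model B: {mse_b:.4f}  average: {mse_avg:.4f}")
```

With independent errors of equal variance, the averaged prediction has roughly half the squared error of either model alone.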

This notebook presents an illustrative example that predicts whether a person makes over $50K a year based on information such as their education, work experience, and gender. It covers:

  • Preparing your SageMaker notebook

  • Loading a dataset from S3 using SageMaker

  • Investigating and transforming the data so that it can be fed to SageMaker algorithms

  • Estimating a model using SageMaker’s XGBoost (Extreme Gradient Boosting) algorithm

  • Hosting the model on SageMaker to make on-going predictions

  • Estimating a second model using SageMaker’s Linear Learner method

  • Combining the predictions from both the models and evaluating the combined prediction

  • Generating final predictions on the test data set


Let’s start by specifying:

  • The SageMaker role arn used to give learning and hosting access to your data. See the documentation for how to create these. Note, if more than one role is required for notebook instances, training, and/or hosting, please replace the boto call with the appropriate full SageMaker role arn string.

  • The S3 bucket that you want to use for training and storing model objects.

[ ]:
import os
import boto3
import time
import re
import sagemaker

role = sagemaker.get_execution_role()

# Now let's define the S3 bucket we'll use for the remainder of this example.

sess = sagemaker.Session()
region = sess.boto_region_name
# Assumption: the session's default bucket is used here. Replace it with your own
# S3 bucket if you want to copy data and model artifacts somewhere else.
bucket = sess.default_bucket()
prefix = "sagemaker/DEMO-xgboost"  # place to upload training files within the bucket
print(f"output data will be stored in: {bucket}")

Now let’s bring in the Python libraries that we’ll use throughout the analysis

[ ]:
import numpy as np  # For matrix operations and numerical processing
import pandas as pd  # For munging tabular data
import sklearn as sk  # For access to a variety of machine learning models
import matplotlib.pyplot as plt  # For charts and visualizations
from IPython.display import Image  # For displaying images in the notebook
from IPython.display import display  # For displaying outputs in the notebook
from sklearn.datasets import dump_svmlight_file  # For outputting data to libsvm format for xgboost
from time import gmtime, strftime  # For labeling SageMaker models, endpoints, etc.
import sys  # For writing outputs to notebook
import math  # For ceiling function
import json  # For parsing hosting output
import io  # For working with stream data
import sagemaker.amazon.common as smac  # For protobuf data format


Let’s start by downloading the publicly available Census Income dataset from https://archive.ics.uci.edu/ml/datasets/Adult. For each person, the dataset records attributes such as age, work class, education, country, and race, together with an indicator of whether the person’s income is more than $50K a year. The prediction task is to determine whether a person makes over $50K a year.

  • Data comes in two separate files: adult.data and adult.test

  • The field names as well as additional information is available in the file adult.names

Now let’s read this into a pandas DataFrame and take a look.

[ ]:
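s3 = boto3.client("s3")

## read the data
s3.download_file(
    f"sagemaker-example-files-prod-{region}", "datasets/tabular/uci_adult/adult.data", "adult.data"
)
data = pd.read_csv("adult.data", header=None)

## read test data
s3.download_file(
    f"sagemaker-example-files-prod-{region}", "datasets/tabular/uci_adult/adult.test", "adult.test"
)
# Assumption: adult.test starts with a one-line comment that has to be skipped.
data_test = pd.read_csv("adult.test", header=None, skiprows=1)

## set column names
# Assumption: the original list is missing here. These names follow adult.names,
# with the target called IncomeGroup as in the data description below.
column_names = [
    "Age", "Workclass", "fnlwgt", "Education", "Education-Num",
    "Marital-Status", "Occupation", "Relationship", "Race", "Sex",
    "Capital-Gain", "Capital-Loss", "Hours-per-week", "Country", "IncomeGroup",
]
data.columns = column_names
data_test.columns = column_names
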

Data exploration and transformations

In what follows we will do a basic exploration of the dataset to understand the size of data, various fields it has, the values different features take, distribution of target values etc.

[ ]:
# set display options
pd.set_option("display.max_columns", 100)  # Make sure we can see all of the columns
pd.set_option("display.max_rows", 6)  # Keep the output on one page

# display data
display(data)
display(data_test)

# display positive and negative counts
display(data.iloc[:, 14].value_counts())
display(data_test.iloc[:, 14].value_counts())
[ ]:
## Combine the two datasets to convert the categorical values to binary indicators
data_combined = pd.concat([data, data_test])

## convert the categorical variables to binary indicators
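# Assumption: drop_first=True, which yields the 103 columns that the index
# arithmetic below relies on; dtype=float keeps the indicators numeric in the CSV export.
data_combined_bin = pd.get_dummies(data_combined, prefix_sep="_", drop_first=True, dtype=float)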

# combine the income >50k indicators
Income_50k = ((data_combined_bin.iloc[:, 101] == 1) | (data_combined_bin.iloc[:, 102] == 1)) + 0

# make the income indicator as first column
data_combined_bin = pd.concat([Income_50k, data_combined_bin.iloc[:, 0:100]], axis=1)

# Post conversion to binary split the data sets separately
data_bin = data_combined_bin.iloc[0 : data.shape[0], :]
data_test_bin = data_combined_bin.iloc[data.shape[0] :, :]

# display the data sets post conversion to binary indicators
display(data_bin)
display(data_test_bin)

# count number of positives and negatives
display(data_bin.iloc[:, 0].value_counts())
display(data_test_bin.iloc[:, 0].value_counts())

Data Description

Let’s talk about the data. At a high level, we can see:

  • There are 15 columns and around 32K rows in the training data

  • There are 15 columns and around 16 K rows in the test data

  • IncomeGroup is the target field

Specifics on the features:

  • 9 of the 14 features are categorical and the remaining 5 are numeric

  • When we convert the categorical features to binary indicators, there are altogether 103 - 1 = 102 features

Target variable:

  • IncomeGroup_>50K: whether or not annual income was more than 50K

XGBoost model

Train a model first using XGBoost

As our first training algorithm we pick XGBoost, an extremely popular open-source package for gradient boosted trees. It is computationally powerful, fully featured, and has been successfully used in many machine learning competitions. Let’s start with a simple XGBoost model, trained using SageMaker's managed, distributed training framework.

First we’ll need to specify training parameters. This includes:

  1. The role to use

  2. Our training job name

  3. The XGBoost algorithm container

  4. Training instance type and count

  5. S3 location for training data

  6. S3 location for output data

  7. Algorithm hyperparameters

  8. Stopping conditions

Supported training input formats: csv and libsvm. For csv input, the algorithm currently assumes that the input is delimiter-separated (the separator is detected automatically with Python’s built-in sniffer tool), has no header line, and has the label in the first column. Scoring output format: csv.

  • Since our data is in CSV format, we will arrange our dataset the way SageMaker’s XGBoost expects.

  • We will keep the target field in first column and remaining features in the next few columns

  • We will remove the header line

  • We will also split the data into a separate training and validation sets

  • Store the data into our s3 bucket
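
The layout described in the list above (label first, features after it, no header line) can be illustrated with a small standard-library sketch. The helper `to_xgboost_csv` and the toy rows are hypothetical and only show the file format; the notebook itself writes the files with pandas.

```python
import csv
import io


def to_xgboost_csv(rows, label_key):
    """Serialize dict rows as headerless CSV with the label in the first column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        features = [value for key, value in row.items() if key != label_key]
        writer.writerow([row[label_key]] + features)
    return buffer.getvalue()


# Toy records, made up for illustration.
rows = [
    {"age": 39, "hours_per_week": 40, "income_over_50k": 0},
    {"age": 52, "hours_per_week": 45, "income_over_50k": 1},
]
csv_text = to_xgboost_csv(rows, label_key="income_over_50k")
print(csv_text)
```

Each output line starts with the 0/1 target, followed by the feature values in their original order.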

Split the data into 80% training and 20% validation and save it before calling XGBoost

[ ]:
# Split the data randomly as 80% for training and remaining 20% and save them locally
train_list = np.random.rand(len(data_bin)) < 0.8
data_train = data_bin[train_list]
data_val = data_bin[~train_list]
data_train.to_csv("formatted_train.csv", sep=",", header=False, index=False)  # save training data
data_val.to_csv("formatted_val.csv", sep=",", header=False, index=False)  # save validation data
data_test_bin.to_csv("formatted_test.csv", sep=",", header=False, index=False)  # save test data

Upload training and validation data sets in the s3 bucket and prefix provided

[ ]:
train_file = "formatted_train.csv"
val_file = "formatted_val.csv"

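# Upload each local file to s3://<bucket>/<prefix>/train/ and .../val/
boto3.Session().resource("s3").Bucket(bucket).Object(
    os.path.join(prefix, "train/", train_file)
).upload_file(train_file)
boto3.Session().resource("s3").Bucket(bucket).Object(
    os.path.join(prefix, "val/", val_file)
).upload_file(val_file)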

Specify images used for training and hosting SageMaker’s XGBoost algorithm

[ ]:
from sagemaker import image_uris

xgboost_container = image_uris.retrieve(
    region=boto3.Session().region_name, framework="xgboost", version="1"
)
[ ]:
import boto3
from time import gmtime, strftime

xgboost_job_name = "DEMO-xgboost-single-censusincome-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print("Training job", xgboost_job_name)

create_training_params = {
    "AlgorithmSpecification": {"TrainingImage": xgboost_container, "TrainingInputMode": "File"},
    "RoleArn": role,
    "OutputDataConfig": {
        "S3OutputPath": "s3://{}/{}/single-xgboost/".format(bucket, prefix),
    },
    "ResourceConfig": {"InstanceCount": 1, "InstanceType": "ml.m4.4xlarge", "VolumeSizeInGB": 20},
    "TrainingJobName": xgboost_job_name,
    "HyperParameters": {
        "max_depth": "5",
        "eta": "0.1",
        "gamma": "1",
        "min_child_weight": "1",
        "silent": "0",
        "objective": "binary:logistic",
        "eval_metric": "auc",
        "num_round": "20",
    },
    "StoppingCondition": {"MaxRuntimeInSeconds": 60 * 60},
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://{}/{}/train/".format(bucket, prefix),
                    "S3DataDistributionType": "FullyReplicated",
                }
            },
            "ContentType": "csv",
            "CompressionType": "None",
        },
        {
            "ChannelName": "validation",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://{}/{}/val/".format(bucket, prefix),
                    "S3DataDistributionType": "FullyReplicated",
                }
            },
            "ContentType": "csv",
            "CompressionType": "None",
        },
    ],
}

Now let’s kick off our training job in SageMaker’s distributed, managed training, using the parameters we just created. Because training is managed, we don’t have to wait for our job to finish to continue, but for this case, let’s setup a while loop so we can monitor the status of our training.

[ ]:

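region = boto3.Session().region_name
sm = boto3.client("sagemaker")

# Submit the training job
sm.create_training_job(**create_training_params)

# Poll once a minute until the job reaches a terminal state
status = sm.describe_training_job(TrainingJobName=xgboost_job_name)["TrainingJobStatus"]
print(status)
while status not in ("Completed", "Failed", "Stopped"):
    time.sleep(60)
    status = sm.describe_training_job(TrainingJobName=xgboost_job_name)["TrainingJobStatus"]
    print(status)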
if status == "Failed":
    message = sm.describe_training_job(TrainingJobName=xgboost_job_name)["FailureReason"]
    print("Training failed with the following error: {}".format(message))
    raise Exception("Training job failed")
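
The status check above can be generalized into a small helper that also gives up after a timeout. This is a sketch under assumptions: `wait_for_terminal_status` is a hypothetical name, the timeout is not part of the original notebook, and a fake status source stands in for `sm.describe_training_job(...)["TrainingJobStatus"]` so the example runs without AWS access.

```python
import time


def wait_for_terminal_status(
    get_status,
    terminal=("Completed", "Failed", "Stopped"),
    poll_seconds=60,
    timeout_seconds=3600,
    sleep=time.sleep,
):
    """Call get_status() until it returns a terminal status or the timeout elapses."""
    waited = 0
    status = get_status()
    while status not in terminal:
        if waited >= timeout_seconds:
            raise TimeoutError(f"job still {status} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds
        status = get_status()
    return status


# Fake status source so the sketch runs offline; sleeping is disabled for the demo.
fake_statuses = iter(["InProgress", "InProgress", "Completed"])
final_status = wait_for_terminal_status(lambda: next(fake_statuses), sleep=lambda seconds: None)
print(final_status)
```

In the notebook, `get_status` would wrap the `describe_training_job` call for the job name in question.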

We can read the training and evaluation metrics from Amazon CloudWatch: train-auc is 0.916177 and validation-auc is 0.906567.


Train a second model using SageMaker’s Linear Learner

[ ]:
prefix = "sagemaker/DEMO-linear"  ##subfolder inside the data bucket to be used for Linear Learner

data_train = pd.read_csv("formatted_train.csv", sep=",", header=None)
data_test = pd.read_csv("formatted_test.csv", sep=",", header=None)
data_val = pd.read_csv("formatted_val.csv", sep=",", header=None)

train_y = data_train.iloc[:, 0].values
train_X = data_train.iloc[:, 1:].values

val_y = data_val.iloc[:, 0].values
val_X = data_val.iloc[:, 1:].values

test_y = data_test.iloc[:, 0].values
test_X = data_test.iloc[:, 1:].values;

Now, we’ll convert the datasets to the recordIO wrapped protobuf format used by the Amazon SageMaker algorithms and upload this data to S3. We’ll start with training data.

Convert to protobuf format and upload the training and validation data to s3

[ ]:
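train_file = "linear_train.data"

f = io.BytesIO()
smac.write_numpy_to_dense_tensor(f, train_X.astype("float32"), train_y.astype("float32"))
f.seek(0)  # rewind the buffer before uploading

boto3.Session().resource("s3").Bucket(bucket).Object(
    os.path.join(prefix, "train", train_file)
).upload_fileobj(f)
[ ]:
validation_file = "linear_validation.data"

f = io.BytesIO()
smac.write_numpy_to_dense_tensor(f, val_X.astype("float32"), val_y.astype("float32"))
f.seek(0)  # rewind the buffer before uploading

# Use validation_file (not train_file) for the validation object key
boto3.Session().resource("s3").Bucket(bucket).Object(
    os.path.join(prefix, "validation", validation_file)
).upload_fileobj(f)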

Training Algorithm Specifications

Now we can begin to specify our linear model. Amazon SageMaker’s Linear Learner actually fits many models in parallel, each with slightly different hyperparameters, and then returns the one with the best fit. This functionality is automatically enabled. We can influence this using parameters like:

  • num_models to increase the total number of models run. The specified parameters will always be one of those models, but the algorithm also chooses models with nearby parameter values in order to find a nearby solution that may be closer to optimal. In this case, we’re going to use the max of 32.

  • loss which controls how we penalize mistakes in our model estimates. For this case, let’s use logistic loss as we are interested in estimating probabilities.

  • wd or l1 which control regularization. Regularization can prevent model overfitting by preventing our estimates from becoming too finely tuned to the training data, which can actually hurt generalizability. In this case, we’ll leave these parameters as their default “auto” though.

Specify images used for training and hosting SageMaker’s linear-learner

[ ]:
from sagemaker import image_uris

linear_container = image_uris.retrieve(
    region=boto3.Session().region_name, framework="linear-learner", version="1"
)
[ ]:
linear_job = "DEMO-linear-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

print("Job name is:", linear_job)

linear_training_params = {
    "RoleArn": role,
    "TrainingJobName": linear_job,
    "AlgorithmSpecification": {"TrainingImage": linear_container, "TrainingInputMode": "File"},
    "ResourceConfig": {"InstanceCount": 1, "InstanceType": "ml.c4.2xlarge", "VolumeSizeInGB": 10},
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://{}/{}/train/".format(bucket, prefix),
                    "S3DataDistributionType": "ShardedByS3Key",
                }
            },
            "CompressionType": "None",
            "RecordWrapperType": "None",
        },
        {
            "ChannelName": "validation",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://{}/{}/validation/".format(bucket, prefix),
                    "S3DataDistributionType": "FullyReplicated",
                }
            },
            "CompressionType": "None",
            "RecordWrapperType": "None",
        },
    ],
    "OutputDataConfig": {"S3OutputPath": "s3://{}/{}/".format(bucket, prefix)},
    "HyperParameters": {
        "feature_dim": "100",
        "mini_batch_size": "100",
        "predictor_type": "binary_classifier",
        "epochs": "10",
        "num_models": "32",
        "loss": "logistic",
    },
    "StoppingCondition": {"MaxRuntimeInSeconds": 60 * 60},
}

Now let’s kick off our training job in SageMaker’s distributed, managed training, using the parameters we just created. Because training is managed, we don’t have to wait for our job to finish to continue, but for this case, let’s setup a while loop so we can monitor the status of our training.

[ ]:

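region = boto3.Session().region_name
sm = boto3.client("sagemaker")

# Submit the training job
sm.create_training_job(**linear_training_params)

# Poll once a minute until the job reaches a terminal state
status = sm.describe_training_job(TrainingJobName=linear_job)["TrainingJobStatus"]
print(status)
while status not in ("Completed", "Failed", "Stopped"):
    time.sleep(60)
    status = sm.describe_training_job(TrainingJobName=linear_job)["TrainingJobStatus"]
    print(status)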
if status == "Failed":
    message = sm.describe_training_job(TrainingJobName=linear_job)["FailureReason"]
    print("Training failed with the following error: {}".format(message))
    raise Exception("Training job failed")


Now that we’ve trained both the models on our data, let’s get them hosted. We will:

  1. Point to the scoring containers

  2. Point to the model.tar.gz that came from training

  3. Create the hosting model with both containers using SageMaker multi-container endpoints

[ ]:
model_name = "DEMO-MODEL-for-ensemble-modelling-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
xgboost_hosting_container = {
    "Image": xgboost_container,
    "ContainerHostname": "xgboost",
    "ModelDataUrl": sm.describe_training_job(TrainingJobName=xgboost_job_name)["ModelArtifacts"][
        "S3ModelArtifacts"
    ],
}

linear_hosting_container = {
    "Image": linear_container,
    "ContainerHostname": "linear",
    "ModelDataUrl": sm.describe_training_job(TrainingJobName=linear_job)["ModelArtifacts"][
        "S3ModelArtifacts"
    ],
}

# "Direct" mode lets each request target one container by its hostname
inferenceExecutionConfig = {"Mode": "Direct"}

create_model_response = sm.create_model(
    ModelName=model_name,
    Containers=[xgboost_hosting_container, linear_hosting_container],
    InferenceExecutionConfig=inferenceExecutionConfig,
    ExecutionRoleArn=role,
)

Once we’ve set up a model, we can configure what our hosting endpoints should be. Here we specify:

  1. EC2 instance type to use for hosting

  2. Initial number of instances

  3. Our hosting model name

[ ]:
from time import gmtime, strftime

endpoint_config_name = "DEMO-ENDPOINT-CONFIG-for-ensemble-modelling-" + strftime(
    "%Y-%m-%d-%H-%M-%S", gmtime()
)
create_endpoint_config_response = sm.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "InstanceType": "ml.m4.xlarge",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 1,
            "ModelName": model_name,
            "VariantName": "AllTraffic",
        }
    ],
)

print("Endpoint Config Arn: " + create_endpoint_config_response["EndpointConfigArn"])

Create endpoint

Lastly, we create the endpoint that serves the model by specifying the name and configuration defined above. The end result is an endpoint that can be validated and incorporated into production applications. This takes 9-11 minutes to complete.

[ ]:
import time

endpoint_name = "DEMO-ENDPOINT-for-ensemble-modelling-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
create_endpoint_response = sm.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)
print(create_endpoint_response["EndpointArn"])

resp = sm.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)  # wait between polls instead of calling the API in a tight loop
    resp = sm.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)

Evaluation - XGBoost

There are many ways to compare the performance of a machine learning model. In this example, we will generate predictions and compare the ranking metric AUC (Area Under the ROC Curve).
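
As a reminder of what the metric measures, AUC equals the fraction of (positive, negative) pairs in which the positive example receives the higher score. The sketch below computes it directly from that definition on made-up toy data; `pairwise_auc` is a hypothetical helper for illustration, and the notebook itself uses `sklearn.metrics.roc_auc_score`.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data: the positive scored 0.35 is ranked below the negative scored 0.4,
# so 3 of the 4 positive/negative pairs are ordered correctly.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)
```

Because AUC depends only on the ordering of the scores, it is unaffected by any monotone rescaling of a model's predictions. The pairwise loop is quadratic in the number of rows, so it is only suitable for small illustrations.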

[ ]:
runtime = boto3.client("runtime.sagemaker")
[ ]:
# Simple function to create a csv from our numpy array

def np2csv(arr):
    csv = io.BytesIO()
    np.savetxt(csv, arr, delimiter=",", fmt="%g")
    return csv.getvalue().decode().rstrip()
[ ]:
# Function to generate prediction through sample data
def do_predict(data, endpoint_name, content_type):
    payload = np2csv(data)
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType=content_type,
        Body=payload,
        TargetContainerHostname="xgboost",  # route the request to the XGBoost container
    )
    result = response["Body"].read()
    result = result.decode("utf-8")
    result = result.split(",")
    preds = [float((num)) for num in result]
    return preds


# Function to iterate through a larger data set and generate batch predictions
def batch_predict(data, batch_size, endpoint_name, content_type):
    items = len(data)
    arrs = []

    for offset in range(0, items, batch_size):
        if offset + batch_size < items:
            datav = data.iloc[offset : (offset + batch_size), :].values
            results = do_predict(datav, endpoint_name, content_type)
            arrs.extend(results)
        else:
            # last (possibly shorter) batch
            datav = data.iloc[offset:items, :].values
            arrs.extend(do_predict(datav, endpoint_name, content_type))
    return arrs
[ ]:
### read the saved data for scoring
data_train = pd.read_csv("formatted_train.csv", sep=",", header=None)
data_test = pd.read_csv("formatted_test.csv", sep=",", header=None)
data_val = pd.read_csv("formatted_val.csv", sep=",", header=None)

Generate predictions on train, validation and test sets

[ ]:
preds_train_xgb = batch_predict(data_train.iloc[:, 1:], 1000, endpoint_name, "text/csv")
preds_val_xgb = batch_predict(data_val.iloc[:, 1:], 1000, endpoint_name, "text/csv")
preds_test_xgb = batch_predict(data_test.iloc[:, 1:], 1000, endpoint_name, "text/csv")

Compute performance metrics on the training, validation, and test data sets

Compute AUC/Gini

[ ]:
from sklearn.metrics import roc_auc_score

train_labels = data_train.iloc[:, 0]
val_labels = data_val.iloc[:, 0]
test_labels = data_test.iloc[:, 0]

print("Training AUC", roc_auc_score(train_labels, preds_train_xgb))  ##0.9161
print("Validation AUC", roc_auc_score(val_labels, preds_val_xgb))  ###0.9065
print("Test AUC", roc_auc_score(test_labels, preds_test_xgb))  ###0.9112

Evaluation - Linear-Learner

Predict using SageMaker’s Linear Learner and evaluate the performance

Now that we have our hosted endpoint, we can generate statistical predictions from it. Let’s predict on our test dataset to understand how accurate our model is on unseen samples using AUC metric.

[ ]:
def np2csv(arr):
    csv = io.BytesIO()
    np.savetxt(csv, arr, delimiter=",", fmt="%g")
    return csv.getvalue().decode().rstrip()
[ ]:
# Function to generate prediction through sample data
def do_predict_linear(data, endpoint_name, content_type):
    payload = np2csv(data)
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType=content_type,
        Body=payload,
        TargetContainerHostname="linear",  # route the request to the Linear Learner container
    )
    result = json.loads(response["Body"].read().decode())
    preds = [r["score"] for r in result["predictions"]]

    return preds


# Function to iterate through a larger data set and generate batch predictions
def batch_predict_linear(data, batch_size, endpoint_name, content_type):
    items = len(data)
    arrs = []

    for offset in range(0, items, batch_size):
        if offset + batch_size < items:
            datav = data.iloc[offset : (offset + batch_size), :].values
            results = do_predict_linear(datav, endpoint_name, content_type)
            arrs.extend(results)
        else:
            # last (possibly shorter) batch
            datav = data.iloc[offset:items, :].values
            arrs.extend(do_predict_linear(datav, endpoint_name, content_type))
    return arrs
[ ]:
### Predict on Training Data
preds_train_lin = batch_predict_linear(data_train.iloc[:, 1:], 100, endpoint_name, "text/csv")
[ ]:
### Predict on Validation Data
preds_val_lin = batch_predict_linear(data_val.iloc[:, 1:], 100, endpoint_name, "text/csv")
[ ]:
### Predict on Test Data
preds_test_lin = batch_predict_linear(data_test.iloc[:, 1:], 100, endpoint_name, "text/csv")

Compute performance metrics on the training, validation, and test data sets

Compute AUC/Gini

[ ]:
print("Training AUC", roc_auc_score(train_labels, preds_train_lin))  ##0.9091
print("Validation AUC", roc_auc_score(val_labels, preds_val_lin))  ###0.8998
print("Test AUC", roc_auc_score(test_labels, preds_test_lin))  ###0.9033


Perform a simple average of the two models and evaluate on the training, validation, and test sets

[ ]:
ens_train = 0.5 * np.array(preds_train_xgb) + 0.5 * np.array(preds_train_lin)
ens_val = 0.5 * np.array(preds_val_xgb) + 0.5 * np.array(preds_val_lin)
ens_test = 0.5 * np.array(preds_test_xgb) + 0.5 * np.array(preds_test_lin);


Evaluate the combined ensemble model

[ ]:
# Print AUC of the combined model
print("Train AUC- Xgboost", round(roc_auc_score(train_labels, preds_train_xgb), 5))
print("Train AUC- Linear", round(roc_auc_score(train_labels, preds_train_lin), 5))
print("Train AUC- Ensemble", round(roc_auc_score(train_labels, ens_train), 5))

print("Validation AUC- Xgboost", round(roc_auc_score(val_labels, preds_val_xgb), 5))
print("Validation AUC- Linear", round(roc_auc_score(val_labels, preds_val_lin), 5))
print("Validation AUC- Ensemble", round(roc_auc_score(val_labels, ens_val), 5))

print("Test AUC- Xgboost", round(roc_auc_score(test_labels, preds_test_xgb), 5))
print("Test AUC- Linear", round(roc_auc_score(test_labels, preds_test_lin), 5))
print("Test AUC- Ensemble", round(roc_auc_score(test_labels, ens_test), 5))

Save Final prediction on test-data

[ ]:
final = pd.concat([data_test.iloc[:, 0], pd.DataFrame(ens_test)], axis=1)
final.to_csv("Xgboost-linear-ensemble-prediction.csv", sep=",", header=False, index=False)

This example analyzed a relatively small dataset, but utilized SageMaker features such as:

  • Managed single-machine training of an XGBoost model

  • Managed training of Linear Learner

  • Highly available, real-time model hosting

  • Batch prediction using the hosted model

  • An ensemble of XGBoost and Linear Learner

This example can be extended in several ways using SageMaker features such as:

  • Distributed training of the XGBoost/Linear model

  • Picking a different model for training

  • Training a separate model for performing the ensemble instead of taking a simple average
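
As a first step beyond the simple average, the blend weight itself can be chosen on the validation set. The sketch below is a deliberately simpler stand-in for training a separate ensembling model: it grid-searches a single weight by AUC. The helper names and the toy scores are made up for illustration and are not part of the notebook.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def best_blend_weight(labels, scores_a, scores_b, steps=10):
    """Grid-search w in [0, 1] for w * scores_a + (1 - w) * scores_b, maximizing AUC."""
    best_w, best_auc = 0.0, -1.0
    for i in range(steps + 1):
        w = i / steps
        blended = [w * a + (1 - w) * b for a, b in zip(scores_a, scores_b)]
        auc = pairwise_auc(labels, blended)
        if auc > best_auc:
            best_w, best_auc = w, auc
    return best_w, best_auc


# Toy validation labels and scores, made up for illustration.
toy_val_labels = [0, 0, 1, 1, 0, 1]
toy_scores_xgb = [0.2, 0.6, 0.7, 0.9, 0.1, 0.4]
toy_scores_lin = [0.3, 0.2, 0.4, 0.8, 0.5, 0.6]

weight, blended_auc = best_blend_weight(toy_val_labels, toy_scores_xgb, toy_scores_lin)
print(f"weight on first model: {weight:.1f}, validation AUC: {blended_auc:.3f}")
```

Because the grid includes weights 0 and 1, the selected blend is never worse on the validation set than either model alone. The weight should still be checked on the test set, since it is tuned on validation data.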

