Implementing a Recommender System for Implicit Feedback Datasets with Bayesian Personalized Ranking

Classes of recommender systems:

  1. Content-based recommender systems: Rely on a product’s features, attributes, and descriptions to recommend other products similar to those a user has purchased or rated through explicit feedback.

  2. Personalized Ranking-based recommender systems: Recommend the top-n items for a particular user along with a ranking and a score.

  3. Collaborative Filtering-based recommender systems: Rely on the user’s prior item interactions or ratings to make a recommendation.

    1. User-based: Operate by finding other like-minded users.

    2. Item-based: Operate on the similarity between items, assessed from users’ ratings of, or interactions with, those items.

  4. Location/Demographics: Use demographic information about users to learn a classifier that maps particular demographics to ratings or purchasing propensities.

  5. Hybrid: A blend of more than one of the above strategies.
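
As a rough illustration of the item-based idea, the sketch below scores two items as similar when largely the same users interacted with both. The function name `item_cosine_similarity` and the toy interaction data are hypothetical; cosine similarity over user sets is one common choice, not necessarily what any particular product uses.

```python
import math


def item_cosine_similarity(interactions, item_a, item_b):
    """Cosine similarity between two items, based on the sets of users who interacted with them."""
    users_a = {user for user, items in interactions.items() if item_a in items}
    users_b = {user for user, items in interactions.items() if item_b in items}
    if not users_a or not users_b:
        return 0.0
    # For binary interactions, the dot product is the number of shared users.
    return len(users_a & users_b) / math.sqrt(len(users_a) * len(users_b))


# Toy data: user -> set of items the user interacted with.
interactions = {
    "u1": {"lamp", "bowl"},
    "u2": {"lamp", "bowl", "cup"},
    "u3": {"doormat"},
}
print(item_cosine_similarity(interactions, "lamp", "bowl"))
print(item_cosine_similarity(interactions, "lamp", "doormat"))
```

Items bought by exactly the same users score 1.0, and items with no users in common score 0.0.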

Explicit vs. Implicit User Feedback:

  1. Explicit Feedback: Data that users deliberately provide to the system to express their preferences. Examples include movie ratings on Netflix, provided explicitly by the users, or user ratings of products.

  2. Implicit Feedback: Rather than relying on explicit user feedback, the system can indirectly utilize user behavior and interactions to learn about their interests and choices. This implicit feedback is handy in many different domains. For instance, the system may treat a user purchasing or browsing an item, or the number of times they played a particular song, as an endorsement of that item.
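
To make the link to Bayesian Personalized Ranking concrete, the sketch below shows one common way implicit feedback becomes a training signal: each observed (user, item) interaction is paired with an item the user has not interacted with, on the assumption that the observed item should rank higher. The function name `sample_bpr_triples` and the toy data are hypothetical and are not taken from the Implicit BPR listing.

```python
import random


def sample_bpr_triples(interactions, all_items, n_samples, seed=0):
    """Sample (user, positive_item, negative_item) triples from implicit feedback.

    interactions maps each user to the set of items that user interacted with.
    """
    rng = random.Random(seed)
    # Only users with at least one observed and one unobserved item can form a triple.
    users = sorted(u for u, items in interactions.items() if 0 < len(items) < len(all_items))
    triples = []
    for _ in range(n_samples):
        user = rng.choice(users)
        positive = rng.choice(sorted(interactions[user]))
        negative = rng.choice(sorted(set(all_items) - interactions[user]))
        triples.append((user, positive, negative))
    return triples


interactions = {"u1": {"lamp", "bowl"}, "u2": {"doormat"}}
all_items = ["lamp", "bowl", "doormat", "cup"]
triples = sample_bpr_triples(interactions, all_items, n_samples=5)
for triple in triples:
    print(triple)
```

A ranking model trained on such triples learns to score the positive item above the negative one for that user.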

About this Notebook

  1. Pre-requisites: This example notebook requires a subscription to the Implicit BPR listing on AWS Marketplace, which provides an implementation of an algorithm based on Bayesian Personalized Ranking.

  2. This notebook demonstrates how to use the Implicit BPR algorithm with Amazon SageMaker to collect, analyze, clean, and prepare the data, then train and deploy the model to perform both batch and real-time inference on the Online Retail Data Set.

  3. The Online Retail Data Set holds all the transactions that occurred for a UK-based and registered, non-store online retailer between 01/12/2009 and 09/12/2011. The company mainly sells unique all-occasion gift-ware, and many of its customers are wholesalers.

  4. Inside the AWS SageMaker Studio, start the “Data Science” kernel, which powers all of our notebook interactions:

    1. Click on “No Kernel” in the Upper Right
    2. Select the “Data Science Kernel”
    3. Confirm the Kernel is Started in Upper Right
  5. If you want to run this notebook on AWS SageMaker Notebook Instances:

    1. Please use Classic Jupyter mode so that visualizations render correctly. Pick instance type ‘ml.c5.2xlarge’ or larger.
    2. Set kernel to ‘conda_python3’
  6. You can run this notebook one cell at a time (by using Shift+Enter to run a cell) or run all the cells at once by selecting “Run All Cells”.

Note: Wait until the kernel has started before continuing.

Step 1: Pre-requisites: subscribe to Implicit BPR Algorithm from AWS Marketplace

  1. Open **Implicit BPR** listing from AWS Marketplace in your browser.

  2. Read the Highlights section and then the product overview section of the listing.

  3. View usage information and then additional resources.

  4. Note the supported instance types and specify the same in the following cell.

  5. Next, click on “Continue to Subscribe”. You will now see the “Subscribe to this software” page.

  6. Review the End User License Agreement (EULA), support terms, and pricing information.

  7. Next, click the “Accept Offer” button only if your organization agrees with the EULA, pricing information, and support terms. Once you have clicked “Accept Offer”, specify the compatible training and inference instance types you wish to use.

  8. Once you click the “Continue to Configuration” button and choose a region, you will see a Product ARN displayed. This is the algorithm ARN that you need to specify in the following cell.


  1. If the “Continue to Configuration” button is active, your account already has a subscription to this listing.

  2. Once you click the “Continue to Configuration” button and choose a region, a Product ARN will appear. This is the algorithm ARN that you need to specify in your training job.

[ ]:
algorithm_arn = "arn:aws:sagemaker:us-east-1:865070037744:algorithm/implicit-bpr-36-3af996544083749e141ffc9ef7d99399"
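
Because the ARN shown on the configuration page depends on the region you choose, it can help to check which region an ARN refers to before starting a training job. The helper below is a minimal sketch that only splits the string on the standard ARN separators; `parse_algorithm_arn` is a hypothetical name, not part of any AWS SDK.

```python
def parse_algorithm_arn(arn):
    """Split an ARN of the form arn:partition:service:region:account:type/name."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError("Not a valid ARN: " + arn)
    resource_type, _, resource_name = parts[5].partition("/")
    return {
        "partition": parts[1],
        "service": parts[2],
        "region": parts[3],
        "account": parts[4],
        "resource_type": resource_type,
        "resource_name": resource_name,
    }


arn_parts = parse_algorithm_arn(
    "arn:aws:sagemaker:us-east-1:865070037744:algorithm/implicit-bpr-36-3af996544083749e141ffc9ef7d99399"
)
print(arn_parts["region"], arn_parts["resource_type"])
```

Comparing `arn_parts["region"]` with the region of your SageMaker session is a quick way to catch a mismatch early.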

Step 2: Set up environment

[ ]:
# Install necessary libraries and their required versions. Please ignore all WARNINGs and ERRORs from the pip install's below.
import sys

!{sys.executable} -m pip install --disable-pip-version-check -q pandas==1.1.5
!{sys.executable} -m pip install --disable-pip-version-check -q numpy==1.19.5
[ ]:
# Import necessary libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import time
import sagemaker
import json
import boto3

from sagemaker import AlgorithmEstimator
from sagemaker import get_execution_role, local, Model, utils, fw_utils, s3
from sagemaker.predictor import json_serializer
from sagemaker.analytics import TrainingJobAnalytics
from sklearn.model_selection import train_test_split
from botocore.exceptions import ClientError
from io import StringIO
from urllib.parse import urlparse
from IPython.display import Markdown as md, display

# Print settings
pd.set_option("display.max_columns", 500)
pd.set_option("display.max_rows", 10)

# Account/Role Setup
sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
role_arn = get_execution_role()

boto3_sm_runtime = boto3.client("sagemaker-runtime")
boto3_sm_client = boto3.client("sagemaker")
boto_s3_client = sagemaker_session.boto_session.client("s3")
[ ]:
# S3 Setup
s3_bucket = sagemaker_session.default_bucket()
s3_common_prefix = "sagemaker/implicit-bpr"
s3_training_prefix = s3_common_prefix + "/training"
s3_testing_prefix = s3_common_prefix + "/test"
s3_training_input_prefix = s3_training_prefix + "/data"
s3_testing_input_prefix = s3_testing_prefix + "/data"
s3_training_jobs_prefix = s3_training_prefix + "/jobs"
s3_training_data_file_name = "cleaned_online_retail_train_data.csv"
s3_test_data_file_name = "cleaned_online_retail_test_data.csv"

# S3 batch request inputs
s3_batch_input_dir_prefix = s3_common_prefix + "/batch-inference/jobs"
s3_batch_request_file_name = "recommendation.requests"

# Construct training and transform job Name
job_name_prefix = "implicit-bpr-online-retail-training"
job_output_path = "s3://{}/{}/{}/".format(s3_bucket, s3_training_jobs_prefix, job_name_prefix)
transform_job_name_prefix = "implicit-bpr-online-retail-batch-transform"

# Define the different ML instance types
compatible_training_instance_type = "ml.c5.2xlarge"
compatible_batch_transform_instance_type = "ml.c5.2xlarge"
compatible_real_time_inference_instance_type = "ml.c5.2xlarge"

# Reference to the data directory
DATA_DIR = "data"
# Reference to the original dataset
DATASET_DIR = DATA_DIR + "/" + "dataset"

# Construct the training directory to hold the training data
TRAINING_WORKDIR = DATA_DIR + "/" + "training"
train_data_file = TRAINING_WORKDIR + "/" + s3_training_data_file_name

# Construct the testing directory to hold the testing data
TEST_WORKDIR = DATA_DIR + "/" + "testing"
test_data_file = TEST_WORKDIR + "/" + s3_test_data_file_name

# Construct directory to hold the batch transform request payload
BATCH_REQUEST_WORKDIR = DATA_DIR + "/" + "batch-requests"
batch_request_data_file = BATCH_REQUEST_WORKDIR + "/" + s3_batch_request_file_name
[ ]:
# Create above directories on the Notebook which will be used to hold the data for training, testing and the batch requests payload
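
One way to create these working directories is `os.makedirs` with `exist_ok=True`, which is safe to re-run. The sketch below is an assumption about how this step could be done, not the notebook's own code; `create_work_dirs` is a hypothetical helper, and a temporary directory stands in for `DATA_DIR` so that the example has no side effects.

```python
import os
import tempfile


def create_work_dirs(base_dir, subdirs):
    """Create each working directory under base_dir, skipping ones that already exist."""
    created = []
    for name in subdirs:
        path = os.path.join(base_dir, name)
        os.makedirs(path, exist_ok=True)
        created.append(path)
    return created


# A temporary directory stands in for DATA_DIR in this sketch.
base_dir = tempfile.mkdtemp()
work_dirs = create_work_dirs(base_dir, ["dataset", "training", "testing", "batch-requests"])
print(work_dirs)
```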

Let us define utility functions we can use later to print the purchase information in human readable format.

[ ]:
# This function prints the top <limit_top_rows> purchases for a given Customer ID <customer_id> from the original data set
def display_original_purchase_history(original_purchase_df, customer_id, limit_top_rows=5):
    original_purchases = original_purchase_df.loc[original_purchase_df["CustomerID"] == customer_id]
    original_purchases = original_purchases[
        ["CustomerID", "StockCode", "Description", "Quantity", "Invoice", "InvoiceDate"]
    ]
    display(
        md(
            "**[ <u>Top {} Original Purchase History</u> ] for a Customer ID : {}**".format(
                limit_top_rows, customer_id
            )
        )
    )
    return original_purchases.head(limit_top_rows).style.hide_index()

# Function takes a dataframe containing the inference results from either the batch transform or the real-time endpoint for a given Customer ID
# Performs a join to the product lookup table to pull the product descriptions and displays the results
def display_inference_result(inference_result_df, customer_id, inference_type):
    inference_result_df = inference_result_df.rename(
        columns={"user_id": "CustomerID", "item_id": "StockCode", "score": "Recommendation Score"}
    )
    inference_result_df["StockCode"] = inference_result_df.StockCode.astype(str)
    stock_code_desc = stock_code_desc_look_up.groupby(["StockCode"]).agg(lambda x: x.iloc[0])[
        ["Description"]
    ]
    inference_result_df = inference_result_df.join(stock_code_desc, on="StockCode")
    inference_result_df = inference_result_df[
        ["CustomerID", "StockCode", "Description", "Recommendation Score"]
    ]

    # Compare strings with ==, not "is", which tests object identity
    if inference_type == "batch":
        inference_result_df = inference_result_df.loc[
            inference_result_df["CustomerID"] == customer_id
        ]
        display(
            md(
                "**[ <u>Batch Transform</u> ] Recommended Items with the Ranking for a Customer ID : {}**".format(
                    customer_id
                )
            )
        )
    elif inference_type == "realtime":
        display(
            md(
                "**[ <u>Real-Time Inference</u> ] Recommended Items with the Ranking for a Customer ID : {}**".format(
                    customer_id
                )
            )
        )
    return display(inference_result_df)

Step 3: Data collection and preparation

The Online Retail Data Set you will use is provided by the UCI Machine Learning Repository. The dataset contains all the transactions that occurred for a UK-based and registered, non-store online retailer from 01/12/2009 to 09/12/2011. The company sells unique all-occasion gift-ware. Many customers of the company are wholesalers.

Facts regarding the downloaded dataset:

  1. The spreadsheet holds two separate sheets.

  2. The first sheet holds the transactions from 2009-2010, and the second holds the transactions from 2010-2011.

  3. Each sheet comprises more than 500k instances, so combined there are 1067371 records to explore and prepare for our use case.

  4. Dataset Attribute Information:

    1. CustomerID (Nominal): Customer number. A 5-digit integral number uniquely assigned to each customer.

    2. StockCode (Nominal): Product (item) code. A 5-digit integral number uniquely assigned to each distinct product.

    3. Description (Nominal): Product (item) name.

    4. Price (Numeric): Unit price. Product price per unit in sterling (£).

    5. Quantity (Numeric): The quantity of each product (item) per transaction.

    6. Invoice (Nominal): Invoice number. A 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter ‘c’, it indicates a cancellation.

    7. InvoiceDate (Numeric): Invoice date and time. The day and time when the transaction was generated.

    8. Country (Nominal): Country name. The name of the country where a customer resides.

Step 3.1: Ingesting the Online Retail Dataset and loading it into a pandas dataframe

[ ]:
# First, you will download the dataset in a data folder using the cell below.
!cd $DATASET_DIR && wget -N ''

Next, you will load the downloaded dataset into a pandas dataframe and inspect the first five records and their data types. The estimated wall time for the cell below is around 2min 20s.

[ ]:
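# This takes some time, as close to 1 million transactions are loaded from the downloaded file into a dataframe
raw_excel_data = pd.read_excel(
    DATASET_DIR + "/online_retail_II.xlsx",
    sheet_name=None, engine='openpyxl',
)

# Since the file contains two different sheets from two years, let us combine them into a single dataframe
online_retail_data = pd.concat(raw_excel_data, axis=0, ignore_index=True)
# Assumption: the raw sheets label the customer column "Customer ID"; normalize it to "CustomerID"
online_retail_data = online_retail_data.rename(columns={"Customer ID": "CustomerID"})

# Print the newly created pandas dataframe's dimensionality and first five records
online_retail_data = online_retail_data[
    ["CustomerID", "StockCode", "Description", "Price", "Quantity", "Invoice", "InvoiceDate", "Country"]
]
print(online_retail_data.shape)
online_retail_data.head()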

As you can see, the data set has 1067371 instances and eight different features.

Step 3.2: Exploring, cleansing and converting dataset into the format accepted by an algorithm

The user-item-interaction data is critical for getting started with the recommender system and its training for many use-cases such as Video-on-demand applications, user click-stream logs, user’s purchase history, etc. No matter the use case, the algorithms all share a base of learning on user-item-interaction data, which is defined by two core attributes:

1. user_id - The user who interacted
2. item_id - The item the user interacted with

The Implicit BPR Algorithm requires the training dataset to contain 'user_id' and 'item_id' columns. In this case, the user is identified by *``CustomerID``*, and the item they have purchased or interacted with is identified by *``StockCode``*. Additionally, these columns must not include any missing values, and the input file must be in CSV format.
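
The requirements above (both columns present, no missing values, CSV format) can be checked before uploading anything. The following is a minimal sketch using only the standard library; `validate_training_csv` and the toy rows are hypothetical, and the check covers only the constraints stated in this section.

```python
import csv
import io

REQUIRED_COLUMNS = ("user_id", "item_id")


def validate_training_csv(csv_text):
    """Return a list of problems found in the CSV text; an empty list means none were found."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [column for column in REQUIRED_COLUMNS if column not in (reader.fieldnames or [])]
    if missing:
        return ["missing column(s): " + ", ".join(missing)]
    problems = []
    # Data rows start on line 2 because line 1 is the header.
    for line_number, row in enumerate(reader, start=2):
        for column in REQUIRED_COLUMNS:
            if not (row[column] or "").strip():
                problems.append("line {}: empty {}".format(line_number, column))
    return problems


good_csv = "user_id,item_id\nu1,i1\nu2,i2\n"
bad_csv = "user_id,item_id\nu1,\n"
print(validate_training_csv(good_csv))
print(validate_training_csv(bad_csv))
```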


Let us check the dataset for null values.

[ ]:
# Checking data types and the missing values in the data
print("[INFO]: Dataframe with the missing Customer IDs, its datatype values from the original dataset")
online_retail_data.info()

You can see that the dataset has 243007 missing values in the *``CustomerID``* column. Next, you will clean up the dataset by eliminating the rows with missing values.

[ ]:
# Removing the rows that do not have a Customer ID (copy to avoid modifying a view of the original frame)
online_retail_data = online_retail_data.loc[online_retail_data.CustomerID.notnull()].copy()
# Convert to int for customer ID
online_retail_data["CustomerID"] = online_retail_data.CustomerID.astype(int)
# Validate data types and the missing values in the data post eliminating the missing rows
print(
    "\n[INFO]: Dataframe post eliminating Customer IDs, its datatype values from the original dataset"
)
online_retail_data.info()
print(
    "\n[INFO]: The dataset now has no null rows for the CustomerID and has approximately 824364 occurrences to analyze further and train our model"
)

Step 3.3: Preparing the final training dataset and upload it to Amazon S3

To better visualize our model’s recommendations, let us assemble a product lookup table that we can use later to map to the inference results. We can also eliminate the optional columns that are not needed for training. Finally, let us rename the columns ``"CustomerID"`` -> ``"user_id"`` and ``"StockCode"`` -> ``"item_id"``, as expected by the algorithm specification.

[ ]:
# Build a lookup table for stock code information
stock_code_desc_look_up = online_retail_data[["StockCode", "Description"]].drop_duplicates()
stock_code_desc_look_up["StockCode"] = stock_code_desc_look_up.StockCode.astype(str)
print("[INFO]: Stock Code lookup table")
display(stock_code_desc_look_up.head())

# Remove the optional columns which are not needed for the training
cleaned_online_retail_data = online_retail_data
cleaned_online_retail_data = cleaned_online_retail_data[["CustomerID", "StockCode"]]

# Lastly, let us rename the columns "CustomerID" -> "user_id" and "StockCode" -> "item_id" as required by the algorithm specification
cleaned_online_retail_data = cleaned_online_retail_data.rename(
    columns={"CustomerID": "user_id", "StockCode": "item_id"}
)
print("[INFO]: Head of the final dataset post renaming the headers of the columns")
display(cleaned_online_retail_data.head())

print(
    "[INFO] Our dataset is ultimately ready and satisfies all the required algorithm specification. We hold approximately {} user-item interactions to train our model.".format(
        cleaned_online_retail_data.shape[0]
    )
)

Next, let us split the dataset into training and testing sets that you can use to train the model and evaluate its performance.

[ ]:
# Split the dataset into the training and the testing with 70% for training and 30% for testing
train_set, test_set = train_test_split(
    cleaned_online_retail_data, train_size=0.70, test_size=0.30, random_state=41
)

print(
    "[INFO] The size of the training dataset is {}, and the size of the testing dataset is {}.".format(
        len(train_set.index), len(test_set.index)
    )
)

Let us create a CSV file for both the training and the testing dataset and upload them to the S3 bucket

[ ]:
train_set[["user_id", "item_id"]].to_csv(train_data_file, index=False)

# Upload the training dataset to S3
training_input = sagemaker_session.upload_data(
    TRAINING_WORKDIR, s3_bucket, key_prefix=s3_training_input_prefix
)
print("[INFO] Uploaded training data location " + training_input)
[ ]:
test_set[["user_id", "item_id"]].to_csv(test_data_file, index=False)

# Upload the test dataset to S3
test_input = sagemaker_session.upload_data(
    TEST_WORKDIR, s3_bucket, key_prefix=s3_testing_input_prefix
)
print("[INFO] Uploaded testing data location " + test_input)

Congratulations. You have ingested and explored the data and generated a clean training dataset file that meets the algorithm’s requirements. You have also uploaded it to the S3 bucket, where it can be used for training a model.

Step 4: Train the model and evaluate the performance metrics

Step 4.1: Train the model

To train a model, you create a training job. After you start the training job, SageMaker launches the ML compute instances and uses the training code you provided to train the model. It then saves the resulting model artifacts and other output in the S3 bucket.

Next, let us form and start a training job with the training dataset we uploaded to the S3 bucket and wait for the completion. The estimated wall time for the below cell is around ~4min 20s.

[ ]:
timestamp = time.strftime("-%Y-%m-%d-%H-%M-%S", time.gmtime())
job_name = job_name_prefix + timestamp

print("[INFO] Creating a training job with name: " + job_name)

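# Configure an Estimator (no training happens yet)
# Assumption: the arguments below are built from the variables defined in Steps 1 and 2
estimator = AlgorithmEstimator(
    algorithm_arn, role_arn, 1, compatible_training_instance_type,
    sagemaker_session=sagemaker_session, output_path=job_output_path,
)

inputs = {"training": training_input, "testing": test_input}
# Starts a SageMaker training job and waits until completion
estimator.fit(inputs, logs="Training", job_name=job_name)

print("[INFO] Training the model has been completed successfully.")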
print("[INFO] Model artifact can be found at: " + estimator.output_path + job_name)

Step 4.2: Evaluate and visualize the performance metrics

In recommender systems, you are most often interested in promoting the top-N items to users. It is therefore more useful to measure precision and recall over the top-N items than over all the items. This leads to the idea of precision and recall at k, where k is a user-defined integer chosen to match the top-N recommendation objective.

In other words, out of all the top N items the system would recommend, how many are relevant to the user? You can visualize the metrics p@k(10) produced from the training job inline using the Amazon SageMaker Python SDK APIs from the next cell.

[ ]:
# Training Job Name
training_job_name = job_name

# Metric name as per the algorithm specifications
metric_name = "p@k(10)"
# Retrieve the Training job details and build a plot
metrics_dataframe = TrainingJobAnalytics(
    training_job_name=training_job_name, metric_names=[metric_name]
).dataframe()
# Use "ax" rather than "plt" so the matplotlib.pyplot import is not shadowed
ax = metrics_dataframe.plot(
    x="timestamp",
    y="value",
    title="Precision at 10 in a top-10 Recommendation",
    figsize=(10, 5),
)
ax.set_ylabel("Average precision at k(10)");

You can see from the training job logs that the algorithm achieved a precision at 10 of about 83% on this top-10 recommendation problem. This means that about 83% of the recommendations the system made are relevant to the user.
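
To make the metric concrete, here is a minimal sketch of precision and recall at k for a single user. The function names and toy item lists are hypothetical, and the algorithm's own metric may be computed differently (for example, averaged over users); the sketch divides by k, which is one common convention.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k


def recall_at_k(recommended, relevant, k):
    """Fraction of the relevant items that appear in the top-k recommendations."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant)


recommended = ["lamp", "cup", "bowl", "doormat", "banner"]  # ranked, best first
relevant = {"lamp", "bowl", "garland"}  # items the user actually interacted with
print(precision_at_k(recommended, relevant, 5))
print(recall_at_k(recommended, relevant, 5))
```

Here two of the five recommendations are relevant, so precision at 5 is 2/5, and two of the three relevant items were recommended, so recall at 5 is 2/3.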

Step 5: Perform a batch/offline inference

Batch Transform: To get the inferences on an entire dataset offline, you run a batch transform job on a trained model. Batch transform automatically manages the processing of large datasets within the limits of specified parameters. When a batch transform job starts, SageMaker initializes ML compute instances and distributes the inference or preprocessing workload between them. Batch Transform partitions the Amazon S3 objects in the input by key and maps Amazon S3 objects to instances.

In this step, you will first identify sample users and prepare an input payload. Then you will run a batch transform job, and finally, you will look at the recommendations for the sample users.

Step 5.1: Identify a customer and understand their purchase history

Let us identify sample customers who have purchased three separate items of various kinds.

[ ]:
# Take Customer ID: 13085 for our analysis from the original dataset
sample_customer_id = 13085
# Let us present the top 10 original purchase history for this customer.
display_original_purchase_history(online_retail_data, sample_customer_id, 10)

As you can see, this customer likes purchasing many different kinds of Lights, Doormats, and Bowls. Let us build a request payload for this customer and examine what new items the trained model would recommend.

[ ]:
# Build a local work dir where you would create the batch transform request file in a JSON format
# Populate the requested file with the preceding example users.
with open(batch_request_data_file, "w") as outfile:
    json.dump({"user_id": str(sample_customer_id), "top_n": "10"}, outfile)
[ ]:
# Print the head of the payload file
!head {batch_request_data_file}

Step 5.2: Upload the payload to Amazon S3 and run a batch transform job

In this section, you will upload the data to S3 and run a batch transform job. The estimated wall time for the transform job is around ~6min 30s.

[ ]:
# Build Transform Job Name
timestamp = time.strftime("-%Y-%m-%d-%H-%M-%S", time.gmtime())
transform_job_name = transform_job_name_prefix + timestamp
transform_job_inference_path = "{}/{}".format(s3_batch_input_dir_prefix, transform_job_name)
transform_job_inference_output = "s3://" + s3_bucket + "/" + transform_job_inference_path

# Upload the batch transform request JSON file to the Amazon S3 bucket
uploaded_batch_inference_request = sagemaker_session.upload_data(
    batch_request_data_file, s3_bucket, key_prefix=transform_job_inference_path
)
print("[INFO] S3 batch requests data location " + uploaded_batch_inference_request)
[ ]:
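# Build the Transformer Object with the parameters
print("[INFO] Starting the batch transform job: " + transform_job_name)
# Assumption: a single instance of the batch transform instance type defined in Step 2
transformer = estimator.transformer(
    1, compatible_batch_transform_instance_type, output_path=transform_job_inference_output
)
# Start the Transformer Job
transformer.transform(uploaded_batch_inference_request, job_name=transform_job_name)

# Wait until the job completes
transformer.wait()
print(
    "[INFO] The batch transform job has been completed, and the output has been saved to : "
    + transformer.output_path
)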

Next, let us examine the Batch Transform output in S3.

[ ]:
parsed_url = urlparse(transformer.output_path)
bucket_name = parsed_url.netloc
file_key = "{}/{}.out".format(parsed_url.path[1:], s3_batch_request_file_name)

response = boto_s3_client.get_object(Bucket=bucket_name, Key=file_key)
s3_response_bytes = response["Body"].read().decode("utf-8")
print(s3_response_bytes)


As you can see, the inference output includes not only the User ID and the recommended Item IDs but also a ranking score for each item, ordered by relevance to this user.
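
The cell above locates the output object by splitting the transformer's S3 output path into a bucket and a key, then appending `.out` to the request file name. The sketch below isolates that step so it can be tried without AWS access; `split_s3_uri`, the bucket name, and the job name are hypothetical.

```python
from urllib.parse import urlparse


def split_s3_uri(s3_uri):
    """Split an s3://bucket/prefix URI into (bucket, prefix)."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("Not an S3 URI: " + s3_uri)
    # The path starts with "/", which is not part of the object key.
    return parsed.netloc, parsed.path.lstrip("/")


bucket, prefix = split_s3_uri("s3://example-bucket/sagemaker/implicit-bpr/batch-inference/jobs/job-1")
output_key = "{}/{}.out".format(prefix, "recommendation.requests")
print(bucket)
print(output_key)
```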

Step 5.3: Join the result with the stock lookup table to associate the item information

[ ]:
# Read the Batch transform response and create a Panda's Dataframe for more useful visualization.
batch_inference_response = StringIO(s3_response_bytes)

recommendations_df = pd.read_csv(
    batch_inference_response, header=None, names=["CustomerID", "StockCode", "Recommendation Score"]
)

# Model inference result to associate and endorse our predictions
display_inference_result(recommendations_df, sample_customer_id, "batch")

You can see that the model predicted items that this customer would likely purchase next, such as additional Lights, Bathroom Curtain Sets, and Cups.

Step 6: Deploy the model and perform a real-time inference

Step 6.1: Deploy an endpoint

The *estimator.deploy* method creates the deployable model, configures the SageMaker hosting services endpoint, and launches the endpoint to host the model. The estimated wall time for the below cell is around ~7min 25s.

[ ]:
# Creates the deployable model, configures the SageMaker hosting services endpoint, and launches the endpoint to host the model.
predictor = estimator.deploy(
    1, compatible_real_time_inference_instance_type, serializer=json_serializer
)
print("[INFO] The model endpoint has been deployed successfully")

Step 6.2: Take the example user, create the JSON payload and make an inference request

Let’s take another customer for our analysis from the original dataset and make an online inference request. You will be able to see the JSON response received from SageMaker Model Endpoint.

[ ]:
example_customer_id = 17519
# Let us present the top 10 original purchase history for this customer.
display_original_purchase_history(online_retail_data, example_customer_id, 10)

As you can see, this customer prefers purchasing event-related decorative items such as Paper dollies, Banners, and Assorted items. Let us build a request payload for this customer and review what new items the deployed model would recommend.

[ ]:
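response_dict = ""
top_n = 10

# Build the JSON Inference request
json_inference_request = {"user_id": str(example_customer_id), "top_n": str(top_n)}
# Make an Inference Request to the deployed Endpoint
# Assumption: the endpoint name comes from the predictor and the payload is sent as JSON
response = boto3_sm_runtime.invoke_endpoint(
    EndpointName=predictor.endpoint,
    ContentType="application/json",
    Body=json.dumps(json_inference_request),
)
inference_response_body = response["Body"].read().decode("utf-8")
# json.loads parses the response as data; eval would execute it as Python code
response_dict = json.loads(inference_response_body)
print("[INFO] JSON Response received from SageMaker Model Endpoint: ")
print(response_dict)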
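
As a small offline illustration of handling such a response, the sketch below parses a JSON body with `json.loads` and orders the records by score. The sample body is made up, and its shape (a list of objects with `user_id`, `item_id`, and `score` keys) is an assumption based on the column names used elsewhere in this notebook; `parse_recommendations` is a hypothetical helper.

```python
import json


def parse_recommendations(response_body):
    """Parse a JSON response body into a list of records sorted by descending score."""
    records = json.loads(response_body)
    return sorted(records, key=lambda record: record["score"], reverse=True)


# Made-up response body with the assumed shape.
sample_body = (
    '[{"user_id": "17519", "item_id": "A", "score": 0.4},'
    ' {"user_id": "17519", "item_id": "B", "score": 0.9}]'
)
ranked = parse_recommendations(sample_body)
print([record["item_id"] for record in ranked])
```

Parsing with `json.loads` treats the response strictly as data, which is safer than evaluating text returned over the network.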

Step 6.3: Join the result with the stock lookup table to associate the item information

[ ]:
if len(response_dict) > 0:
    online_response_df = pd.read_json(json.dumps(response_dict))
    # Model inference result to associate and endorse our predictions
    display_inference_result(online_response_df, example_customer_id, "realtime")
else:
    print(
        "[INFO] No response received for the request with Real-Time Inference for CustomerID {}.".format(
            example_customer_id
        )
    )

You can see that the model predicted items that this customer would likely purchase next, such as Hanging tags, Cake stands, Garlands, Gift Tags, and various Birthday signs.

Step 7: Cleaning up the Resources

To avoid incurring unnecessary costs, delete the resources you created, such as deployed Amazon SageMaker Model endpoint and the deployed model, downloaded external datasets, and temporary ones made on this Notebook.
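
A minimal sketch of such a clean-up is shown below. It assumes the predictor object exposes `delete_model()` and `delete_endpoint()`, as predictors in the SageMaker Python SDK do; `clean_up` and `_StubPredictor` are hypothetical names, and the stub exists only so the sketch can run without AWS access. In the notebook you would pass the real `predictor` and the local data directory instead.

```python
import os
import shutil
import tempfile


def clean_up(predictor, local_dirs):
    """Delete the hosted model and endpoint, then remove local working directories."""
    # Delete the model first, while the endpoint configuration still references it.
    predictor.delete_model()
    predictor.delete_endpoint()
    for path in local_dirs:
        shutil.rmtree(path, ignore_errors=True)


class _StubPredictor:
    """Stand-in that records calls; replace with the real predictor in the notebook."""

    def __init__(self):
        self.calls = []

    def delete_model(self):
        self.calls.append("delete_model")

    def delete_endpoint(self):
        self.calls.append("delete_endpoint")


stub = _StubPredictor()
scratch_dir = tempfile.mkdtemp()
clean_up(stub, [scratch_dir])
print(stub.calls, os.path.exists(scratch_dir))
```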


Unsubscribe the product from AWS Marketplace (optional)

Lastly, if the AWS Marketplace subscription was created just for this experiment and you would like to unsubscribe from the product, you can follow the steps below. Before you cancel the subscription, ensure that you do not have any deployable model created from the model package or using the algorithm. Note: You can find this information by looking at the container name associated with the model.

  1. Navigate to the Machine Learning tab on the **Your Software subscriptions** page.

  2. Locate the listing whose subscription you want to cancel, and then click **Cancel subscription**.