Solving Multi-Period Newsvendor Problem with Amazon SageMaker RL

This notebook shows how to use reinforcement learning to solve a version of the online stochastic Newsvendor problem. This problem is well studied in inventory management: one must make an ordering decision (how much of an item to purchase from a supplier) to cover a single period of uncertain demand. The objective is to trade off the various costs incurred and revenues achieved during the period, usually consisting of sales revenue, purchasing and holding costs, loss of goodwill in the case of missed sales, and the terminal salvage value of unsold items.


Problem Statement

The case considered here is stationary, single-product and multi-period, with vendor lead time (VLT) and stochastic demand. The VLT \(l\) refers to the number of time steps between the placement and receipt of an order. When formulated as a Markov Decision Process (MDP), the RL agent is aware of the following information at each time step:

  • Mean demand

  • Item purchase cost and selling price

  • Lost-sale penalty for each unit of unmet demand, and holding cost for any units left over at the end of a period

  • On-hand inventory and the units to be received within the next VLT periods

At each time step, the agent takes a continuous action: the size of the order to place, which arrives \(l\) time periods later. The reward is then calculated as the difference between the revenue from sales and the cost of buying and storing inventory.
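As a rough illustration of these dynamics, the sketch below advances a single period: the oldest pipeline order arrives, demand is served from the available stock, and the reward is sales revenue minus purchasing, holding, and lost-sale costs. The function name `step_period`, its parameters, the order of events within a period, and charging the purchase cost at ordering time are all assumptions made for illustration; the NewsVendorGymEnvironment class may differ. The sketch also assumes a VLT of at least one period.

```python
from typing import List, Tuple


def step_period(
    on_hand: float,
    pipeline: List[float],
    order_qty: float,
    demand: float,
    price: float,
    cost: float,
    holding_cost: float,
    lost_sale_penalty: float,
) -> Tuple[float, List[float], float]:
    """Advance one period and return (new_on_hand, new_pipeline, reward).

    pipeline[0] is the order arriving now, so len(pipeline) equals the VLT.
    """
    # The order placed len(pipeline) periods ago arrives and becomes available.
    available = on_hand + pipeline[0]

    # Serve as much demand as the available stock allows.
    sales = min(available, demand)
    unmet = demand - sales
    leftover = available - sales

    # Reward: revenue minus purchasing, holding, and lost-sale costs.
    reward = (
        price * sales
        - cost * order_qty
        - holding_cost * leftover
        - lost_sale_penalty * unmet
    )

    # The new order joins the end of the pipeline.
    new_pipeline = pipeline[1:] + [order_qty]
    return leftover, new_pipeline, reward


# Hypothetical numbers: VLT of 2, two units on hand, three arriving now.
new_on_hand, new_pipeline, reward = step_period(
    on_hand=2.0, pipeline=[3.0, 0.0], order_qty=4.0, demand=6.0,
    price=10.0, cost=4.0, holding_cost=1.0, lost_sale_penalty=2.0,
)
print(new_on_hand, new_pipeline, reward)
```

Keeping the pipeline as a fixed-length list makes the lead time explicit: an order appended now reaches index 0 after `len(pipeline)` calls.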

The time horizon is 40 steps. You can see the specifics in the NewsVendorGymEnvironment class. A normalized version (NewsVendorGymEnvironmentNormalized) of this problem is used in this notebook.

Using Amazon SageMaker RL

Amazon SageMaker RL allows you to train your RL agents on cloud machines using Docker containers. You do not have to worry about setting up your machines with the RL toolkits and deep learning frameworks. You can easily switch between many different machines set up for you, including powerful GPU machines that give a big speedup. You can also choose to use multiple machines in a cluster to further speed up training, which is often necessary for production-level loads.


Roles and permissions

To get started, we’ll import the Python libraries we need and set up the environment with a few prerequisites for permissions and configurations.

[ ]:
import sagemaker
import boto3
import sys
import os
import glob
import re
import subprocess
from IPython.display import HTML
import time
from time import gmtime, strftime

from misc import get_execution_role, wait_for_s3_object, wait_for_training_job_to_complete
from sagemaker.rl import RLEstimator, RLToolkit, RLFramework

Set up S3 bucket

Set up the linkage and authentication to the S3 bucket that you want to use for checkpoints and metadata.

[ ]:
sage_session = sagemaker.session.Session()
s3_bucket = sage_session.default_bucket()
s3_output_path = "s3://{}/".format(s3_bucket)
print("S3 bucket path: {}".format(s3_output_path))

Define Variables

We define variables such as the job prefix for the training jobs and the image path for the container (only when this is BYOC).

[ ]:
# create a descriptive job name
job_name_prefix = "rl-newsvendor"

framework = "tf"

Configure where training happens

You can run your RL training jobs from a SageMaker notebook instance or a local notebook instance. In both of these scenarios, you can run the following in either local or SageMaker mode. Local mode uses the SageMaker Python SDK to run your code in a local container before deploying to SageMaker. This can speed up iterative testing and debugging while using the same familiar Python SDK interface. You just need to set local_mode = True. When setting local_mode = False, you can choose the instance type from the available ml instances.

[ ]:
local_mode = False

if local_mode:
    instance_type = "local"
else:
    # If on SageMaker, pick the instance type
    instance_type = "ml.m5.large"

Create an IAM role

When running from a SageMaker notebook instance, get the execution role with role = sagemaker.get_execution_role(). When running from a local notebook instance, use the utility method role = get_execution_role() to create an execution role.

[ ]:
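try:
    # Works when running on a SageMaker notebook instance
    role = sagemaker.get_execution_role()
except Exception:
    # Fall back to the utility method when running locally
    role = get_execution_role()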

print("Using IAM role arn: {}".format(role))

Install docker for local mode

To work in local mode, you need to have Docker installed. When running from your local machine, please make sure that you have docker and docker-compose (for local CPU machines) and nvidia-docker (for local GPU machines) installed. Alternatively, when running from a SageMaker notebook instance, you can simply run the following script to install dependencies.

Note, you can only run a single local notebook at one time.

[ ]:
# only run from SageMaker notebook instance
if local_mode:
    !/bin/bash ./common/

Set up the environment

The environment is defined in a Python file in the ./src directory. It implements the init(), step() and reset() functions that describe how the environment behaves. This is consistent with OpenAI Gym interfaces for defining an environment.

  • init() - initialize the environment in a pre-defined state

  • step() - take an action on the environment

  • reset() - restart the environment on a new episode

  • [if applicable] render() - get a rendered image of the environment in its current state
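The methods listed above can be sketched with a toy environment. This is a hypothetical stand-in written without the gym package so that it runs on its own; the class name `ToyInventoryEnv`, the observation layout, the demand distribution, and the reward are assumptions and do not reproduce the environment in ./src. Only the 40-step horizon is taken from the problem statement.

```python
import random
from typing import Dict, List, Tuple


class ToyInventoryEnv:
    """Hypothetical Gym-style environment: order stock, then face random demand."""

    def __init__(self, horizon: int = 40, seed: int = 0) -> None:
        # init(): fix the episode length and the random source.
        self.horizon = horizon
        self.rng = random.Random(seed)
        self.t = 0
        self.inventory = 0.0

    def reset(self) -> List[float]:
        # reset(): start a new episode and return the first observation.
        self.t = 0
        self.inventory = 0.0
        return [self.inventory]

    def step(self, action: float) -> Tuple[List[float], float, bool, Dict]:
        # step(): apply the order, draw demand, and return
        # (observation, reward, done, info) as in the classic Gym API.
        demand = self.rng.uniform(0.0, 10.0)
        available = self.inventory + max(action, 0.0)
        sales = min(available, demand)
        self.inventory = available - sales
        self.t += 1
        reward = sales - 0.5 * self.inventory  # toy reward: sales minus holding
        done = self.t >= self.horizon
        return [self.inventory], reward, done, {}


# Roll out one episode with a constant (hypothetical) order size.
env = ToyInventoryEnv()
obs = env.reset()
done = False
steps = 0
while not done:
    obs, reward, done, info = env.step(5.0)
    steps += 1
print("episode length:", steps)
```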

[ ]:
# uncomment the following line to see the environment
# !pygmentize src/

Write the training code

The training code is written in a file that is also uploaded to the /src directory. It first imports the environment files and the preset files, and then defines the main() function.

[ ]:
!pygmentize src/

Train the RL model using the Python SDK Script mode

If you are using local mode, the training will run on the notebook instance. When using SageMaker for training, you can select a GPU or CPU instance. The RLEstimator is used for training RL jobs.

  1. Specify the source directory where the gym environment and training code are uploaded.

  2. Specify the entry point as the training code.

  3. Specify the choice of RL toolkit and framework. This automatically resolves to the ECR path for the RL Container.

  4. Define the training parameters such as the instance count, job name, and S3 path for output.

  5. Specify the hyperparameters for the RL agent algorithm. The RLCOACH_PRESET or the RLRAY_PRESET can be used to specify the RL agent algorithm you want to use.

  6. Define the metrics definitions that you are interested in capturing in your logs. These can also be visualized in CloudWatch and SageMaker Notebooks.

Define Metric

A list of dictionaries that defines the metric(s) used to evaluate the training jobs. Each dictionary contains two keys: ‘Name’ for the name of the metric, and ‘Regex’ for the regular expression used to extract the metric from the logs.

[ ]:
metric_definitions = [
    {
        "Name": "episode_reward_mean",
        "Regex": "episode_reward_mean: ([-+]?[0-9]*\\.?[0-9]+([eE][-+]?[0-9]+)?)",
    },
    {
        "Name": "episode_reward_max",
        "Regex": "episode_reward_max: ([-+]?[0-9]*\\.?[0-9]+([eE][-+]?[0-9]+)?)",
    },
    {
        "Name": "episode_len_mean",
        "Regex": "episode_len_mean: ([-+]?[0-9]*\\.?[0-9]+([eE][-+]?[0-9]+)?)",
    },
    {"Name": "entropy", "Regex": "entropy: ([-+]?[0-9]*\\.?[0-9]+([eE][-+]?[0-9]+)?)"},
    {
        "Name": "episode_reward_min",
        "Regex": "episode_reward_min: ([-+]?[0-9]*\\.?[0-9]+([eE][-+]?[0-9]+)?)",
    },
    {"Name": "vf_loss", "Regex": "vf_loss: ([-+]?[0-9]*\\.?[0-9]+([eE][-+]?[0-9]+)?)"},
    {"Name": "policy_loss", "Regex": "policy_loss: ([-+]?[0-9]*\\.?[0-9]+([eE][-+]?[0-9]+)?)"},
]
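To see what one of these regular expressions captures, the sketch below applies the episode_reward_mean pattern to a few sample lines with Python's re module. The log lines and the helper `extract_metric` are hypothetical and written only for illustration; real training logs may be formatted differently. The first capture group holds the numeric value, including an optional sign and exponent.

```python
import re
from typing import List

# Same pattern as the "episode_reward_mean" entry, written as a raw string.
pattern = r"episode_reward_mean: ([-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?)"

# Hypothetical log lines, not taken from an actual training job.
log_lines = [
    "episode_reward_mean: -12.5",
    "  episode_reward_mean: 3.2e+01",
    "episode_len_mean: 40.0",
]


def extract_metric(metric_pattern: str, lines: List[str]) -> List[float]:
    """Return the value captured by group 1 for every line that matches."""
    values = []
    for line in lines:
        match = re.search(metric_pattern, line)
        if match:
            values.append(float(match.group(1)))
    return values


values = extract_metric(pattern, log_lines)
print(values)
```

The third line is skipped because it reports a different metric, which is why each metric gets its own pattern.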

Define Estimator

This Estimator executes an RLEstimator script in a managed Reinforcement Learning (RL) execution environment within a SageMaker Training Job. The managed RL environment is an Amazon-built Docker container that executes functions defined in the supplied entry_point Python script.

[ ]:
train_entry_point = ""
train_job_max_duration_in_seconds = (
    60 * 15
)  # 15 mins to make sure TrainingJobAnalytics shows at least two points

estimator = RLEstimator(
[ ]:
job_name = estimator.latest_training_job.job_name
print("Training job: %s" % job_name)


RL training can take a long time, so there are several ways to track the progress of a running training job. Some intermediate output is saved to S3 during training, so we’ll set up to capture that.

[ ]:
s3_url = "s3://{}/{}".format(s3_bucket, job_name)

intermediate_folder_key = "{}/output/intermediate/".format(job_name)
intermediate_url = "s3://{}/{}training/".format(s3_bucket, intermediate_folder_key)

print("S3 job path: {}".format(s3_url))
print("Intermediate folder path: {}".format(intermediate_url))

Plot metrics for training job

We can see the reward metric of the training as it’s running, using algorithm metrics that are recorded in CloudWatch metrics. We can plot this to see the performance of the model over time.

[ ]:
%matplotlib inline
from sagemaker.analytics import TrainingJobAnalytics
[ ]:
if not local_mode:
    wait_for_training_job_to_complete(job_name)  # Wait for the job to finish
    df = TrainingJobAnalytics(job_name, ["episode_reward_mean"]).dataframe()
    df_min = TrainingJobAnalytics(job_name, ["episode_reward_min"]).dataframe()
    df_max = TrainingJobAnalytics(job_name, ["episode_reward_max"]).dataframe()
    df["rl_reward_mean"] = df["value"]
    df["rl_reward_min"] = df_min["value"]
    df["rl_reward_max"] = df_max["value"]
    num_metrics = len(df)

    if num_metrics == 0:
        print("No algorithm metrics found in CloudWatch")
    else:
        # x and y are inferred from the fill_between call and axis labels below.
        # df.plot returns a matplotlib Axes object.
        plt = df.plot(
            x="timestamp",
            y=["rl_reward_mean"],
            figsize=(18, 6),
            color=["b", "r", "g"],
        )
        # Shade the band between the minimum and maximum episode reward
        plt.fill_between(df.timestamp, df.rl_reward_min, df.rl_reward_max, color="b", alpha=0.2)
        plt.set_ylabel("Mean reward per episode", fontsize=20)
        plt.set_xlabel("Training time (s)", fontsize=20)
        plt.legend(loc=4, prop={"size": 20})
else:
    print("Can't plot metrics in local mode.")

Monitor training progress

You can repeatedly run the visualization cells to get the latest metrics as the training job proceeds.

Training Results

You can let the training job run longer by specifying train_max_run in RLEstimator. The figure below illustrates the reward function of the RL policy vs. that of Critical Ratio, a classic heuristic. The experiments are conducted on a p3.8x instance. For more details on the environment setup and how different parameters are set, please refer to ORL: Reinforcement Learning Benchmarks for Online Stochastic Optimization Problems.
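For context, the textbook single-period critical ratio (also called the critical fractile) balances the cost of ordering one unit too few, \(c_u\), against the cost of ordering one unit too many, \(c_o\), and orders up to the demand quantile \(c_u / (c_u + c_o)\). The sketch below assumes normally distributed demand, zero salvage value, and hypothetical cost figures; the function name `critical_ratio_order` is also an assumption. It illustrates the classic heuristic and is not necessarily the exact baseline used in the referenced benchmark.

```python
from statistics import NormalDist


def critical_ratio_order(
    mean: float, std: float, underage_cost: float, overage_cost: float
) -> float:
    """Order quantity at the critical-ratio quantile of Normal(mean, std) demand."""
    ratio = underage_cost / (underage_cost + overage_cost)
    return NormalDist(mu=mean, sigma=std).inv_cdf(ratio)


# Hypothetical per-unit figures (assumptions, not taken from the notebook).
price, cost, lost_sale_penalty, holding_cost = 10.0, 4.0, 2.0, 1.0

# One unit too few: lose the margin and pay the lost-sale penalty.
underage = price - cost + lost_sale_penalty
# One unit too many: the purchase cost is sunk and the unit incurs holding cost.
overage = cost + holding_cost

order_qty = critical_ratio_order(
    mean=100.0, std=20.0, underage_cost=underage, overage_cost=overage
)
print(round(order_qty, 2))
```

Because the underage cost exceeds the overage cost in this example, the ratio is above one half and the heuristic orders more than the mean demand.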