# Stock Trading with Amazon SageMaker RL

In this notebook, we apply the deep Q-network method to train an agent that trades a single share to maximize profit. The goal is to demonstrate how to go beyond the Atari games and apply RL to a different practical domain. Based on the setup in chapter 8 of [1], we use a historical share price series at one-minute intervals and apply a Double DQN architecture to accommodate a simple set of discrete trading actions: do nothing, buy a single share, and close the position. The custom environment is built with OpenAI Gym, and the RL agents are trained using Amazon SageMaker.

[1] Maxim Lapan. “Deep Reinforcement Learning Hands-On.” Packt (2018).
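
As a reminder of what distinguishes Double DQN from plain DQN, the following is a minimal sketch of the target computation: the online network selects the next action and the target network evaluates it. The function name, the default discount and the Q-values are illustrative assumptions; the actual agent is configured through the Coach preset shown later.

```python
def double_dqn_target(reward, done, next_q_online, next_q_target, gamma=0.99):
    """Double DQN target: the online net picks the action, the target net scores it."""
    if done:
        return reward
    # Index of the action with the highest online Q-value in the next state
    best_action = max(range(len(next_q_online)), key=next_q_online.__getitem__)
    return reward + gamma * next_q_target[best_action]


# Made-up Q-values for the three actions (do nothing, buy, close)
target = double_dqn_target(
    reward=0.5,
    done=False,
    next_q_online=[0.1, 0.7, 0.3],
    next_q_target=[0.2, 0.4, 0.9],
)
```

Plain DQN would instead take the maximum of the target network's own values, which tends to overestimate action values.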

## Problem Statement

The RL problem for stock trading can be defined as:

1. Objective: The portfolio consists of a single stock. The goal is to teach an agent the best time to buy a single share and then close the position so as to maximize profit.

2. Environment: Custom developed environment using Gym.

3. State: Single vector, which includes prices (open, high, low, close) and two numbers indicating the presence of a bought share and position profit (profit or loss we currently have from our open position). For more details, please refer to [1].

4. Action: One of three discrete actions. Do nothing: skip the step without taking action. Buy a share: if the agent already holds the share, nothing is bought; otherwise it pays a commission, which is a small percentage of the current price. Close the position: if no share was previously bought, nothing happens; otherwise it pays the commission for the trade.

5. Reward: The agent receives a reward after each step. If a new position is entered or an existing position is closed, the reward equals $$100 \times \frac{(y_t - y_{t-1})}{y_{t-1}}$$ minus the commission, where $$y_t$$ is the closing price in period $$t$$. If a position continues to exist, the reward equals $$100 \times \frac{(y_t - y_{t-1})}{y_{t-1}}$$. If you set reward_on_close=True in the custom environment file, the agent receives a reward only once, after closing the position.
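
The per-step reward rule above can be sketched as a small function. This is only an illustration: the function name and the commission value of 0.1 (in the same percentage units as the reward) are assumptions, and the environment file remains the authoritative definition.

```python
def step_reward(prev_close, close, traded, commission_pct=0.1):
    """Per-step reward: 100 * relative close-price change, minus commission on a trade.

    commission_pct is a hypothetical value in the same percentage units as the reward.
    """
    pct_change = 100.0 * (close - prev_close) / prev_close
    return pct_change - commission_pct if traded else pct_change


# Holding an open position while the close moves from 100.0 to 101.0
hold = step_reward(100.0, 101.0, traded=False)
# Opening or closing a position on the same price move
trade = step_reward(100.0, 101.0, traded=True)
```

With these inputs, holding yields a reward of about 1.0 and trading about 0.9, so the commission is the only difference between the two cases.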

## Dataset

In this notebook, we use the dataset by Maxim Lapan. It contains the historic one-minute interval price series of Yandex company stock, including date, time, open, high, low, close and volume from 2016-01-01 to 2016-12-31.
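
To make the shape of the data concrete, here is a minimal sketch that parses one-minute bars with the fields listed above. The column header names and the two rows are assumptions made for illustration; they are not taken from the actual file, so adjust them to match the dataset you download.

```python
import csv
import io

# Two synthetic rows for illustration only; they are not real quotes.
SAMPLE = """date,time,open,high,low,close,volume
20160104,100100,100.0,100.5,99.8,100.2,1200
20160104,100200,100.2,100.9,100.1,100.7,900
"""

NUMERIC_FIELDS = ("open", "high", "low", "close", "volume")


def load_bars(text):
    """Parse CSV text into a list of dicts with numeric prices and volume."""
    bars = []
    for row in csv.DictReader(io.StringIO(text)):
        bar = {"date": row["date"], "time": row["time"]}
        bar.update({name: float(row[name]) for name in NUMERIC_FIELDS})
        bars.append(bar)
    return bars


bars = load_bars(SAMPLE)
closes = [bar["close"] for bar in bars]
```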

### Dataset License

This dataset is licensed under the MIT License.

Copyright (c) 2019 Packt

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

## Using Amazon SageMaker for RL

Amazon SageMaker allows you to train your RL agents on cloud machines using Docker containers. You do not have to worry about setting up your machines with the RL toolkits and deep learning frameworks. You can easily switch between many different machines that are set up for you, including powerful GPU machines that give a big speedup. You can also choose to use multiple machines in a cluster to further speed up training, which is often necessary for production-level loads.

## Pre-requisites

### Roles and permissions

To get started, we’ll import the Python libraries we need and set up the environment with a few prerequisites for permissions and configurations.

[ ]:

import sagemaker
import boto3
import sys
import os
import glob
import re
import subprocess
from IPython.display import HTML
import time
from time import gmtime, strftime

sys.path.append("common")
from misc import get_execution_role, wait_for_s3_object
from sagemaker.rl import RLEstimator, RLToolkit, RLFramework


### Set up S3 buckets

Set up the linkage and authentication to the S3 bucket that you want to use for checkpoints and metadata.

[ ]:

sage_session = sagemaker.session.Session()
s3_bucket = sage_session.default_bucket()
s3_output_path = "s3://{}/".format(s3_bucket)
print("S3 bucket path: {}".format(s3_output_path))


### Define Variables

We define variables such as the job prefix for the training jobs.

[ ]:

# prefix for the training job names
job_name_prefix = "rl-stock-trading"


### Configure settings

You can run your RL training jobs on a SageMaker notebook instance or on your own machine. In both of these scenarios, you can run the following in either local or SageMaker modes. The local mode uses the SageMaker Python SDK to run your code in a local container before deploying to SageMaker. This can speed up iterative testing and debugging while using the same familiar Python SDK interface. You just need to set local_mode = True.

[ ]:

# run in local mode?
local_mode = False


### Create an IAM role

When running from a SageMaker notebook, get the execution role with role = sagemaker.get_execution_role(). When running from a local machine, use the utils method role = get_execution_role() to create an execution role.

[ ]:

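try:
    # Works when running inside a SageMaker notebook
    role = sagemaker.get_execution_role()
except Exception:
    # Fall back to the utils helper when running locally
    role = get_execution_role()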

print("Using IAM role arn: {}".format(role))


### Install docker for local mode

To work in local mode, you need to have Docker installed. When running from your local machine, please make sure that you have docker or docker-compose (for local CPU machines) and nvidia-docker (for local GPU machines) installed. Alternatively, when running from a SageMaker notebook instance, you can simply run the following script to install the dependencies.

Note: you can only run a single local notebook at a time.

[ ]:

# Run on SageMaker notebook instance
if local_mode:
    !/bin/bash ./common/setup.sh


## Set up the environment

The environment is defined in a Python file called trading_env.py, which is located in the /src directory.

The environment also implements the __init__(), step() and reset() functions that describe how the environment behaves. This is consistent with the OpenAI Gym interface for defining an environment.

1. __init__() - initialize the environment in a pre-defined state

2. step() - take an action on the environment

3. reset() - restart the environment on a new episode

4. [if applicable] render() - get a rendered image of the environment in its current state
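
The interface above can be sketched with a tiny, dependency-free class. This is a simplified illustration rather than the contents of the environment file: the class name, the close-only observation, the commission value and the zero reward while no position is held are all assumptions, and a real environment would subclass gym.Env and declare its action and observation spaces.

```python
class TinyTradingEnv:
    """Dependency-free sketch of a Gym-style environment for trading one share."""

    SKIP, BUY, CLOSE = 0, 1, 2

    def __init__(self, closes, commission_pct=0.1):
        self.closes = list(closes)
        self.commission_pct = commission_pct  # hypothetical commission, in percent
        self.reset()

    def reset(self):
        """Start a new episode at the second bar so a previous close exists."""
        self.t = 1
        self.have_position = False
        self.open_price = 0.0
        return self._observation()

    def _observation(self):
        close = self.closes[self.t]
        profit = (close - self.open_price) / self.open_price if self.have_position else 0.0
        return [close, float(self.have_position), profit]

    def step(self, action):
        """Apply the action at the current bar and return (obs, reward, done, info)."""
        prev_close, close = self.closes[self.t - 1], self.closes[self.t]
        change = 100.0 * (close - prev_close) / prev_close
        reward = 0.0  # assumption: no reward while no position is held
        if action == self.BUY and not self.have_position:
            self.have_position, self.open_price = True, close
            reward = change - self.commission_pct
        elif action == self.CLOSE and self.have_position:
            self.have_position, self.open_price = False, 0.0
            reward = change - self.commission_pct
        elif self.have_position:
            reward = change
        self.t += 1
        done = self.t >= len(self.closes) - 1
        return self._observation(), reward, done, {}


env = TinyTradingEnv([100.0, 101.0, 102.0, 103.0])
obs = env.reset()
obs, r_buy, done, _ = env.step(TinyTradingEnv.BUY)
obs, r_hold, done, _ = env.step(TinyTradingEnv.SKIP)
```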

[ ]:

!pygmentize src/trading_env.py


## Configure the presets for RL algorithm

The presets that configure the RL training jobs are defined in the preset-stock-trading-ddqn.py file, which is also located in the /src directory. Using the preset file, you can define agent parameters to select the specific agent algorithm. You can also set the environment parameters, define the schedule and visualization parameters, and define the graph manager. The schedule presets define the number of heat-up steps, periodic evaluation steps, and training steps between evaluations.

These can be overridden at runtime by specifying the RLCOACH_PRESET hyperparameter. Additionally, it can be used to define custom hyperparameters.
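
The estimator call later in this notebook passes dotted keys such as rl.agent_params.algorithm.discount. One way to picture how such a key could map onto a nested preset object is sketched below. The helper name and the stand-in object are hypothetical, and this is not the launcher's actual code.

```python
from types import SimpleNamespace


def apply_overrides(root, hyperparameters, prefix="rl."):
    """Set nested attributes on `root` for every key that starts with `prefix`."""
    for key, value in hyperparameters.items():
        if not key.startswith(prefix):
            continue  # e.g. RLCOACH_PRESET selects the preset, it is not an attribute
        *parents, leaf = key[len(prefix):].split(".")
        target = root
        for name in parents:
            target = getattr(target, name)
        setattr(target, leaf, value)
    return root


# Stand-in for a preset's graph manager; the real object comes from the preset file.
graph_manager = SimpleNamespace(
    agent_params=SimpleNamespace(algorithm=SimpleNamespace(discount=0.9))
)
apply_overrides(
    graph_manager,
    {
        "RLCOACH_PRESET": "preset-stock-trading-ddqn",
        "rl.agent_params.algorithm.discount": 0.99,
    },
)
```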

[ ]:

!pygmentize src/preset-stock-trading-ddqn.py


## Write the Training Code

The training code is written in the file “train-coach.py” which is uploaded in the /src directory. First import the environment files and the preset files, and then define the main() function.

[ ]:

!pygmentize src/train-coach.py


## Train the RL model using the Python SDK Script mode

If you are using local mode, the training will run on the notebook instance. When using SageMaker for training, you can select a GPU or CPU instance. The RLEstimator is used for training RL jobs.

1. Specify the source directory where the environment, presets and training code is uploaded.

2. Specify the entry point as the training code.

3. Specify the choice of RL toolkit and framework. This automatically resolves to the ECR path for the RL Container.

4. Define the training parameters such as the instance count, instance type, S3 path for output and job name.

5. Specify the hyperparameters for the RL agent algorithm. The RLCOACH_PRESET can be used to specify the RL agent algorithm you want to use.

6. [Optional] Choose the metrics that you are interested in capturing in your logs. These can also be visualized in CloudWatch and SageMaker Notebooks. The metrics are defined using regular expression matching.
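
For step 6, metric definitions are given as a list of Name/Regex pairs. The sketch below shows how such regular expressions would pull values out of a log line. The metric names, the patterns and the log line are made up for illustration; the real log format depends on the toolkit version.

```python
import re

# Hypothetical definitions in the Name/Regex shape used for metric definitions.
metric_definitions = [
    {"Name": "training_reward", "Regex": r"Training Reward=(-?[0-9.]+)"},
    {"Name": "episode", "Regex": r"Episode #=([0-9]+)"},
]

# Made-up log line; check your own job logs for the actual format.
log_line = "Training> Name=main_level/agent, Episode #=12, Training Reward=-3.5"

captured = {}
for metric in metric_definitions:
    match = re.search(metric["Regex"], log_line)
    if match:
        # The first capture group holds the metric value
        captured[metric["Name"]] = float(match.group(1))
```

A list of this shape can be passed to the estimator through its metric_definitions argument.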

[ ]:

if local_mode:
    instance_type = "local"
else:
    instance_type = "ml.m4.xlarge"

estimator = RLEstimator(
    source_dir="src",
    entry_point="train-coach.py",
    dependencies=["common/sagemaker_rl"],
    toolkit=RLToolkit.COACH,
    toolkit_version="0.11.0",
    framework=RLFramework.TENSORFLOW,
    role=role,
    instance_count=1,
    instance_type=instance_type,
    output_path=s3_output_path,
    base_job_name=job_name_prefix,
    hyperparameters={
        "RLCOACH_PRESET": "preset-stock-trading-ddqn",
        "rl.agent_params.algorithm.discount": 0.99,
        "rl.evaluation_steps:EnvironmentEpisodes": 5,
    },
)
# takes ~20min
estimator.fit()


## Store intermediate training output and model checkpoints

The output from the training job above is stored either in a local directory (local mode) or on S3 (SageMaker mode).

[ ]:

%%time

job_name = estimator._current_job_name
print("Job name: {}".format(job_name))

s3_url = "s3://{}/{}".format(s3_bucket, job_name)

if local_mode:
    output_tar_key = "{}/output.tar.gz".format(job_name)
else:
    output_tar_key = "{}/output/output.tar.gz".format(job_name)

intermediate_folder_key = "{}/output/intermediate/".format(job_name)
output_url = "s3://{}/{}".format(s3_bucket, output_tar_key)
intermediate_url = "s3://{}/{}".format(s3_bucket, intermediate_folder_key)

print("S3 job path: {}".format(s3_url))
print("Output.tar.gz location: {}".format(output_url))
print("Intermediate folder path: {}".format(intermediate_url))

tmp_dir = "/tmp/{}".format(job_name)
# Create the folder from Python so that rerunning the cell does not fail
os.makedirs(tmp_dir, exist_ok=True)
print("Create local folder {}".format(tmp_dir))

[ ]:

%%time

wait_for_s3_object(s3_bucket, output_tar_key, tmp_dir)

if not os.path.isfile("{}/output.tar.gz".format(tmp_dir)):
    raise FileNotFoundError("File output.tar.gz not found")
os.system("tar -xvzf {}/output.tar.gz -C {}".format(tmp_dir, tmp_dir))
if not local_mode:
    os.system("aws s3 cp --recursive {} {}".format(intermediate_url, tmp_dir))
if not os.path.isfile("{}/output.tar.gz".format(tmp_dir)):
    raise FileNotFoundError("File output.tar.gz not found")
os.system("tar -xvzf {}/output.tar.gz -C {}".format(tmp_dir, tmp_dir))
print("Copied output files to {}".format(tmp_dir))

if local_mode:
    checkpoint_dir = "{}/data/checkpoint".format(tmp_dir)
    info_dir = "{}/data/".format(tmp_dir)
else:
    checkpoint_dir = "{}/checkpoint".format(tmp_dir)
    info_dir = "{}/".format(tmp_dir)

print("Checkpoint directory {}".format(checkpoint_dir))
print("info directory {}".format(info_dir))


## Visualization

### Plot rate of learning

We can view the rewards during training using the code below. This visualization helps us understand how the performance of the model, represented by the reward, has improved over time. To keep the training time short, we restrict the number of training steps. If the reward is still below zero, try a larger number of training steps. The number of steps can be configured in the preset file.

[ ]:

%matplotlib inline
import pandas as pd

csv_file_name = "worker_0.simple_rl_graph.main_level.main_level.agent_0.csv"
key = os.path.join(intermediate_folder_key, csv_file_name)
wait_for_s3_object(s3_bucket, key, tmp_dir)

csv_file = "{}/{}".format(tmp_dir, csv_file_name)
df = pd.read_csv(csv_file)
df = df.dropna(subset=["Training Reward"])
x_axis = "Episode #"
y_axis = "Training Reward"

plt = df.plot(x=x_axis, y=y_axis, figsize=(12, 5), legend=True, style="b-")
plt.set_ylabel(y_axis)
plt.set_xlabel(x_axis);
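
Per-episode rewards are usually noisy, so a moving average can make the trend easier to read. The helper below is a minimal, dependency-free sketch with made-up reward values; the function name and window size are illustrative, and on the DataFrame above a rolling mean over the "Training Reward" column would be the more natural choice.

```python
def moving_average(values, window):
    """Return the mean of each full window of `window` consecutive values."""
    if window < 1:
        raise ValueError("window must be at least 1")
    averages = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            # Drop the value that just left the window
            running -= values[i - window]
        if i >= window - 1:
            averages.append(running / window)
    return averages


rewards = [-4.0, -2.0, 0.0, 2.0, 4.0]  # made-up per-episode rewards
smoothed = moving_average(rewards, window=3)
```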


## Load the checkpointed models for evaluation

Checkpointed data from the previously trained models will be passed on for evaluation / inference in the checkpoint channel. In local mode, we can simply use the local directory, whereas in the SageMaker mode, it needs to be moved to S3 first.

Because TensorFlow checkpoint files contain the absolute paths from when they were generated (see issue), we need to replace the absolute paths with relative paths. This is implemented within evaluate-coach.py.
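
The idea of the path rewrite can be sketched as follows. This is an illustration under assumptions, not the code in evaluate-coach.py: the function name and the example paths are made up, and only the layout of the checkpoint index file (quoted paths after model_checkpoint_path and all_model_checkpoint_paths) follows TensorFlow's format.

```python
import os
import re


def relativize_checkpoint_index(text):
    """Replace each quoted path in a TensorFlow `checkpoint` index with its basename."""

    def to_basename(match):
        return '{}: "{}"'.format(match.group(1), os.path.basename(match.group(2)))

    return re.sub(r'(\w+): "([^"]+)"', to_basename, text)


# Made-up index contents; real paths depend on where training ran.
index = (
    'model_checkpoint_path: "/opt/ml/output/checkpoint/5_Step-500.ckpt"\n'
    'all_model_checkpoint_paths: "/opt/ml/output/checkpoint/5_Step-500.ckpt"\n'
)
fixed = relativize_checkpoint_index(index)
```

After the rewrite, the index refers to files by name only, so the checkpoint directory can be moved or downloaded to a different location.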

[ ]:

%%time

if local_mode:
    checkpoint_path = "file://{}".format(checkpoint_dir)
    print("Local checkpoint file path: {}".format(checkpoint_path))
else:
    checkpoint_path = "s3://{}/{}/checkpoint/".format(s3_bucket, job_name)
    if not os.listdir(checkpoint_dir):
        raise FileNotFoundError("Checkpoint files not found under the path")
    os.system("aws s3 cp --recursive {} {}".format(checkpoint_dir, checkpoint_path))
    print("S3 checkpoint file path: {}".format(checkpoint_path))


### Run the evaluation step

Use the checkpointed model to run the evaluation step.

[ ]:

%%time

estimator_eval = RLEstimator(
    role=role,
    source_dir="src/",
    dependencies=["common/sagemaker_rl"],
    toolkit=RLToolkit.COACH,
    toolkit_version="0.11.0",
    framework=RLFramework.TENSORFLOW,
    entry_point="evaluate-coach.py",
    instance_count=1,
    instance_type=instance_type,
    hyperparameters={"evaluate_steps": 10000},
)
estimator_eval.fit({"checkpoint": checkpoint_path})


## Risk Disclaimer (for live-trading)

This notebook is for educational purposes only. Past trading performance does not guarantee future performance. The loss in trading can be substantial, and therefore investors should use all trading strategies at their own risk.
