Machine Learning on a TabText Dataframe




An Example Based on the Paycheck Protection Program

The Paycheck Protection Program (PPP) was created by the U.S. government to enable employers struggling with COVID-related business adversities to make payments to their employees. For more information, see the Paycheck Protection Program. In this example notebook, you’ll learn how to run a machine learning model on a sample of companies in the program over the first two quarters of 2020.

In this notebook, we take U.S. Securities and Exchange Commission (SEC) filing data from some of the companies that partook of the loans under this program. We demonstrate how to merge the SEC filing data (text data) with stock price data (tabular data) using the SageMaker JumpStart Industry Python SDK. The build_tabText class of the library helps merge text dataframes with numeric dataframes to create a multimodal dataframe for machine learning.

A subset of the list of tickers of firms that took PPP loans was obtained from the authors of the following paper:
- Balyuk, T., Prabhala, N. and Puri, M. (November 2020, revised June 2021), Indirect Costs of Government Aid and Intermediary Supply Effects: Lessons from the Paycheck Protection Program. NBER Working Paper No. w28114, Available at SSRN: https://ssrn.com/abstract=3735682.

Companies took funds under the PPP because they were financially constrained and needed to make payroll. However, some companies availed themselves of the loans even though they were not financially constrained. The market penalized borrowers when they revealed that they had taken PPP money, reading the disclosure as bad news even for companies with no credit-quality issues. Subsequently, the better companies returned the money they had borrowed in order to recoup their share value.

We are interested in seeing if an ML model is able to detect whether the text of filings of companies that returned PPP money is different from that of companies that retained PPP money.

Legal Disclaimer: This example notebook is for demonstrative purposes only. It is not financial advice and should not be relied on as financial or investment advice.

General Steps

This notebook takes the following steps:

  1. Read in over 400 tickers of companies that took the PPP loans.

  2. Read in the 10-K, 10-Q, and 8-K filings for all PPP tickers during Q1 and Q2 of 2020. Texts in the SEC filings are loaded using the smjsindustry.DataLoader class.

  3. Load synthetic time-series data of daily stock prices for the given tickers during Q1 and Q2 of 2020, and convert the prices to returns. The simulated data is generated to be correlated with the labels so that the exercise is meaningful; an analogous exercise with real data yields similar results.

  4. Merge text and tabular datasets using the smjsindustry.build_tabText API.

  5. Conduct machine learning analysis to obtain a baseline accuracy.

  6. Build an AutoGluon model to analyze how stock prices and the texts of the SEC filings relate to each company’s decision to accept or return the money. This notebook flags all filings of companies that returned the money with a 1 and the filings of companies that did not with a 0. A good fit to the data implies the model can distinguish, based on the text, the companies that returned PPP funding from those that did not.

  7. Deploy the endpoint for model inference.

  8. Test the endpoint.

Objective

The goal of this notebook is to investigate the companies that partook of funding from the Paycheck Protection Program (PPP) and to study how stock prices, returns, and text from the SEC forms are related to their decisions to return the money.

PPP participation is reported in each company’s 8-K, an SEC filing that is required when a public company experiences a material change in business conditions. In addition to the 8-K filings, 10-K and 10-Q filings, which present a comprehensive summary of a company’s financial performance, are also used as sources of input for this study. The stock data is synthetically generated to be correlated with the labels; you can repeat this exercise with actual data as needed.

Note: You can also access this notebook through SageMaker JumpStart that is executable on SageMaker Studio. For more information, see Amazon SageMaker JumpStart Industry in the Amazon SageMaker Developer Guide.

Kernel and SageMaker Setup

The recommended kernel is conda_python3.

Ensure that AutoGluon image information is available in the SageMaker Python SDK.

[ ]:
!pip install -q -U "sagemaker>=2.66"
[ ]:
import sagemaker

session = sagemaker.Session()
region = session.boto_region_name  # public property (instead of the private _region_name attribute)
bucket = session.default_bucket()
role = sagemaker.get_execution_role()
ppp_folder = "jumpstart_industry_ppp"
train_instance_type = "ml.c5.2xlarge"
inference_instance_type = "ml.m5.xlarge"

Load Data, SDK, and Dependencies

The following code cells download the smjsindustry SDK, its dependencies, and the dataset from an Amazon S3 bucket prepared by SageMaker JumpStart Industry. You will learn how to use the smjsindustry SDK, which contains various APIs to curate SEC datasets. The SEC filing data in this example was curated using the smjsindustry package’s SEC Forms Retrieval tool, and the stock price data was synthetically generated.

[ ]:
notebook_artifact_bucket = f"jumpstart-cache-prod-{region}"
notebook_data_prefix = "smfinance-notebook-data/ppp"
notebook_sdk_prefix = "smfinance-notebook-dependency/smjsindustry"
[ ]:
# Download dataset
data_bucket = f"s3://{notebook_artifact_bucket}/{notebook_data_prefix}"
!aws s3 sync $data_bucket ./ --exclude "*" --include "*.csv" --include "*.xlsx"

Install the smjsindustry package from the wheel artifact by running the following code block. Alternatively, you can install it with pip install smjsindustry==1.0.0.

[ ]:
# Install smjsindustry SDK
sdk_bucket = f"s3://{notebook_artifact_bucket}/{notebook_sdk_prefix}"
!aws s3 sync $sdk_bucket ./

!pip install --no-index smjsindustry-1.0.0-py3-none-any.whl

Import Packages

[ ]:
import os
import boto3
import json
import pandas as pd
import tarfile
import sagemaker
from sklearn.model_selection import train_test_split
from sagemaker.estimator import Estimator
from smjsindustry import build_tabText
from sklearn.metrics import classification_report, confusion_matrix

from ag_model import (
    AutoGluonTraining,
    AutoGluonInferenceModel,
    AutoGluonTabularPredictor,
)

Step 1: Read in the Tickers

Over 400 tickers are used for this study.

[ ]:
%matplotlib inline
import numpy as np  # explicit import instead of the deprecated %pylab magic

ppp_tickers = pd.read_excel("ppp_tickers.xlsx", index_col=None, sheet_name=0, engine="openpyxl")
print("Number of PPP tickers =", ppp_tickers.shape[0])
ticker_list = list(set(ppp_tickers.ticker))
ppp_tickers.head()

Step 2: Read in the SEC Forms Filed by These Companies

  1. This notebook retrieves all 10-K, 10-Q, and 8-K forms from the SEC servers for Q1 and Q2 of 2020. This was done using the SageMaker JumpStart Industry Python SDK’s DataLoader class; a usage sketch follows the legal disclaimer below. For reference, the data loader process took around 30 minutes to curate a dataframe of over 4,000 filings.

  2. There is one 10-K/Q form per quarter. These are quarterly reports.

  3. There can be multiple 8-K forms per quarter, because these are filed for material changes in business conditions. Depending on how many such events there are, several 8-Ks might need to be filed. As you will see, this notebook retrieves more than one form per quarter.

  4. The dataset was stored in a CSV file named ppp_10kq_8k_data.csv (351 MB).

Legal Disclaimer: This example notebook uses data obtained from the SEC EDGAR database. You are responsible for complying with EDGAR’s access terms and conditions located in the Accessing EDGAR Data page.
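
For reference, the following cell sketches roughly how these filings could be retrieved with the smjsindustry DataLoader. It is shown for illustration only and does not need to be run, because the curated CSV file is already provided. The import paths and parameters follow the SDK documentation but may differ in the version you install; the email address and S3 output path are placeholders.

[ ]:
# Illustrative sketch only -- the curated ppp_10kq_8k_data.csv is already provided.
# Import paths and parameters follow the smjsindustry documentation and may differ by SDK version.
from smjsindustry.finance import DataLoader
from smjsindustry.finance.processor_config import EDGARDataSetConfig

dataset_config = EDGARDataSetConfig(
    tickers_or_ciks=ticker_list,                   # PPP tickers loaded in Step 1
    form_types=["10-K", "10-Q", "8-K"],
    filing_date_start="2020-01-01",
    filing_date_end="2020-06-30",
    email_as_user_agent="your-email@example.com",  # placeholder; required by SEC EDGAR fair-access policy
)

data_loader = DataLoader(
    role=role, instance_count=1, instance_type="ml.c5.2xlarge", sagemaker_session=session
)
data_loader.load(
    dataset_config,
    f"s3://{bucket}/{ppp_folder}/raw",             # placeholder output S3 path
    "ppp_10kq_8k_data.csv",
    wait=True,
    logs=True,
)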

[ ]:
%%time
# Text data
df_sec = pd.read_csv("ppp_10kq_8k_data.csv")
[ ]:
print("Number of SEC filings: ", df_sec.shape[0])

Step 3: Collect Stock Prices and Convert to Returns

  • Given the list of tickers, stock prices were synthetically generated using a geometric Brownian motion simulation; a minimal sketch of such a simulation follows this list. The prices are generated to be consistent with real market data. You can buy data for commercial use if needed.

  • Convert the stock prices to returns.

Some tickers might have been delisted since the time of the PPP program.
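
For illustration only, the following cell sketches how such prices could be simulated with geometric Brownian motion. It is not the code that produced ppp_stock_prices_synthetic.csv; the tickers, drift (mu), and volatility (sigma) below are arbitrary assumptions.

[ ]:
import numpy as np
import pandas as pd


def simulate_gbm_prices(tickers, n_days=126, s0=100.0, mu=0.05, sigma=0.3, seed=0):
    """Simulate daily prices S_t = S_0 * exp(cumsum((mu - sigma^2/2) dt + sigma sqrt(dt) Z))."""
    rng = np.random.default_rng(seed)
    dt = 1.0 / 252  # one trading day as a fraction of a year
    dates = pd.bdate_range("2020-01-02", periods=n_days)  # roughly Q1-Q2 of 2020
    prices = {}
    for ticker in tickers:
        z = rng.standard_normal(n_days)
        log_returns = (mu - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z
        prices[ticker] = s0 * np.exp(np.cumsum(log_returns))
    return pd.DataFrame(prices, index=dates).rename_axis("Date").reset_index()


# Hypothetical tickers only -- not the actual PPP ticker list
simulate_gbm_prices(["TICKER_A", "TICKER_B", "^GSPC"]).head()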

[ ]:
df_prices = pd.read_csv("ppp_stock_prices_synthetic.csv")
print("Total number of days for the stock time series: ", df_prices.shape[0])
print("Total number of stocks: ", df_prices.shape[1])
df_prices.head()

The following cells convert the stock prices into percentage returns:

  • A helper function, convert_price_to_return, converts prices to returns using the pct_change function.

  • The first row of returns, which is NaN after differencing, is dropped.

  • Any stock column that contains only NaN values is removed.
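
Before the conversion, here is a quick toy illustration of what pct_change computes (toy prices only, not the PPP data):

[ ]:
# r_t = (p_t - p_{t-1}) / p_{t-1}
import pandas as pd

pd.Series([100.0, 110.0, 99.0]).pct_change()
# 0         NaN
# 1    0.100000
# 2   -0.100000
# dtype: float64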

[ ]:
def convert_price_to_return(df_prices):
    ticker_list = list(df_prices.columns[1:])
    df_returns = df_prices[ticker_list].pct_change()  # not using fill_method='ffill'
    df_returns = pd.concat([df_prices.Date, df_returns], axis=1)[1:]  # drop first row as it is NaN
    df_returns = df_returns.reset_index(drop=True)
    return df_returns


df_returns = convert_price_to_return(df_prices)
df_returns.dropna(axis=1, how="all", inplace=True)  # drop columns that contain only NaN values
df_returns.set_index("Date", inplace=True)
print("Total number of stocks: ", len(df_returns.columns))  # Date is now the index; all remaining columns are tickers
df_returns.head()
[ ]:
df_returns.to_csv("ppp_returns.csv", index=True)

Step 4: Merge Text and Tabular Datasets

The stock returns and the SEC forms are saved in earlier code blocks into CSV files. In this step, you’ll learn how to read in the files and merge the text data with the tabular data.

  • Line up the returns from day -5 before the filing date to day +5 after the filing date. Including the return on the filing date itself, this gives 11 days of returns around the filing date. Three types of returns are considered here: Ret (the stock return), MktRet (the S&P 500 return), and NetRet (the difference between Ret and MktRet).

  • Merge the SEC text data and the tabular data with the build_tabText API. We need to see how returns evolve around the filing date.

[ ]:
%%time

df_returns = pd.read_csv("ppp_returns.csv")  # Tabular/numeric data
df_sec = pd.read_csv("ppp_10kq_8k_data.csv")  # Text data

Define helper functions that create the 3 types of returns for the 5 days before and the 5 days after the filing date. The functions fill in the returns for the ticker and the corresponding S&P 500 returns.

[ ]:
%%time


def fillReturn(df_returns, ticker, dt, displacement):
    if np.where(df_returns.columns == ticker)[0].size > 0:
        bwd = list(
            df_returns[ticker].loc[:dt][-(displacement + 1) :]
        )  # 5 days before filing plus filing date
        fwd = list(df_returns[ticker].loc[dt:][1 : (displacement + 1)])  # 5 days after filing
        if len(bwd) < displacement + 1:
            bwd = [np.nan] * (
                displacement + 1 - len(bwd)
            ) + bwd  # Add NaN at the beginning if less bwd
        if len(fwd) < displacement:
            fwd = fwd + [np.nan] * (displacement - len(fwd))  # Append NaN in the end if less fwd
        return bwd + fwd
    else:
        return [np.nan for idx in range(2 * displacement + 1)]


def create_df_5_days_return(df_returns):
    displace = 5
    cols = [
        "Date",
        "ticker",
        "Ret-5",
        "Ret-4",
        "Ret-3",
        "Ret-2",
        "Ret-1",
        "Ret0",
        "Ret1",
        "Ret2",
        "Ret3",
        "Ret4",
        "Ret5",
        "MktRet-5",
        "MktRet-4",
        "MktRet-3",
        "MktRet-2",
        "MktRet-1",
        "MktRet0",
        "MktRet1",
        "MktRet2",
        "MktRet3",
        "MktRet4",
        "MktRet5",
        "NetRet-5",
        "NetRet-4",
        "NetRet-3",
        "NetRet-2",
        "NetRet-1",
        "NetRet0",
        "NetRet1",
        "NetRet2",
        "NetRet3",
        "NetRet4",
        "NetRet5",
    ]
    df_trans_dict = {}
    idx = 0
    for ticker in df_returns.columns[1:]:
        for row in range(len(df_returns)):
            dt = df_returns.Date[row]
            rets = fillReturn(df_returns, ticker, dt, displace)
            mkt_rets = fillReturn(df_returns, "^GSPC", dt, displace)
            net_rets = [a - b for a, b in zip(rets, mkt_rets)]
            row_data = [dt, ticker] + rets + mkt_rets + net_rets
            df_trans_dict[idx] = row_data
            idx += 1
    df_returns_trans = pd.DataFrame.from_dict(df_trans_dict, orient="index", columns=cols)
    return df_returns_trans


df_returns_trans = create_df_5_days_return(df_returns)
pd.set_option("display.max_columns", 50)
df_returns_trans.head(5)

Create a TabText dataframe

The following code cell calls the smjsindustry.build_tabText API to create a multimodal TabText dataframe, merging the tabular data and the text data. Each input dataframe must have a date column and a common key column (ticker in this case) so that a time-series TabText dataset can be generated.

[ ]:
%%time
# Use build_tabText API to merge text and tabular datasets
tab_text = build_tabText(
    df_sec, "ticker", "filing_date", df_returns_trans, "ticker", "Date", freq="D"
)

tab_text.head()

Write the merged dataframe into a CSV file

[ ]:
tab_text.to_csv("ppp_10kq_8k_stock_data.csv", index=False)

Step 5: Machine Learning Analysis

Some of these companies subsequently returned the money, for various reasons:

  • They did not need the money.

  • Their stock prices were penalized for taking the loans, and when they returned the money, the market rewarded them.

  • They wished to avoid additional regulatory oversight.

Returning the money signals an improvement in business conditions, with a subsequent uptick in stock prices. Thus, an exercise to predict which firms would return the money based on their SEC filings might be of interest.

The following code cells prepare the dataset for the ML study: they flag all filings of companies that returned the PPP money with a 1 and all other filings with a 0. An ML model fit to these labels therefore teases out whether the text of companies that retained PPP money is distinguishable from the text of companies that returned it.

The resultant dataframe from the previous steps is stored as a CSV file titled ppp_model_TabText.csv (354 MB). This file contains both text and numerical columns of data.

[ ]:
tab_text = pd.read_csv("ppp_10kq_8k_stock_data.csv")

ppp_tickers_returned = pd.read_excel(
    "ppp_tickers_returned.xlsx", index_col=None, sheet_name=0, engine="openpyxl"
)
print("Number of PPP Returned tickers =", ppp_tickers_returned.shape[0])
ticker_list_returned = list(set(ppp_tickers_returned.ticker))
[ ]:
tab_text["returned"] = [1 if j in ticker_list_returned else 0 for j in tab_text["ticker"]]
tab_text
[ ]:
tab_text["returned"] = [1 if j in ticker_list_returned else 0 for j in tab_text["ticker"]]
tab_text = tab_text.drop(["Date"], axis=1)
tab_text.to_csv("ppp_model_TabText.csv", index=False)

You can start by examining the mean return in the 5 days before the filing (-5, 0) and the 5 days after the filing (0, +5) to see how the firms that returned the money fared compared to those that did not. You’ll learn how the mean excess return (over the S&P 500 return) is calculated for the two groups.

[ ]:
df = pd.read_csv("ppp_model_TabText.csv")
print(df.shape)
print(df.columns)

Next, the following cell curates the TabText dataframe by creating a cumulative (net-of-market) return for the 5 days before the filing (df["First5"]) and the 5 days after the filing (df["Second5"]). The resulting feature columns are shown in the output of the cell.

[ ]:
# Add up the net returns for days (-5,0), denoted "First5", and days (0,5), denoted "Second5".
# Note that each window actually contains 6 days of returns because it includes the filing date.
df["First5"] = (
    df["NetRet-5"]
    + df["NetRet-4"]
    + df["NetRet-3"]
    + df["NetRet-2"]
    + df["NetRet-1"]
    + df["NetRet0"]
)
df["Second5"] = (
    df["NetRet5"] + df["NetRet4"] + df["NetRet3"] + df["NetRet2"] + df["NetRet1"] + df["NetRet0"]
)
df.head()
[ ]:
res = df.groupby(["returned"]).count()["ticker"]
print(res)
print("Baseline accuracy =", res[0] / sum(res))
[ ]:
df.groupby(["returned"]).mean()[["First5", "Second5"]]

From the output of the preceding cell, the mean return for the "First5" set is slightly worse for the "returned=0" case and the mean return for the "Second5" set is higher for the "returned=1" case. Maybe firms that returned the money were signalling to the market that they were in good shape and the market rewarded them with a stock price bounce.

Step 6: Machine Learning on the TabText Dataframe

In this notebook, an AutoGluon model is used to analyze how the leading 5-day stock returns (numerical data) and the 10-K, 10-Q, and 8-K filings (text) are related to each company’s decision to accept or return the money.

Train an AutoGluon Model for Classification

Here, you’ll see how easy it is to perform seamless ML on multimodal (TabText) data. In this section, you’ll learn how to use one of the open-source AWS libraries, AutoGluon, which is part of the GluonNLP family of tools. To learn more, see GluonNLP: NLP made easy.

In particular, we use the AutoGluon-Tabular model, which is designed for TabText and has superior performance. For more information about the model, see AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data.

For a quick start, see Predicting Columns in a Table - Quick Start. To find the AutoGluon-Tabular model in AWS Marketplace, see AutoGluon-Tabular.

The AutoGluon-Tabular model processes the data and trains a diverse ensemble of ML models to create a “predictor” which is able to predict the "returned" label in this data. This example uses both return and text data to build a model.

Create a sample dataset

For demonstration purposes, take a sample from the original dataset to reduce the time for training.

[ ]:
sample_df = pd.concat(
    [df[df["returned"] == 1].sample(n=500), df[df["returned"] == 0].sample(n=500)]
).sample(frac=1)

Save the dataframe into a CSV file.

[ ]:
sample_df.to_csv("ppp_model_sample_input.csv", index=False)

Split the sample dataset into a training dataset and a test dataset

[ ]:
from sklearn.model_selection import train_test_split

sample_df_ag = sample_df[["First5", "text", "returned"]]
train_data, test_data = train_test_split(sample_df_ag, test_size=0.2, random_state=123)
[ ]:
train_data.to_csv("train_data.csv", index=False)
test_data.to_csv("test_data.csv", index=False)

train_s3_path = session.upload_data(
    "train_data.csv", bucket=bucket, key_prefix=ppp_folder + "/" + "data"
)
test_s3_path = session.upload_data(
    "test_data.csv", bucket=bucket, key_prefix=ppp_folder + "/" + "data"
)
config_s3_path = session.upload_data(
    os.path.join("code", "config.yaml"), bucket=bucket, key_prefix=ppp_folder + "/" + "config"
)

Run a SageMaker training job

The training job takes around 10 minutes with the sample dataset. If you want to train a model with your own data, you might need to update the training script train.py or the configuration file config.yaml in the code folder. If you want to use a GPU instance to achieve better accuracy, replace train_instance_type with the desired GPU instance type.
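
Before launching the SageMaker job, the optional cell below is a minimal local sketch of roughly what the training script does with AutoGluon-Tabular. The actual logic in code/train.py may differ, and this cell works only if the autogluon package is installed in the notebook kernel.

[ ]:
# Optional local sketch (assumes autogluon is installed in this kernel);
# the real training logic lives in code/train.py and may differ from this.
from autogluon.tabular import TabularPredictor

local_predictor = TabularPredictor(label="returned", eval_metric="accuracy").fit(train_data)
local_predictor.leaderboard(test_data, silent=True)  # per-model scores on the held-out data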

[ ]:
ag = AutoGluonTraining(
    role=role,
    entry_point="code/train.py",
    region=region,
    instance_count=1,
    instance_type=train_instance_type,
    framework_version="0.3.1",
    py_version="py37",
    base_job_name="jumpstart-example-classic-gecko-ppp",
    enable_network_isolation=True,  # run the training container in network isolation for a secure environment
)

ag.fit(
    {"config": config_s3_path, "train": train_s3_path, "test": test_s3_path},
)

Download Model Outputs

Download the following files (training job artifacts) from the SageMaker session’s default S3 bucket:

  • leaderboard.csv

  • predictions.csv

  • feature_importance.csv

  • evaluation.json

[ ]:
s3_client = boto3.client("s3")
job_name = ag._current_job_name
s3_client.download_file(bucket, f"{job_name}/output/output.tar.gz", "output.tar.gz")

with tarfile.open("output.tar.gz", "r:gz") as so:
    so.extractall()

Score details of each model

[ ]:
leaderboard = pd.read_csv("leaderboard.csv")
leaderboard

The result of the training evaluation

[ ]:
with open("evaluation.json") as f:
    data = json.load(f)
print(data)

The evaluation.json file reports all the usual metrics as well as the Matthews correlation coefficient (MCC). The MCC is a more comprehensive metric for an unbalanced dataset. It ranges from -1 to +1, where -1 implies perfect misclassification and +1 perfect classification.

Reference: Davide Chicco & Giuseppe Jurman (2020), The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation, BMC Genomics volume 21, Article number: 6

Note: Various metrics are discussed in Receiver operating characteristic in Wikipedia, the free encyclopedia.
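
As a quick sanity check, the MCC can also be computed locally with scikit-learn; the labels below are toy values used only for illustration.

[ ]:
# Toy illustration of the Matthews correlation coefficient (not part of the training job)
from sklearn.metrics import matthews_corrcoef

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
print("MCC =", matthews_corrcoef(y_true, y_pred))  # +1 = perfect, -1 = perfect misclassification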

Step 7: Deploy the endpoint

In this step, we deploy the model artifact from Step 6 and use it for inference. We use the AutoGluonInferenceModel defined in ag_model.py to create an AutoGluon model, and the SageMaker model deployment APIs to deploy an endpoint. If you bring your own data for inference, you may also need to update the inference script inference.py in the code folder.

[ ]:
training_job_name = ag.latest_training_job.name
print("Training job name: ", training_job_name)
[ ]:
ag_estimator = Estimator.attach(training_job_name)
ag_estimator.model_data
[ ]:
endpoint_name = "jumpstart-example-classic-gecko-ppp-endpoint"

ag_model = AutoGluonInferenceModel(
    model_data=ag.model_data,
    role=role,
    region=region,
    framework_version="0.3.1",
    instance_type=inference_instance_type,
    entry_point="code/inference.py",
    predictor_cls=AutoGluonTabularPredictor,
    name="jumpstart-example-classic-gecko-ppp-model",
)

ppp_predictor = ag_model.deploy(
    initial_instance_count=1,
    instance_type=inference_instance_type,
    endpoint_name=endpoint_name,
)

Step 8: Test the endpoint

We randomly select some data from the test dataset and test the endpoint.

[ ]:
test_endpoint_data = test_data.sample(n=5).drop(["returned"], axis=1)
[ ]:
test_endpoint_data
[ ]:
ppp_predictor.predict(test_endpoint_data.values)

Clean Up

After you are done using this notebook, delete the model artifacts and other resources to avoid incurring charges.

Caution: You need to manually delete resources that you may have created while running the notebook, such as Amazon S3 buckets for model artifacts, training datasets, processing artifacts, and Amazon CloudWatch log groups.

For more information about cleaning up resources, see Clean Up in the Amazon SageMaker Developer Guide.

[ ]:
ppp_predictor.delete_model()
ppp_predictor.delete_endpoint()

License

The SageMaker JumpStart Industry product and its related materials are under the Legal License Terms.
