Text Explainability for SageMaker BlazingText

This notebook will not work in SageMaker Studio. Please run it on a SageMaker notebook instance and set the kernel to conda_python3.

Amazon SageMaker Clarify helps improve your machine learning models by detecting potential bias and helping explain how these models make predictions. The fairness and explainability functionality provided by SageMaker Clarify takes a step towards enabling AWS customers to build trustworthy and understandable machine learning models. The product comes with the tools to help you with the following tasks.

  • Measure biases that can occur during each stage of the ML lifecycle (data collection, model training and tuning, and monitoring of ML models deployed for inference).

  • Generate model governance reports targeting risk and compliance teams and external regulators.

  • Provide explanations of the data, models, and monitoring used to assess predictions for input containing data of various modalities like numerical data, categorical data, text, and images.

Learn more about SageMaker Clarify here. This sample notebook walks you through:

  1. Key terms and concepts needed to understand SageMaker Clarify

  2. The incremental updates required to explain text features, along with other tabular features.

  3. Explaining the importance of the various new input features on the model’s decision

SageMaker Clarify currently accepts only CSV and JSON Lines as input formats, while the SageMaker BlazingText algorithm container accepts only JSON as its inference input format, so the two cannot be used together directly. To use Clarify, we therefore have the two options below.

Option A

Architecture

This notebook follows this approach. It first trains a text classification model with SageMaker BlazingText using the SageMaker Estimator in the SageMaker Python SDK. It then hosts the trained model in a bring-your-own-container (BYOC) SageMaker container built around fastText, which Clarify uses to run the explainability job. The BYOC container accepts inference requests in CSV format, allowing SageMaker Clarify to explain the inference results of the SageMaker BlazingText model.

Option B

Architecture

An alternate approach, which scales to other model types, is to use a SageMaker Inference Pipeline with a pre-processing container that converts from a format Clarify can produce (i.e. JSON Lines or CSV) to the format the underlying model accepts (i.e. JSON for the SageMaker BlazingText algorithm). This approach is not demonstrated in this notebook, but it can be applied to other model types.
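For reference, the conversion such a pre-processing container would perform is small. The sketch below is purely illustrative (it is not part of this notebook's code); it wraps one CSV text record in the JSON request format that the SageMaker BlazingText algorithm expects for inference.

import json


def csv_to_blazingtext_request(csv_line, k=1):
    # Illustrative helper: wrap one CSV text record in the JSON request format
    # accepted by the SageMaker BlazingText inference endpoint.
    return json.dumps({"instances": [csv_line.strip()], "configuration": {"k": k}})


print(csv_to_blazingtext_request("convair was an american aircraft manufacturing company ."))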

Training Text Classification Model using SageMaker BlazingText Algorithm

Text classification can be used to solve various use cases like sentiment analysis, spam detection, hashtag prediction, etc. This notebook demonstrates the use of SageMaker BlazingText to perform supervised binary or multi-class text classification, with single or multiple labels. BlazingText can train the model on more than a billion words in a couple of minutes using a multi-core CPU or a GPU, while achieving performance on par with the state-of-the-art deep learning text classification algorithms. BlazingText extends the fastText text classifier to leverage GPU acceleration using custom CUDA kernels.

Install required libraries

[ ]:
! pip install sagemaker botocore boto3 awscli --upgrade

Setup

Let’s start by specifying:

  • The S3 bucket and prefix that you want to use for training and model data. This should be within the same region as the Notebook Instance, training, and hosting. If you don’t specify a bucket, SageMaker SDK will create a default bucket following a pre-defined naming convention in the same region.

  • The IAM role ARN used to give SageMaker access to your data. It can be fetched using the get_execution_role method from the SageMaker Python SDK.

[ ]:
import sagemaker
from sagemaker import get_execution_role
import json
import boto3

sess = sagemaker.Session()

role = get_execution_role()
print(
    role
)  # This is the role that SageMaker would use to leverage AWS resources (S3, CloudWatch) on your behalf

bucket = sess.default_bucket()  # Replace with your own bucket name if needed
print(bucket)
prefix = "blazingtext/supervised"  # Replace with the prefix under which you want to store the data if needed

Data Preparation

Now we'll download a dataset from the web on which we want to train the text classification model. BlazingText expects a single preprocessed text file with space-separated tokens; each line of the file should contain a single sentence and the corresponding label(s) prefixed by __label__.
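For example, a single line of the preprocessed training file looks roughly like this (the sentence is purely illustrative; the real lines are produced by the preprocessing step below):

__label__Company  acme widgets inc is a company that manufactures widgets and is headquartered in springfield .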

In this example, let us train the text classification model on the DBPedia Ontology Dataset as done by Zhang et al. The DBpedia ontology dataset is constructed by picking 14 non-overlapping classes from DBpedia 2014. It has 560,000 training samples and 70,000 testing samples. The fields used for this dataset are the title and abstract of each Wikipedia article.

The below dataset is attributed to this original source and is shared under this license.

[ ]:
!wget https://sagemaker-sample-files.s3.us-east-1.amazonaws.com/datasets/text/dbpedia/dbpedia.tar.gz
[ ]:
from sagemaker.s3 import S3Downloader as downloader

downloader.download(
    s3_uri="s3://sagemaker-sample-files/datasets/text/dbpedia/dbpedia.tar.gz",
    local_path=".",
    sagemaker_session=sess,
)
[ ]:
!tar -xzvf dbpedia.tar.gz
[ ]:
!head dbpedia_csv/train.csv -n 3
[ ]:
!cat dbpedia_csv/classes.txt
[ ]:
index_to_label = {}
with open("dbpedia_csv/classes.txt") as f:
    for i, label in enumerate(f.readlines()):
        index_to_label[str(i + 1)] = label.strip()
print(index_to_label)

Data Preprocessing

We need to preprocess the training data into a space-separated tokenized text format that can be consumed by the BlazingText algorithm. Also, as mentioned previously, the class label(s) should be prefixed with __label__ and appear on the same line as the original sentence. We'll use the nltk library to tokenize the input sentences from the DBPedia dataset.

[ ]:
from random import shuffle
import multiprocessing
from multiprocessing import Pool
import csv
import nltk

nltk.download("punkt")
[ ]:
def transform_instance(row):
    cur_row = []
    label = "__label__" + index_to_label[row[0]]  # Look up the class name for this index and prefix it with __label__
    cur_row.append(label)
    cur_row.extend(nltk.word_tokenize(row[1].lower()))
    cur_row.extend(nltk.word_tokenize(row[2].lower()))
    return cur_row
[ ]:
def preprocess(input_file, output_file, keep=1):
    all_rows = []
    with open(input_file, "r") as csvinfile:
        csv_reader = csv.reader(csvinfile, delimiter=",")
        for row in csv_reader:
            all_rows.append(row)
    shuffle(all_rows)
    all_rows = all_rows[: int(keep * len(all_rows))]
    pool = Pool(processes=multiprocessing.cpu_count())
    transformed_rows = pool.map(transform_instance, all_rows)
    pool.close()
    pool.join()

    with open(output_file, "w") as csvoutfile:
        csv_writer = csv.writer(csvoutfile, delimiter=" ", lineterminator="\n")
        csv_writer.writerows(transformed_rows)
[ ]:
%%time

# Preparing the training dataset

# Since preprocessing the whole dataset might take a couple of minutes,
# we keep 20% of the training dataset for this demo.
# Set keep to 1 if you want to use the complete dataset
preprocess("dbpedia_csv/train.csv", "dbpedia.train", keep=0.2)

# Preparing the validation dataset
preprocess("dbpedia_csv/test.csv", "dbpedia.validation")
[ ]:
%%time

train_channel = prefix + "/train"
validation_channel = prefix + "/validation"

sess.upload_data(path="dbpedia.train", bucket=bucket, key_prefix=train_channel)
sess.upload_data(path="dbpedia.validation", bucket=bucket, key_prefix=validation_channel)

s3_train_data = "s3://{}/{}".format(bucket, train_channel)
s3_validation_data = "s3://{}/{}".format(bucket, validation_channel)
[ ]:
s3_output_location = "s3://{}/{}/output".format(bucket, prefix)
print(s3_output_location)

Training

Now that we are done with all the required setup, we are ready to train our text classifier. To begin, let us create a sagemaker.estimator.Estimator object. This estimator will launch the training job.

[ ]:
region_name = boto3.Session().region_name
[ ]:
container = sagemaker.amazon.amazon_estimator.get_image_uri(region_name, "blazingtext", "latest")
print("Using SageMaker BlazingText container: {} ({})".format(container, region_name))

Training the BlazingText model for supervised text classification

Similar to the original implementation of Word2Vec, SageMaker BlazingText provides an efficient implementation of the continuous bag-of-words (CBOW) and skip-gram architectures using negative sampling, on CPUs and additionally on GPUs. The GPU implementation uses highly optimized CUDA kernels. To learn more, please refer to *BlazingText: Scaling and Accelerating Word2Vec using Multiple GPUs*.

BlazingText also supports a supervised mode for text classification. It extends the FastText text classifier to leverage GPU acceleration using custom CUDA kernels. The model can be trained on more than a billion words in a couple of minutes using a multi-core CPU or a GPU, while achieving performance on par with the state-of-the-art deep learning text classification algorithms. For more information, please refer to the algorithm documentation.

To summarize, the following modes are supported by BlazingText on different instance types:

Modes                   | cbow (supports subwords training) | skipgram (supports subwords training) | batch_skipgram | supervised
Single CPU instance     | ✔                                 | ✔                                     | ✔              | ✔
Single GPU instance     | ✔                                 | ✔                                     |                | ✔ (Instance with 1 GPU only)
Multiple CPU instances  |                                   |                                       | ✔              |

Now, let's define the SageMaker Estimator with resource configurations and hyperparameters to train the text classification model on the DBPedia dataset, using "supervised" mode on an ml.c4.4xlarge instance.

Refer to BlazingText Hyperparameters in the Amazon SageMaker documentation for the complete list of hyperparameters.

[ ]:
bt_model = sagemaker.estimator.Estimator(
    container,
    role,
    instance_count=1,
    instance_type="ml.c4.4xlarge",
    volume_size=30,
    max_run=360000,
    input_mode="File",
    output_path=s3_output_location,
    hyperparameters={
        "mode": "supervised",
        "epochs": 1,
        "min_count": 2,
        "learning_rate": 0.05,
        "vector_dim": 10,
        "early_stopping": True,
        "patience": 4,
        "min_epochs": 5,
        "word_ngrams": 2,
    },
)
[ ]:
train_data = sagemaker.inputs.TrainingInput(
    s3_train_data,
    distribution="FullyReplicated",
    content_type="text/plain",
    s3_data_type="S3Prefix",
)
validation_data = sagemaker.inputs.TrainingInput(
    s3_validation_data,
    distribution="FullyReplicated",
    content_type="text/plain",
    s3_data_type="S3Prefix",
)
data_channels = {"train": train_data, "validation": validation_data}

We have our Estimator object, we have set its hyperparameters, and we have linked our data channels with the algorithm. The only thing left to do is train the algorithm, which the following command does. The estimated total training time for this job configuration is approximately 10 minutes.

Training involves a few steps. First, the instance requested when creating the Estimator is provisioned and set up with the appropriate libraries. Then the data from our channels is downloaded onto the instance. Once this is done, the training job begins. Provisioning and data download take some time, depending on the size of the data, so it might be a few minutes before we start seeing training logs. Once the job has executed min_epochs epochs, the logs also print the accuracy on the validation data after every epoch. This metric is a proxy for the quality of the algorithm.

Once the job has finished, a "Job complete" message will be printed. The trained model can be found in the S3 location that was set as output_path in the estimator.

[ ]:
bt_model.fit(inputs=data_channels, logs=True)
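After the job completes, the S3 URI of the trained model artifact is available on the estimator; it is the same location we will hand to the BYOC model later in this notebook.

# Location of the model.tar.gz produced under the output_path configured above
print(bt_model.model_data)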

Packaging and Uploading your own container for Inference with Amazon SageMaker

An overview of Docker

If you’re familiar with Docker already, you can skip ahead to the next section.

For many data scientists, Docker containers are a new concept, but they are not difficult, as you’ll see here.

Docker provides a simple way to package arbitrary code into an image that is totally self-contained. Once you have an image, you can use Docker to run a container based on that image. Running a container is just like running a program on the machine except that the container creates a fully self-contained environment for the program to run. Containers are isolated from each other and from the host environment, so the way you set up your program is the way it runs, no matter where you run it.

Docker is more powerful than environment managers like conda or virtualenv because (a) it is completely language independent and (b) it comprises your whole operating environment, including startup commands, environment variables, etc.

In some ways, a Docker container is like a virtual machine, but it is much lighter weight. For example, a program running in a container can start in less than a second and many containers can run on the same physical machine or virtual machine instance.

Docker uses a simple file called a Dockerfile to specify how the image is assembled. We’ll see an example of that below. You can build your Docker images based on Docker images built by yourself or others, which can simplify things quite a bit.

Docker has become very popular in the programming and devops communities for its flexibility and well-defined specification of the code to be run.

Amazon SageMaker uses Docker to allow users to train and deploy arbitrary algorithms.

In Amazon SageMaker, Docker containers are invoked in a certain way for training and a slightly different way for hosting. The following sections outline how to build containers for the SageMaker environment.

How Amazon SageMaker runs your Docker container

Because you can run the same image in training or hosting, Amazon SageMaker runs your container with the argument train or serve. How your container processes this argument depends on the container:

  • If you specify a program as an ENTRYPOINT in the Dockerfile, that program will be run at startup and its first argument will be train or serve. The program can then look at that argument and decide what to do (a minimal sketch of this dispatch follows this list).

  • If you are building separate containers for training and hosting (or building only for one or the other), you can define a program as an ENTRYPOINT in the Dockerfile and ignore (or verify) the first argument passed in.
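A hedged sketch of such a dispatching entrypoint in Python is shown below; the paths and script names are hypothetical, and this notebook's container is hosting-only, so it does not actually need one.

#!/usr/bin/env python
# Hypothetical entrypoint for a single image used for both training and hosting.
# SageMaker passes "train" or "serve" as the first argument; dispatch on it.
import subprocess
import sys

if __name__ == "__main__":
    mode = sys.argv[1] if len(sys.argv) > 1 else "serve"
    if mode == "train":
        subprocess.check_call(["python", "/opt/program/train.py"])  # hypothetical training script
    else:
        subprocess.check_call(["/opt/program/serve"])  # hypothetical serving launcher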

Hosting has a very different model than training because hosting is responding to inference requests that come in via HTTP. In this example, we use our recommended Python serving stack to provide robust and scalable serving of inference requests:

Request serving stack

This stack is implemented in the sample code here and you can mostly just leave it alone.

Amazon SageMaker uses two URLs in the container:

  • /ping will receive GET requests from the infrastructure. Your program returns 200 if the container is up and accepting requests.

  • /invocations is the endpoint that receives client inference POST requests. The format of the request and the response is up to the algorithm. If the client supplied ContentType and Accept headers, these will be passed in as well.

The container will have the model files in the same place they were written during training:

/opt/ml
`-- model
    `-- <model files>
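To make the hosting contract concrete, here is a hedged sketch of a minimal Flask app that answers the two URLs above and loads the fastText-compatible BlazingText artifact from /opt/ml/model. It is an illustration only; the real logic for this notebook lives in the container's predictor.py, described in the next section. The fasttext package and the "label"/"prob" response fields are assumptions (chosen to match how Clarify is configured later in this notebook).

# Illustrative sketch only -- the container's predictor.py is the real implementation.
# Assumes the `flask` and `fasttext` packages are installed in the image and that the
# BlazingText supervised artifact was extracted to /opt/ml/model/model.bin.
import json

import fasttext
import flask

model = fasttext.load_model("/opt/ml/model/model.bin")
app = flask.Flask(__name__)


@app.route("/ping", methods=["GET"])
def ping():
    # Report healthy (HTTP 200) as long as the model loaded successfully.
    return flask.Response(response="\n", status=200, mimetype="application/json")


@app.route("/invocations", methods=["POST"])
def invocations():
    # One text record per line of the CSV payload.
    data = flask.request.data.decode("utf-8")
    predictions = []
    for line in data.strip().split("\n"):
        if not line:
            continue
        labels, probs = model.predict(line)
        predictions.append(
            {"label": labels[0].replace("__label__", ""), "prob": float(probs[0])}
        )
    body = "\n".join(json.dumps(p) for p in predictions)
    return flask.Response(response=body, status=200, mimetype="application/jsonlines")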

The parts of the sample container

In the container directory are all the components you need to package the sample algorithm for Amazon SageMaker:

.
|-- Dockerfile
|-- build_and_push.sh
`-- blazing_text
    |-- nginx.conf
    |-- predictor.py
    |-- serve
    `-- wsgi.py

Let’s discuss each of these in turn:

  • ``Dockerfile`` describes how to build your Docker container image. More details below.

  • ``build_and_push.sh`` is a script that uses the Dockerfile to build your container image and then pushes it to Amazon ECR. We'll invoke the commands directly later in this notebook, but you can just copy and run the script for your own algorithms.

  • ``blazing_text`` is the directory which contains the files that will be installed in the container.

  • ``local_test`` is a directory that shows how to test your new container on any computer that can run Docker, including an Amazon SageMaker notebook instance. Using this method, you can quickly iterate using small datasets to eliminate any structural bugs before you use the container with Amazon SageMaker. We’ll walk through local testing later in this notebook. Please copy the model.bin file (output of the training job) to the container/local_test/test_dir/model directory.

In this simple application, we only install four files in the container. You may only need that many or, if you have many supporting routines, you may wish to install more. These four show the standard structure of our Python containers, although you are free to choose a different toolset and therefore could have a different layout. If you're writing in a different programming language, you'll certainly have a different layout depending on the frameworks and tools you choose.

The files that we’ll put in the container are:

  • ``nginx.conf`` is the configuration file for the nginx front-end. Generally, you should be able to take this file as-is.

  • ``predictor.py`` is the program that actually implements the Flask web server and the text classification predictions for this app. You'll want to customize the actual prediction parts for your application. Since this algorithm is simple, we do all the processing here in this file, but you may choose to have separate files for implementing your custom logic.

  • ``serve`` is the program started when the container is started for hosting. It simply launches the gunicorn server which runs multiple instances of the Flask app defined in predictor.py. You should be able to take this file as-is.

  • ``wsgi.py`` is a small wrapper used to invoke the Flask app. You should be able to take this file as-is.

In summary, there is one file you will probably want to change for your application - predictor.py.

The Dockerfile

The Dockerfile describes the image that we want to build. You can think of it as describing the complete operating system installation of the system that you want to run. A running Docker container is quite a bit lighter weight than a full operating system, however, because it takes advantage of Linux on the host machine for the basic operations.

For the Python science stack, we will start from a standard Ubuntu installation and run the normal tools to install the things our inference code needs (in this case, fastText and its dependencies). Finally, we add the code that implements our specific algorithm to the container and set up the right environment for it to run under.

Along the way, we clean up extra space. This makes the container smaller and faster to start.

Let’s look at the Dockerfile for the example:

[ ]:
!cat container/Dockerfile

Building and registering the container

The following shell code shows how to build the container image using docker build and push the container image to ECR using docker push. This code is also available as the shell script container/build_and_push.sh, which you can run as build_and_push.sh sagemaker-blazing-text to build the image sagemaker-blazing-text.

This code looks for an ECR repository in the account you’re using and the current default region (if you’re using a SageMaker notebook instance, this will be the region where the notebook instance was created). If the repository doesn’t exist, the script will create it.

[ ]:
algorithm_name = "sagemaker-blazing-text"
[ ]:
byoc_image_uri = "{}.dkr.ecr.{}.amazonaws.com/{}:latest".format(
    sess.account_id(), region_name, algorithm_name
)
print(byoc_image_uri)
[ ]:
%%sh

# The name of our algorithm
algorithm_name=sagemaker-blazing-text

cd container

chmod +x blazing_text/serve

account=$(aws sts get-caller-identity --query Account --output text)

# Get the region defined in the current configuration (default to us-west-2 if none defined)
region=$(aws configure get region)
region=${region:-us-west-2}

fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"

# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
fi

# Get the login command from ECR and execute it directly
aws ecr get-login-password --region ${region}|docker login --username AWS --password-stdin ${fullname}

# Build the docker image locally with the image name and then push it to ECR
# with the full name.

docker build  -t ${algorithm_name} .
docker tag ${algorithm_name} ${fullname}

docker push ${fullname}

Testing your algorithm on your local machine or on an Amazon SageMaker notebook instance

When you first package an algorithm for use with Amazon SageMaker, you probably want to test it yourself to make sure it's working correctly. The directory container/local_test provides a framework for doing this. It includes shell scripts for running and using the container and a directory structure that mimics the one outlined above.

The scripts are:

  • serve_local.sh: Run this with the name of the image once you’ve trained the model and it should serve the model. For example, you can run $ ./serve_local.sh sagemaker-blazing-text. It will run and wait for requests. Simply use the keyboard interrupt to stop it.

  • predict.sh: Run this with the name of a payload file and (optionally) the HTTP content type you want. The content type will default to text/csv. For example, you can run $ ./predict.sh payload.csv text/csv.

The directories as shipped are set up to test the blazingtext sample algorithm presented here.
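If you prefer to drive the locally served container from Python rather than predict.sh, a roughly equivalent hedged sketch (assuming the requests package is available and the serving stack listens on the default port 8080) is:

# Hedged Python equivalent of `./predict.sh payload.csv text/csv`, run against a
# container started with `./serve_local.sh sagemaker-blazing-text`.
# Assumes the server listens on localhost:8080.
import requests

with open("container/local_test/payload.csv") as f:
    payload = f.read()

response = requests.post(
    "http://localhost:8080/invocations",
    data=payload,
    headers={"Content-Type": "text/csv"},
)
print(response.status_code)
print(response.text)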

Deploy your trained SageMaker BlazingText Model using your own container in Amazon SageMaker

Once you have your container packaged, you can use it for hosting the model for real-time inference. Let’s do that with the algorithm we made above.

Set up the environment

Here we specify the S3 prefix to use and the IAM role that will be used for working with SageMaker.

[ ]:
# S3 prefix
prefix = "byom-blazingtext"

# Define IAM role
import boto3
import re

import os
import numpy as np
import pandas as pd
from sagemaker import get_execution_role

role = get_execution_role()

Hosting your model

You can use a trained model to get real-time predictions via an HTTP endpoint. The following steps walk you through the process.

Deploy the model

Deploying the model to SageMaker hosting just requires a deploy call on the fitted model. This call takes an instance count, an instance type, and optionally a serializer and a deserializer. These are used when the resulting predictor is created on the endpoint.

[ ]:
import sagemaker
from datetime import datetime
from sagemaker.model import Model
from sagemaker.predictor import Predictor

model_name = "blazing-text-byo-model-clarify-demo-{}".format(
    datetime.now().strftime("%d-%m-%Y-%H-%M-%S")
)
print(model_name)
sagemaker_session = sagemaker.Session()
sagemaker_model = Model(
    model_data=bt_model.model_data,
    role=role,
    image_uri=byoc_image_uri,
    sagemaker_session=sagemaker_session,
    predictor_cls=Predictor,
    name=model_name,
)
[ ]:
from sagemaker.serializers import CSVSerializer

predictor = sagemaker_model.deploy(
    initial_instance_count=1, instance_type="ml.m4.xlarge", serializer=CSVSerializer()
)

Choose some data and use it for a prediction

In order to do some predictions, we’ll extract some of the data we used for training and do predictions against it. This is, of course, bad statistical practice, but a good way to see how the mechanism works.

[ ]:
shape = pd.read_csv("container/local_test/payload.csv", header=None)
shape.sample(2)

Prediction is as easy as calling predict with the predictor we got back from deploy and the data we want to do predictions with. The serializers take care of doing the data conversions for us.
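As a hedged sketch (it assumes the BYOC container accepts a single text record per request and returns a JSON Lines body with label and prob fields, which is how Clarify is configured to read the responses below), a single real-time prediction looks like this:

# Take the first record from the payload file loaded above and send it to the endpoint.
sample_text = shape.iloc[0, 0]
print(sample_text)
print(predictor.predict(sample_text))

The cells that follow load the same payload file and configure SageMaker Clarify to explain the model's predictions.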

[ ]:
import pandas as pd

file_path = "container/local_test/payload.csv"

df_test_clarify = pd.read_csv(file_path)
[ ]:
from sagemaker import clarify

clarify_processor = clarify.SageMakerClarifyProcessor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    sagemaker_session=sagemaker_session,
)

model_config = clarify.ModelConfig(
    model_name=model_name,
    instance_type="ml.m5.xlarge",
    instance_count=1,
    accept_type="application/jsonlines",
    content_type="text/csv",
    endpoint_name_prefix=None,
)

explainability_output_path = "s3://{}/{}/clarify-text-explainability".format(
    sagemaker_session.default_bucket(), "explainability"
)
explainability_data_config = clarify.DataConfig(
    s3_data_input_path=file_path,
    s3_output_path=explainability_output_path,
    headers=["Review Text"],
    dataset_type="text/csv",
)
[ ]:
shap_config = clarify.SHAPConfig(
    baseline=[["<UNK>"]],
    num_samples=1000,
    agg_method="mean_abs",
    save_local_shap_values=True,
    text_config=clarify.TextConfig(granularity="token", language="english"),
)
[ ]:
from sagemaker.clarify import ModelPredictedLabelConfig

modellabel_config = ModelPredictedLabelConfig(probability="prob", label="label")
[ ]:
# Running the Clarify explainability job involves spinning up a processing job and a model endpoint, which may take a few minutes.
# After that you will see a progress bar for the SHAP computation.
# The size of the dataset (num_examples) and num_samples for SHAP affect the running time.
clarify_processor.run_explainability(
    data_config=explainability_data_config,
    model_config=model_config,
    explainability_config=shap_config,
    model_scores=modellabel_config,
)
[ ]:
import json

local_feature_attributions_file = "out.jsonl"
analysis_result = sagemaker.s3.S3Downloader.download(
    explainability_output_path + "/explanations_shap/" + local_feature_attributions_file,
    local_path="./",
)

shap_out = []
file = sagemaker.s3.S3Downloader.read_file(
    explainability_output_path + "/explanations_shap/" + local_feature_attributions_file
)
for line in file.split("\n"):
    if line:
        shap_out.append(json.loads(line))
[ ]:
print(json.dumps(shap_out[0], indent=2))

At the highest level of each JSON Line there are two keys: explanations and join_source_value (the latter is not present here because we did not include a joinsource column in the input dataset). explanations contains a list of attributions, one per feature in the dataset; in this case it has a single element, because the input dataset has a single feature. Each entry also contains details such as feature_name and the data_type of the feature (indicating whether Clarify inferred the column as numerical, categorical, or text). Each token attribution additionally contains a description field with the token itself and the starting index of the token in the original input, which allows you to reconstruct the original sentence from the output.
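As a hedged sketch (the key names below are assumed from typical Clarify text-explainability output; verify them against the record printed above before relying on them), you can pair each token with its attribution and recover its position in the original text:

# Key names ("explanations", "attributions", "attribution", "description",
# "partial_text", "start_idx") are assumptions based on typical Clarify text
# output -- confirm them against the printed record above.
record = shap_out[0]
for explanation in record["explanations"]:
    print("Feature:", explanation.get("feature_name"))
    for attr in explanation["attributions"]:
        token = attr["description"]["partial_text"]
        start = attr["description"]["start_idx"]
        shap_values = attr["attribution"]  # per-label SHAP values for this token
        print(f"{start:>6}  {token:<20} {shap_values}")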

Conclusion

In this notebook, you learned how to use SageMaker Clarify to run a text explainability job for a SageMaker BlazingText model and obtain local and global explanations. This helps you understand how much each token, sentence, or phrase contributes to the prediction outcomes.

Cleanup

Finally, please remember to delete the Amazon SageMaker endpoint to avoid charges:

[ ]:
predictor.delete_endpoint()