Compiling HuggingFace models for AWS Inferentia with SageMaker Neo


AWS Inferentia is Amazon’s first custom silicon designed to accelerate deep learning workloads. It is built to provide high-performance inference in the cloud, to drive down the total cost of inference, and to make it easy for developers to integrate machine learning into their business applications. AWS Inferentia chips deliver up to 2.3x higher throughput and up to 70% lower cost per inference than comparable current-generation GPU-based Amazon EC2 instances, as we will confirm in this example notebook.

AWS Neuron is a software development kit (SDK) for running machine learning inference on AWS Inferentia chips. It consists of a compiler, a runtime, and profiling tools that enable developers to run high-performance, low-latency inference on AWS Inferentia-based Amazon EC2 Inf1 instances. Using Neuron, you can bring models trained in any popular framework (PyTorch, TensorFlow, MXNet) and run them optimally on Inferentia. Vision and NLP models are especially well supported, and features such as dynamic batching and DataParallel inference help you make the most efficient use of the hardware, as the sketch below illustrates.
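
As a hedged illustration of those features, the sketch below shows how a Neuron-compiled PyTorch model could be replicated across NeuronCores with torch.neuron.DataParallel on an Inf1 instance; the artifact name model_neuron.pt and the input shapes are assumptions made for this example, not files produced in this notebook.

import torch
import torch.neuron  # provided by the torch-neuron package (assumed installed on an Inf1 instance)

# Load a previously compiled TorchScript artifact (hypothetical file name)
model = torch.jit.load("model_neuron.pt")

# Replicate the model across the available NeuronCores; dynamic batching is enabled
# by default, so a larger batch is automatically split across the replicas
model_parallel = torch.neuron.DataParallel(model)

input_ids = torch.randint(0, 30000, (8, 512), dtype=torch.long)
attention_mask = torch.ones(8, 512, dtype=torch.long)

with torch.no_grad():
    logits = model_parallel(input_ids, attention_mask)

In SageMaker you will not write this code yourself: the Neo compilation job and the hosting container take care of it. It is still useful to know what the SDK offers if you ever deploy outside SageMaker.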

SageMaker Neo saves you the effort of DIY model compilation by extending the familiar SageMaker SDK APIs to enable easy compilation for a wide range of target platforms. This includes CPU- and GPU-based instances, but also Inf1 instances; in that case, SageMaker Neo uses the Neuron SDK under the hood to compile your model.

In this example notebook, we will deploy two HuggingFace NLP models for the task of paraphrase classification on SageMaker endpoints. One will be deployed on a GPU-accelerated instance with no changes to the model; the other will be compiled and deployed to an Inf1 instance on SageMaker. Finally, we will run a simple benchmark to compare the performance of the two endpoints in terms of latency and throughput.

Setting up our environment

We first install some required Python packages, including transformers. We also create a default SageMaker session and get our SageMaker execution role and default bucket.

[ ]:
!pip install -U transformers==4.15.0
!pip install -U sagemaker
# In this example we use torch 1.9, the latest version that Neo supports for the Inferentia target
!pip install -U torch==1.9.1

If you run this notebook in SageMaker Studio, you need to make sure ipywidgets is installed and restart the kernel, so please uncomment the code in the next cell, and run it.

[ ]:
# %%capture
# import IPython
# import sys

# !{sys.executable} -m pip install ipywidgets
# IPython.Application.instance().kernel.do_shutdown(True)  # restart the kernel so the changes take effect
[ ]:
import transformers
import sagemaker
import torch

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
sess_bucket = sagemaker_session.default_bucket()

Getting model from HuggingFace Model Hub

We choose one of the most downloaded models from the HuggingFace Model Hub for our experiments: distilbert-base-uncased. DistilBERT is a transformer model, smaller and faster than BERT, that was pretrained on the same corpus in a self-supervised fashion, using the BERT base model as a teacher in a knowledge distillation process. It is important to set the return_dict parameter to False when instantiating the model. In transformers v4.x this parameter is True by default, making the model return dict-like Python objects containing the model outputs instead of the standard tuples. Neuron compilation does not support dictionary-based model outputs, and compilation would fail if we didn’t explicitly set it to False (the quick check after the next cell illustrates the difference).

We also get the tokenizer corresponding to this same model, in order to create a sample input to trace our model.

[ ]:
tokenizer = transformers.AutoTokenizer.from_pretrained("distilbert-base-uncased")

model = transformers.AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", return_dict=False
)
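
As a quick, optional sanity check (not part of the original notebook flow), the snippet below confirms that with return_dict=False the forward pass returns a plain tuple of tensors, which is what Neuron tracing expects:

# Optional check: with return_dict=False the output is a tuple, not a ModelOutput dict
check_inputs = tokenizer("Just a quick check.", return_tensors="pt")
with torch.no_grad():
    check_outputs = model(**check_inputs)
print(type(check_outputs))  # <class 'tuple'>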

Tracing model with torch.jit and uploading to S3

[ ]:
from pathlib import Path

# Create directory for model artifacts
Path("traced_model/").mkdir(exist_ok=True)

We now create a sample input to jit.trace our model with PyTorch; this is a required step for SageMaker Neo to compile our model artifact, since the compilation job expects a tar.gz file containing the traced model.

Saving the model with a .pth extension is required.

[ ]:
# Prepare sample input for jit model tracing
seq_0 = "This is just sample text for model tracing, the length of the sequence does not matter because we will pad to the max length that Bert accepts."
seq_1 = seq_0
max_length = 512

tokenized_sequence_pair = tokenizer.encode_plus(
    seq_0, seq_1, max_length=max_length, padding="max_length", truncation=True, return_tensors="pt"
)

example = tokenized_sequence_pair["input_ids"], tokenized_sequence_pair["attention_mask"]

traced_model = torch.jit.trace(model.eval(), example)
traced_model.save("traced_model/model.pth")
[ ]:
!tar -czvf traced_model.tar.gz -C traced_model . && mv traced_model.tar.gz traced_model/

We upload the traced model tar.gz file to Amazon S3, where our compilation job will download it from.

[ ]:
traced_model_url = sagemaker_session.upload_data(
    path="traced_model/traced_model.tar.gz",
    key_prefix="neuron-experiments/bert-seq-classification/traced-model",
)

Understanding our inference code

Before we deploy any model, let’s check out the code we have written to do inference on a SageMaker endpoint, with a default uncompiled model.

[ ]:
!pygmentize code/inference_normal.py

As usual, we have a model_fn, which receives the model directory and is responsible for loading and returning the model; an input_fn and output_fn, in charge of pre-processing the input and checking the content types of the endpoint’s input and output; and a predict_fn, which receives the outputs of model_fn and input_fn (that is, the loaded model and the deserialized/pre-processed input data) and defines how the model will run inference.

In this case, notice that we load the model directly from the HuggingFace Model Hub for simplicity. model_fn returns a tuple containing both the model and its corresponding tokenizer. Both the model and the input data are sent .to(device), which can be a CPU or a GPU, as we can see in line 7 of the file. A hedged sketch of what such handlers can look like follows.
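
Since the file itself is only displayed via pygmentize above, here is a minimal, hypothetical sketch of what such handlers could look like for the uncompiled model (the exact contents of inference_normal.py may differ, and input_fn/output_fn are omitted for brevity):

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def model_fn(model_dir):
    # Load the model and tokenizer directly from the HuggingFace Model Hub
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased", return_dict=False
    ).to(device)
    return model, tokenizer


def predict_fn(input_data, model_and_tokenizer):
    model, tokenizer = model_and_tokenizer
    seq_0, seq_1 = input_data
    # Both the encoded inputs and the model live on the same device (CPU or GPU)
    encoded = tokenizer.encode_plus(seq_0, seq_1, return_tensors="pt").to(device)
    with torch.no_grad():
        logits = model(**encoded)[0]
    return logits.argmax(dim=1).tolist()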

Now, let's see what changes in the inference code when we want to do inference with a model that has been compiled for Inferentia.

[ ]:
# %load -s model_fn code/inference_inf1.py
def model_fn(model_dir):
    dir_contents = os.listdir(model_dir)
    model_path = next(filter(lambda item: "model" in item, dir_contents), None)

    tokenizer_init = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    model = torch.jit.load(os.path.join(model_dir, model_path))

    return (model, tokenizer_init)

In this case, within model_fn we first grab the model artifact located in model_dir (the compilation step names the artifact model_neuron.pt, but for flexibility the script simply picks the first file with model in its name). Then, we load the Neuron-compiled model with torch.jit.load.

Other than this change to model_fn, we only need to add an extra import, import torch_neuron, at the beginning of the script, and remove all .to(device) calls, since the Neuron runtime takes care of loading our model onto the NeuronCores of the Inferentia instance. All other functions are unchanged; a hedged sketch of what the resulting predict_fn could look like follows.
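
For illustration only (this is a hypothetical sketch, not a verbatim copy of inference_inf1.py), such a predict_fn could look roughly like this; note the fixed 512-token padding, which must match the shape the model was traced with, and the absence of any .to(device) call:

def predict_fn(input_data, model_and_tokenizer):
    model, tokenizer = model_and_tokenizer
    seq_0, seq_1 = input_data
    # Pad/truncate to the [1, 512] shape the model was traced and compiled with
    encoded = tokenizer.encode_plus(
        seq_0, seq_1, max_length=512, padding="max_length",
        truncation=True, return_tensors="pt",
    )
    # No .to(device): the Neuron runtime places the model on the NeuronCores
    with torch.no_grad():
        logits = model(encoded["input_ids"], encoded["attention_mask"])[0]
    return logits.argmax(dim=1).tolist()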

Deploying default model to GPU-backed endpoint

Now that we understand how we will do inference, we first deploy a normal, uncompiled model to a GPU-backed g4dn instance, which typically offers a great price-performance ratio while still providing GPU acceleration.

Although we pass traced_model_url as the model_data parameter to the PyTorchModel API, as we saw, the inference script pulls the model directly from the HuggingFace Model Hub; this does not affect our benchmark, since model_fn is executed before any request reaches the endpoint. We use PyTorchModel here instead of the HuggingFace-specific (and optimized) HuggingFaceModel (https://sagemaker.readthedocs.io/en/stable/frameworks/huggingface/sagemaker.huggingface.html#hugging-face-model) for the simple reason that the latter is not integrated with SageMaker Neo at the time of writing, and we want a similar, standard setup for deploying both models. That said, you will definitely benefit from the HuggingFace-specific SageMaker APIs if you are working with HuggingFace models and do not need model compilation; a hedged sketch of that path is shown below.
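
For completeness, here is a sketch of that HuggingFace-specific deployment (without compilation); the framework versions and environment variables below are assumptions and must correspond to an available HuggingFace Deep Learning Container:

from sagemaker.huggingface import HuggingFaceModel

# Hypothetical example: deploy distilbert-base-uncased straight from the Hub,
# letting the HuggingFace inference toolkit handle model loading and inference
hub_model = HuggingFaceModel(
    env={"HF_MODEL_ID": "distilbert-base-uncased", "HF_TASK": "text-classification"},
    role=role,
    transformers_version="4.12",
    pytorch_version="1.9",
    py_version="py38",
)
# hub_predictor = hub_model.deploy(initial_instance_count=1, instance_type="ml.g4dn.xlarge")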

Notice that we pass inference_normal.py as our entry point script; the packages defined in the requirements file within our source_dir are automatically installed on the endpoint instance. In this case we only need a version of the transformers library that works on Inferentia instances, v4.15.0; an illustrative requirements file is shown below.
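
Under that assumption, the requirements file inside source_dir can be as simple as the following (illustrative) one-liner:

transformers==4.15.0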

[ ]:
from sagemaker.pytorch.model import PyTorchModel
from sagemaker.predictor import Predictor
from datetime import datetime

prefix = "neuron-experiments/bert-seq-classification"
flavour = "normal"
date_string = datetime.now().strftime("%Y%m-%d%H-%M%S")

normal_sm_model = PyTorchModel(
    model_data=traced_model_url,
    predictor_cls=Predictor,
    framework_version="1.8",
    role=role,
    sagemaker_session=sagemaker_session,
    entry_point="inference_normal.py",
    source_dir="code",
    py_version="py3",
    name=f"{flavour}-distilbert-pt181-{date_string}",
    env={"SAGEMAKER_CONTAINER_LOG_LEVEL": "10"},
)
[ ]:
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

hardware = "g4dn"

normal_predictor = normal_sm_model.deploy(
    instance_type="ml.g4dn.xlarge",
    initial_instance_count=1,
    endpoint_name=f"distilbert-{flavour}-{hardware}-{date_string}",
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)

Let's run a quick test to check that our endpoint responds as expected, using the sequences built earlier in this notebook.

[ ]:
payload = seq_0, seq_1
normal_predictor.predict(payload)

Compiling and deploying model on Inferentia instance

We now create a new PyTorchModel that will use inference_inf1.py as its entry point script. PyTorch version 1.9.1 is the latest that supports Neo compilation to Inferentia, as you can see from the warning in the compilation cell output.

[ ]:
prefix = "neuron-experiments/bert-seq-classification"
flavour = "normal"
date_string = datetime.now().strftime("%Y%m-%d%H-%M%S")

compiled_sm_model = PyTorchModel(
    model_data=traced_model_url,
    predictor_cls=Predictor,
    framework_version="1.9.1",
    role=role,
    sagemaker_session=sagemaker_session,
    entry_point="inference_inf1.py",
    source_dir="code",
    py_version="py3",
    name=f"{flavour}-distilbert-pt191-{date_string}",
    env={"SAGEMAKER_CONTAINER_LOG_LEVEL": "10"},
)

Finally, we are ready to compile the model. Two notes here:

* HuggingFace models should be compiled to dtype int64
* the format of compiler_options differs from the standard Python dict that you can use when compiling for “normal” instance types: for Inferentia, you must provide a JSON string with CLI arguments, which correspond to the ones supported by the Neuron Compiler (see the SageMaker Neo documentation for more about compiler_options)

[ ]:
%%time
import json

hardware = "inf1"
flavour = "compiled-inf"
compilation_job_name = f"distilbert-{flavour}-{hardware}-" + date_string

compiled_inf1_model = compiled_sm_model.compile(
    target_instance_family=f"ml_{hardware}",
    input_shape={"input_ids": [1, 512], "attention_mask": [1, 512]},
    job_name=compilation_job_name,
    role=role,
    framework="pytorch",
    framework_version="1.9.1",
    output_path=f"s3://{sess_bucket}/{prefix}/neo-compilations/{flavour}-model",
    compiler_options=json.dumps("--dtype int64"),
    #     compiler_options={'dtype': 'int64'},    # For compiling to "normal" instance types, cpu or gpu-based
    compile_max_run=900,
)

After successful compilation, we deploy our model to an inf1.xlarge instance.

[ ]:
%%time
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

date_string = datetime.now().strftime("%Y%m-%d%H-%M%S")

compiled_inf1_predictor = compiled_inf1_model.deploy(
    instance_type="ml.inf1.xlarge",
    initial_instance_count=1,
    endpoint_name=f"test-neo-{hardware}-{date_string}",
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)

Again, we test if everything is running smoothly in our endpoint.

[ ]:
# Predict with model endpoint
payload = seq_0, seq_1
compiled_inf1_predictor.predict(payload)

Benchmark and comparison

We will now perform a simple benchmark of both endpoints using Python's threading module. In each benchmark, we start 5 threads, each of which makes 300 requests to the model endpoint. We measure the inference latency of each request, as well as the total time needed to finish the task, which gives us a rough estimate of the request throughput per second.

We first benchmark the uncompiled endpoint.

[ ]:
import threading
import time

num_preds = 300
num_threads = 5

times = []


def predict():
    thread_id = threading.get_ident()
    print(f"Thread {thread_id} started")

    for i in range(num_preds):
        tick = time.time()
        response = normal_predictor.predict(payload)
        tock = time.time()
        times.append((thread_id, tock - tick))


threads = [threading.Thread(target=predict, daemon=False) for _ in range(num_threads)]
for t in threads:
    t.start()

# Wait for threads, get a rough estimate of total time
start = time.time()
for t in threads:
    t.join()
end = time.time() - start
[ ]:
from matplotlib.pyplot import hist, title, show, xlim
import numpy as np

TPS = (num_preds * num_threads) / end

t = [duration for thread_id, duration in times]
latency_percentiles = np.percentile(t, q=[50, 90, 95, 99])

hist(t, bins=100)
title("Request latency histogram on GPU")
xlim(0, 0.2)
show()

print("==== Default HuggingFace model on GPU benchmark ====\n")
print(f"95 % of requests take less than {latency_percentiles[2]*1000} ms")
print(f"Rough request throughput/second is {TPS}")

Default Benchmark

We can see that request latency is concentrated around the 85-90 millisecond range, and throughput is roughly 60 requests per second.

Now, we benchmark our compiled model running on Inferentia.

[ ]:
import threading
import time
import boto3

num_preds = 300
num_threads = 5

times = []


def predict():
    thread_id = threading.get_ident()
    print(f"Thread {thread_id} started")

    for i in range(num_preds):
        tick = time.time()
        response = compiled_inf1_predictor.predict(payload)
        tock = time.time()
        times.append((thread_id, tock - tick))


threads = [threading.Thread(target=predict, daemon=False) for _ in range(num_threads)]
for t in threads:
    t.start()

# Wait for threads, get a rough estimate of total time
start = time.time()
for t in threads:
    t.join()
end = time.time() - start
[ ]:
from matplotlib.pyplot import hist, title, show, savefig, xlim
import numpy as np

TPS = (num_preds * num_threads) / end

t = [duration for thread_id, duration in times]
latency_percentiles = np.percentile(t, q=[50, 90, 95, 99])

hist(t, bins=100)
title("Request latency histogram for Inferentia")
xlim(0, 0.2)
show()

print("==== HuggingFace model compiled for Inferentia benchmark ====\n")
print(f"95 % of requests take less than {latency_percentiles[2]*1000} ms")
print(f"Rough request throughput/second is {TPS}")

Compiled Benchmark

In this case, we can see that latency has dropped to the 25-30 millisecond range (around a 70% decrease), while throughput has increased to roughly 220 requests per second (almost a 400% increase)! 🤯🤯🤯

Best of all, the on-demand price of the Inferentia instance type we used for SageMaker real-time inference (ml.inf1.xlarge) is around 60% lower than that of ml.g4dn.xlarge, which is already the lowest-cost GPU instance option (Ireland region, at the time of writing).
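
To translate these measurements into a rough cost per inference, a back-of-the-envelope calculation could look like the sketch below; the hourly prices are placeholders (check the current SageMaker pricing for your region), while the throughput figures come from the benchmarks above:

# Placeholder on-demand prices in USD/hour -- substitute real values for your region
price_per_hour = {"ml.g4dn.xlarge": 0.80, "ml.inf1.xlarge": 0.30}
# Approximate throughput measured in the benchmarks above (requests/second)
throughput_tps = {"ml.g4dn.xlarge": 60, "ml.inf1.xlarge": 220}

for instance, price in price_per_hour.items():
    cost_per_million = price / (throughput_tps[instance] * 3600) * 1_000_000
    print(f"{instance}: ~${cost_per_million:.2f} per million inferences")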

Conclusions

Compiling the model for Inferentia with SageMaker Neo gave us a substantial latency reduction and a large throughput increase at a lower instance price, with only minor changes to the inference code. To avoid incurring further charges, we finish by deleting the models and endpoints created in this notebook.

[ ]:
normal_predictor.delete_model()
normal_predictor.delete_endpoint()
compiled_inf1_predictor.delete_model()
compiled_inf1_predictor.delete_endpoint()
