Host a Pretrained Model on SageMaker




Amazon SageMaker is a service that accelerates the entire machine learning lifecycle. It includes components for building, training, and deploying machine learning models. Each SageMaker component is modular, so you can use only the features needed for your use case. One of the most popular features of SageMaker is model hosting. Using SageMaker hosting, you can deploy your model as a scalable, highly available, multi-process API endpoint with a few lines of code. Read more at Deploy a Model in Amazon SageMaker. In this notebook, we demonstrate how to host a pretrained BERT model in Amazon SageMaker to extract embeddings from text.

SageMaker provides prebuilt containers that can be used for training, hosting, or data processing. The inference containers include a web serving stack, so you don’t need to install and configure one. We use the SageMaker PyTorch container, but you may use the TensorFlow container, or bring your own container if needed. See all containers at AWS Deep Learning Containers.

This notebook walks you through how to deploy a pretrained Hugging Face model as a scalable, highly available, production-ready API.

Runtime

This notebook takes approximately 5 minutes to run.

Contents

  1. Retrieve Model Artifacts

  2. Write the Inference Script

  3. Package Model

  4. Deploy Model

  5. Get Predictions

  6. Cleanup

  7. Conclusion

Retrieve Model Artifacts

First, we download the model artifacts for the pretrained BERT model. BERT is a popular natural language processing (NLP) model that extracts meaning and context from text. You can read the original paper, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.

[ ]:
!pip install transformers==3.3.1 sagemaker==2.15.0 --quiet
[ ]:
import os
from transformers import BertTokenizer, BertModel

# Download the pretrained model and tokenizer from the Hugging Face hub
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

model_path = "model/"
code_path = "code/"

if not os.path.exists(model_path):
    os.mkdir(model_path)

# Save the artifacts locally so we can package them for SageMaker hosting
model.save_pretrained(save_directory=model_path)
tokenizer.save_pretrained(save_directory=model_path)

Write the Inference Script

Since we are bringing our own model to SageMaker, we must create an inference script. The script runs inside our PyTorch container. Our script must include a function for model loading, and can optionally include functions for generating predictions and for input/output processing. The PyTorch container provides default implementations for generating predictions and for input/output processing; by including these functions in your script, you override the defaults. You can find additional details at Serve a PyTorch Model.

The next cell shows our inference script, which uses the Transformers library from Hugging Face. This library is not installed in the container by default, so we add it in the next section.

[ ]:
!pygmentize code/inference_code.py
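
In case you are reading this outside of the example repository, here is a minimal sketch of what code/inference_code.py could contain. The hook names (model_fn, input_fn, predict_fn) are the ones the SageMaker PyTorch container looks for; the body below is illustrative and not necessarily identical to the script shipped with this notebook.

# code/inference_code.py -- a minimal, illustrative sketch
import os

import torch
from transformers import BertModel, BertTokenizer


def model_fn(model_dir):
    # SageMaker extracts model.tar.gz into model_dir; our archive keeps the
    # saved weights and tokenizer files under a model/ subdirectory.
    artifact_path = os.path.join(model_dir, "model/")
    tokenizer = BertTokenizer.from_pretrained(artifact_path)
    model = BertModel.from_pretrained(artifact_path)
    model.eval()
    return {"model": model, "tokenizer": tokenizer}


def input_fn(request_body, request_content_type):
    # Accept raw text; decode bytes if the serving stack passes them through as-is
    if isinstance(request_body, bytes):
        return request_body.decode("utf-8")
    return request_body


def predict_fn(input_data, model_artifacts):
    tokenizer = model_artifacts["tokenizer"]
    model = model_artifacts["model"]
    encoded_input = tokenizer(input_data, return_tensors="pt")
    with torch.no_grad():
        # The first element of the output tuple is the last hidden state,
        # i.e. one embedding per input token
        embeddings = model(**encoded_input)[0]
    # Nested lists are JSON-serializable by the container's default output handler
    return embeddings.tolist()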

Package Model

For hosting, SageMaker requires that the deployment package be structured in a compatible format. It expects all files to be packaged in a tar archive named model.tar.gz with gzip compression. To install additional libraries at container startup, we can add a requirements.txt file that specifies the libraries to be installed using pip. Read more at Using Third-Party Libraries. Within the archive, the PyTorch container expects all inference code and the requirements.txt file to be inside the code/ directory. See the Model Directory Structure guide for a thorough explanation of the required directory structure.
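
For this example, the requirements.txt file only needs to pull in the Transformers library; pinning it to the version installed earlier is a reasonable (if assumed) choice:

# code/requirements.txt
transformers==3.3.1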

[ ]:
import tarfile

zipped_model_path = os.path.join(model_path, "model.tar.gz")

def exclude_archive(tarinfo):
    # Skip the partially written archive itself, which lives inside model/
    return None if tarinfo.name.endswith("model.tar.gz") else tarinfo

with tarfile.open(zipped_model_path, "w:gz") as tar:
    tar.add(model_path, filter=exclude_archive)
    tar.add(code_path)
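
After this cell runs, the archive should look roughly as follows (the exact file names under model/ depend on the Transformers version):

model.tar.gz
├── model/
│   ├── config.json
│   ├── pytorch_model.bin
│   ├── vocab.txt
│   └── ...            # tokenizer metadata files
└── code/
    ├── inference_code.py
    └── requirements.txt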

Deploy Model

Now that we have our deployment package, we can use the SageMaker Python SDK to deploy our API endpoint with two lines of code. We need to specify an IAM role for the SageMaker endpoint to use. At a minimum, the role needs read access to the model artifacts in S3 and to the ECR repository where AWS stores the container image. When we call deploy(), the SDK saves our deployment archive to the default SageMaker bucket (usually named s3://sagemaker-{region}-{your account ID}) for the endpoint to read. We use the helper function get_execution_role() to retrieve our current IAM role and pass it to the SageMaker endpoint.

You may notice that we specify our PyTorch version and Python version when creating the PyTorchModel object. The SageMaker SDK uses these parameters to determine which PyTorch container to use.

We use an ml.m5.xlarge instance for our endpoint to ensure we have sufficient memory to serve our model.

[ ]:
from sagemaker.pytorch import PyTorchModel
from sagemaker import get_execution_role
import time

endpoint_name = "bert-base-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

model = PyTorchModel(
    entry_point="inference_code.py",
    model_data=zipped_model_path,
    role=get_execution_role(),
    framework_version="1.5",
    py_version="py3",
)

predictor = model.deploy(
    initial_instance_count=1, instance_type="ml.m5.xlarge", endpoint_name=endpoint_name
)

Get Predictions

Now that our API endpoint is deployed, we send it text to get predictions from our BERT model. You can use the SageMaker SDK or the InvokeEndpoint method of the SageMaker Runtime API to invoke the endpoint.

[ ]:
import sagemaker

# Grab the low-level SageMaker Runtime client from our current session
sm = sagemaker.Session().sagemaker_runtime_client

prompt = "The best part of Amazon SageMaker is that it makes machine learning easy."

response = sm.invoke_endpoint(
    EndpointName=endpoint_name, Body=prompt.encode(encoding="UTF-8"), ContentType="text/csv"
)

response["Body"].read()

Cleanup

Delete the model and endpoint to release resources and stop incurring costs.

[ ]:
predictor.delete_model()
predictor.delete_endpoint()

Conclusion

We have successfully created a scalable, highly available, RESTful API that is backed by a BERT model! It can be used for downstream NLP tasks like text classification. If you are interested in learning more, check out some of the more advanced features of SageMaker hosting, such as Monitor models for data and model quality, bias, and explainability to detect concept drift, Automatically Scale Amazon SageMaker Models to dynamically adjust the number of instances, or Give SageMaker Hosted Endpoints Access to Resources in Your Amazon VPC to control network access to and from your endpoint.

You can also read the blog Deploy machine learning models to Amazon SageMaker using the ezsmdeploy Python package and a few lines of code. The ezsmdeploy package automates most of this process.
