Deploying pre-trained PyTorch vision models with Amazon SageMaker Neo on Inf1 Instances
This notebook’s CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.
Neo is a capability of Amazon SageMaker that enables machine learning models to train once and run anywhere in the cloud and at the edge. Inf1 instances are built from the ground up to support machine learning inference applications and feature up to 16 AWS Inferentia chips, high-performance machine learning inference chips designed and built by AWS. This notebook will show you how to deploy a pretrained PyTorch model to an Inf1 instance.
Please use SageMaker Python SDK version 2.11.0 or later in order to compile PyTorch models for Inf1 instances.
[ ]:
import sys
!{sys.executable} -m pip install -qU "sagemaker>=2.11.0"
[ ]:
import sagemaker
print(sagemaker.__version__)
Import ResNet18 from TorchVision
We’ll import the ResNet18 model from TorchVision and create a model artifact model.tar.gz.
[ ]:
import torch
import torchvision.models as models
import tarfile

resnet18 = models.resnet18(pretrained=True)
input_shape = [1, 3, 224, 224]
trace = torch.jit.trace(resnet18.float().eval(), torch.zeros(input_shape).float())
trace.save("model.pth")

with tarfile.open("model.tar.gz", "w:gz") as f:
    f.add("model.pth")
    f.add("resnet18.py")
Compile Model with Default Settings
We will forward the model artifact to the Neo compilation API. In this section, we compile the model for a single NeuronCore, which is the default setting for compilation.
The next section shows how to compile the model for multiple cores using compiler options.
[ ]:
import boto3
import sagemaker
import time
from sagemaker.utils import name_from_base
role = sagemaker.get_execution_role()
sess = sagemaker.Session()
region = sess.boto_region_name
bucket = sess.default_bucket()
compilation_job_name = name_from_base("TorchVision-ResNet18-Neo-Inf1")
model_key = "{}/model/model.tar.gz".format(compilation_job_name)
model_path = "s3://{}/{}".format(bucket, model_key)
boto3.resource("s3").Bucket(bucket).upload_file("model.tar.gz", model_key)
print("Uploaded model to s3:")
print(model_path)
sm_client = boto3.client("sagemaker")
compiled_model_path = "s3://{}/{}/output".format(bucket, compilation_job_name)
print("Output path for compiled model:")
print(compiled_model_path)
We then create a PyTorchModel object with default settings.
[ ]:
from sagemaker.pytorch.model import PyTorchModel
pytorch_model = PyTorchModel(
    model_data=model_path,
    role=role,
    entry_point="resnet18.py",
    framework_version="1.5.1",
    py_version="py3",
)
Deploy the model on an Inf1 instance for real-time inference
After creating the PyTorch model, we compile it with Amazon SageMaker Neo to optimize performance for our desired deployment target. To compile the model for deployment on Inf1 instances, we use the compile() method and select 'ml_inf1' as our deployment target. The compiled model will then be deployed on an endpoint backed by Inf1 instances in Amazon SageMaker.
Compile the model
The input_shape is the definition of the model’s input tensor, and output_path is where the compiled model will be stored in S3. Important: if the following command results in a permission error, scroll up and locate the value of the execution role returned by ``get_execution_role()``. The role must have access to the S3 bucket specified in ``output_path``.
[ ]:
neo_model = pytorch_model.compile(
    target_instance_family="ml_inf1",
    input_shape={"input0": [1, 3, 224, 224]},
    output_path=compiled_model_path,
    framework="pytorch",
    framework_version="1.5.1",
    role=role,
    job_name=compilation_job_name,
)
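If you want to inspect the compilation job itself, the SageMaker boto3 client created earlier can describe it. A minimal sketch, assuming the job was created under the compilation_job_name passed above:
[ ]:
# Optionally inspect the Neo compilation job (assumes the job name used by
# compile() is the compilation_job_name defined above).
desc = sm_client.describe_compilation_job(CompilationJobName=compilation_job_name)
print("Compilation job status:", desc["CompilationJobStatus"])
print("Compiled model artifact:", desc["ModelArtifacts"]["S3ModelArtifacts"])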
Deploy the compiled model on a SageMaker endpoint
Now that we have the compiled model, we will deploy it on an Amazon SageMaker endpoint. Inf1 instances in Amazon SageMaker are available in four sizes: ml.inf1.xlarge, ml.inf1.2xlarge, ml.inf1.6xlarge, and ml.inf1.24xlarge. In this example, we use 'ml.inf1.xlarge' to deploy the model.
[ ]:
predictor = neo_model.deploy(instance_type="ml.inf1.xlarge", initial_instance_count=1)
[ ]:
predictor.endpoint_name
Invoking the endpoint
Once the endpoint is ready, you can send requests to it and receive inference results in real-time with low latency.
Let’s try to send a cat picture.
[ ]:
import json
import numpy as np
sm_runtime = boto3.Session().client("sagemaker-runtime")
with open("cat.jpg", "rb") as f:
payload = f.read()
response = sm_runtime.invoke_endpoint(
EndpointName=predictor.endpoint_name, ContentType="application/x-image", Body=payload
)
print(response)
result = json.loads(response["Body"].read().decode())
print("Most likely class: {}".format(np.argmax(result)))
[ ]:
# Load names for ImageNet classes
object_categories = {}
with open("imagenet1000_clsidx_to_labels.txt", "r") as f:
for line in f:
key, val = line.strip().split(":")
object_categories[key] = val
print(
"Result: label - "
+ object_categories[str(np.argmax(result))]
+ " probability - "
+ str(np.amax(result))
)
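If the entry point returns the model's raw outputs, the value printed above is the largest logit rather than a true probability. To report an actual class probability, a softmax can be applied first; a minimal sketch, assuming result holds unnormalized logits:
[ ]:
# Convert raw logits into class probabilities with a numerically stable
# softmax, then report the top-1 class and its probability.
logits = np.array(result).ravel()
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print("Top-1 class: {}, probability: {:.4f}".format(int(np.argmax(probs)), probs.max()))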
Delete the Endpoint
Having an endpoint running will incur some costs. Therefore, as a clean-up step, we should delete the endpoint.
[ ]:
sess.delete_endpoint(predictor.endpoint_name)
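Deploying the model also created a SageMaker model object behind the scenes. If you want to remove it as well, the Predictor helper can be used; a brief sketch (exact behavior depends on your SDK version):
[ ]:
# Optional extra clean-up: remove the SageMaker model created by deploy().
# (The endpoint itself was already deleted above.)
predictor.delete_model()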
Compile Model for Multiple Cores using Compiler Options
In this section, we will compile the model for two NeuronCores and host it on SageMaker with 2 model server workers, so that all 4 NeuronCores on an ml.inf1.xlarge instance are utilized.
[ ]:
import boto3
import sagemaker
import time
from sagemaker.utils import name_from_base
role = sagemaker.get_execution_role()
sess = sagemaker.Session()
region = sess.boto_region_name
bucket = sess.default_bucket()
compilation_job_name = name_from_base("TorchVision-ResNet18-Neo-Inf1")
model_key = "{}/model/model.tar.gz".format(compilation_job_name)
model_path = "s3://{}/{}".format(bucket, model_key)
boto3.resource("s3").Bucket(bucket).upload_file("model.tar.gz", model_key)
print("Uploaded model to s3:")
print(model_path)
sm_client = boto3.client("sagemaker")
compiled_model_path = "s3://{}/{}/output".format(bucket, compilation_job_name)
print("Output path for compiled model:")
print(compiled_model_path)
In order to host a model compiled for 2 cores, we set the environment variables NEURONCORE_GROUP_SIZES and SAGEMAKER_MODEL_SERVER_WORKERS.
More Information on Environment Variables for Hosting
NEURONCORE_GROUP_SIZES - If the model is compiled for n Inferentia cores, set NEURONCORE_GROUP_SIZES=n. For more information on NEURONCORE_GROUP_SIZES, refer to https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/neuron-frameworks/tensorflow-neuron/tutorials/tutorial-tensorflow-NeuronCore-Group.html
SAGEMAKER_MODEL_SERVER_WORKERS - The number of workers required to utilize all Inferentia cores. For example, on inf1.xlarge or inf1.2xlarge, if the model is compiled for one core, we need 4 workers to utilize all Inferentia cores, each loading the compiled model in a separate process. If the model is compiled for 2 cores, we only need 2 workers to utilize all Inferentia cores.
[ ]:
from sagemaker.pytorch.model import PyTorchModel
pytorch_model = PyTorchModel(
    model_data=model_path,
    role=role,
    entry_point="resnet18.py",
    framework_version="1.5.1",
    py_version="py3",
    env={"NEURONCORE_GROUP_SIZES": "2", "SAGEMAKER_MODEL_SERVER_WORKERS": "2"},
)
Compile the model
Then compile the model, passing compiler options that set the number of NeuronCores to 2.
[ ]:
neo_model = pytorch_model.compile(
    target_instance_family="ml_inf1",
    input_shape={"input0": [1, 3, 224, 224]},
    output_path=compiled_model_path,
    framework="pytorch",
    framework_version="1.5.1",
    role=role,
    job_name=compilation_job_name,
    compiler_options='"--verbose 1 --neuroncore-pipeline-cores 2"',
)
Deploy the compiled model on a SageMaker endpoint
Then we need to deploy the compiled model on an Amazon SageMaker endpoint.
[ ]:
predictor = neo_model.deploy(instance_type="ml.inf1.xlarge", initial_instance_count=1)
[ ]:
predictor.endpoint_name
Invoking the endpoint
Once the endpoint is ready, you can send requests to it and receive inference results in real-time with low latency. We will use the same cat picture.
[ ]:
import json
import numpy as np
sm_runtime = boto3.Session().client("sagemaker-runtime")
with open("cat.jpg", "rb") as f:
payload = f.read()
response = sm_runtime.invoke_endpoint(
EndpointName=predictor.endpoint_name, ContentType="application/x-image", Body=payload
)
print(response)
result = json.loads(response["Body"].read().decode())
print("Most likely class: {}".format(np.argmax(result)))
[ ]:
# Load names for ImageNet classes
object_categories = {}
with open("imagenet1000_clsidx_to_labels.txt", "r") as f:
for line in f:
key, val = line.strip().split(":")
object_categories[key] = val
print(
"Result: label - "
+ object_categories[str(np.argmax(result))]
+ " probability - "
+ str(np.amax(result))
)
Delete the Endpoint
Having an endpoint running will incur some costs. Therefore, as a clean-up step, we should delete the endpoint.
[ ]:
sess.delete_endpoint(predictor.endpoint_name)
Notebook CI Test Results
This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.