Using Amazon Elastic Inference with a Neo-compiled TensorFlow model on SageMaker

This notebook demonstrates how to compile a pre-trained TensorFlow model using Amazon SageMaker Neo and how to deploy the compiled model to a SageMaker Endpoint with Elastic Inference.

Amazon Elastic Inference (EI) allows you to add inference acceleration to an Amazon SageMaker hosted endpoint for a fraction of the cost of using a full GPU instance. Compiling a model with Neo before running it on EI optimizes it for low-latency inference, which increases inference throughput and further reduces costs. For more information, see https://docs.aws.amazon.com/sagemaker/latest/dg/ei.html

This notebook is an adaptation of the "Deploy pre-trained TensorFlow model to SageMaker with Elastic Inference" notebook, modified to show the changes needed to deploy Neo-compiled models on SageMaker with EI.

For this example, we will use the SageMaker Python SDK, which makes it easy to compile and deploy your model on SageMaker.

  1. Set up the environment

  2. Get pre-trained model for compilation

    1. Import ResNet50 model from Keras

    2. Upload model artifact to S3 bucket

  3. Compile model for EI accelerator using Neo

  4. Deploy compiled model to SageMaker Endpoint with EI accelerator attached

  5. Make an inference request to the endpoint

  6. Delete the endpoint

Set up the environment

Let’s start by creating a SageMaker session and specifying:

  • The S3 bucket that you want to use for model data. This should be within the same region as the Notebook Instance, Neo compilation, and SageMaker hosting.

  • The IAM role ARN used to give compilation and hosting access to your data. See the documentation for how to create these. Note: If more than one role is required for notebook instances, compilation, and hosting, replace sagemaker.get_execution_role() with the appropriate full IAM role ARN string(s).

[ ]:
import sagemaker

session = sagemaker.Session()
bucket = session.default_bucket()
role = sagemaker.get_execution_role()
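
If you replace sagemaker.get_execution_role() with a hard-coded string, a quick format check can catch the common mistake of passing a bare role name instead of a full ARN. The sketch below is only an illustration: is_role_arn is a hypothetical helper, not part of the SageMaker SDK, the account ID and role names are made up, and the pattern assumes the usual arn:<partition>:iam::<12-digit account>:role/<path and name> layout. It does not verify that the role exists or has the right permissions.

```python
import re

# Assumed layout of an IAM role ARN: arn:<partition>:iam::<account-id>:role/<path/name>.
# This only checks the string shape; it does not call AWS.
ROLE_ARN_PATTERN = re.compile(r"^arn:aws[a-z-]*:iam::\d{12}:role/[\w+=,.@/-]+$")


def is_role_arn(value):
    """Return True if value looks like a full IAM role ARN (hypothetical helper)."""
    return bool(ROLE_ARN_PATTERN.match(value))


# Made-up examples for illustration only.
print(is_role_arn("arn:aws:iam::123456789012:role/service-role/ExampleSageMakerRole"))
print(is_role_arn("ExampleSageMakerRole"))
```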

Get pre-trained model for compilation

Amazon SageMaker Neo supports compiling TensorFlow models in SavedModel format and frozen graph format for EI accelerators. This example uses a ResNet50 model from Keras, exported in SavedModel format.

Import ResNet50 model from Keras

We import the ResNet50 model from Keras, save it in SavedModel format, and package it as the model artifact model.tar.gz.

[ ]:
import tensorflow as tf
import tarfile
import os

tf.keras.backend.set_image_data_format("channels_last")
pretrained_model = tf.keras.applications.resnet.ResNet50()
saved_model_dir = "1"
tf.saved_model.save(pretrained_model, saved_model_dir)

with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add(saved_model_dir)
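
Before uploading, it can help to confirm that the archive has the layout you expect: a top-level directory named "1" holding the SavedModel files. The following is a minimal sketch that uses only the standard library. It builds a stand-in directory with a placeholder saved_model.pb rather than a real model, so list_archive_members and the placeholder contents are assumptions made for illustration; on the real artifact you would call the helper on "model.tar.gz" directly.

```python
import os
import tarfile
import tempfile


def list_archive_members(archive_path):
    """Return the sorted member names stored in a .tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(tar.getnames())


# Build a stand-in for the SavedModel directory "1" (placeholder files, not a real model).
with tempfile.TemporaryDirectory() as workdir:
    model_dir = os.path.join(workdir, "1")
    os.makedirs(os.path.join(model_dir, "variables"))
    with open(os.path.join(model_dir, "saved_model.pb"), "wb") as f:
        f.write(b"placeholder")

    # Package it the same way as above, keeping "1" as the top-level entry.
    archive_path = os.path.join(workdir, "model.tar.gz")
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(model_dir, arcname="1")

    members = list_archive_members(archive_path)

print(members)
```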

Upload model artifact to S3 bucket

Amazon SageMaker Neo expects a path to the model artifact in Amazon S3, so we upload the model artifact to an S3 bucket before compiling it.

[ ]:
from sagemaker.utils import name_from_base

compilation_job_name = name_from_base("Keras-ResNet50")
input_model_path = session.upload_data(
    path="model.tar.gz", bucket=bucket, key_prefix=compilation_job_name
)
print("S3 path for input model: {}".format(input_model_path))

Compile model for EI accelerator using Neo

Now the model is ready to be compiled by Neo. Note that target_instance_family must be set to ml_eia2 for the model to be optimized for an EI accelerator. If you want to compile your own model for an EI accelerator, refer to the Neo compilation API to provide the proper input_shape and optional compiler_options for your model.

Important: If the following command results in a permission error, scroll up and locate the value of the execution role returned by get_execution_role(). The role must have access to the S3 bucket specified in output_path.

[ ]:
from sagemaker.tensorflow import TensorFlowModel

# Create a TensorFlow SageMaker model
tensorflow_model = TensorFlowModel(model_data=input_model_path, role=role, framework_version="2.3")

# Compile the model for EI accelerator in SageMaker Neo
output_path = "/".join(input_model_path.split("/")[:-1])
tensorflow_model.compile(
    target_instance_family="ml_eia2",
    input_shape={"input_1": [1, 224, 224, 3]},
    output_path=output_path,
    role=role,
    job_name=compilation_job_name,
    framework="tensorflow",
)
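
The output_path expression above drops the file name from the uploaded artifact's S3 URI, so the compiled model is written next to the input under the same job-name prefix. Here is a small sketch of that string manipulation on its own. The bucket name and timestamped prefix are made-up examples, and s3_parent_prefix is a hypothetical helper name, not an SDK function.

```python
def s3_parent_prefix(s3_uri):
    """Drop the last path component of an S3 URI (same expression as in the cell above)."""
    return "/".join(s3_uri.split("/")[:-1])


# Made-up URI for illustration; the real value comes from session.upload_data().
example_uri = "s3://example-bucket/Keras-ResNet50-2024-01-01-00-00-00-000/model.tar.gz"
example_output_path = s3_parent_prefix(example_uri)
print(example_output_path)
```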

Deploy compiled model to SageMaker Endpoint with EI accelerator attached

The same methods are used to deploy a model to a SageMaker Endpoint with EI whether or not the model was compiled by Neo.

The only change required for utilizing EI is to provide an accelerator_type parameter, which determines the type of EI accelerator to be attached to your endpoint. The supported types of accelerators can be found here: https://aws.amazon.com/machine-learning/elastic-inference/pricing/

[ ]:
predictor = tensorflow_model.deploy(
    initial_instance_count=1, instance_type="ml.m5.xlarge", accelerator_type="ml.eia2.large"
)

Make an inference request to the endpoint

Now that the endpoint is deployed with our compiled model and we have a predictor object, we can use it to send an inference request. Note that the first inference call usually takes longer than later ones; this is known as the warm-up inference.

[ ]:
%%time
import numpy as np

random_input = np.random.rand(1, 224, 224, 3)
prediction = predictor.predict({"inputs": random_input.tolist()})
print(prediction)
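
To see the warm-up effect, one option is to time several consecutive calls and compare the first latency with the rest. The sketch below is an assumption-laden illustration: time_calls is a hypothetical helper, and fake_predict is a stand-in that merely sleeps so the example runs without an endpoint. With a live endpoint you could pass something like lambda: predictor.predict({"inputs": random_input.tolist()}) instead.

```python
import time


def time_calls(fn, n_calls=5):
    """Call fn n_calls times and return the wall-clock latency of each call in seconds."""
    latencies = []
    for _ in range(n_calls):
        start = time.perf_counter()
        fn()
        latencies.append(time.perf_counter() - start)
    return latencies


def fake_predict():
    """Stand-in for predictor.predict(...); it only sleeps briefly."""
    time.sleep(0.001)


latencies = time_calls(fake_predict, n_calls=3)
for i, seconds in enumerate(latencies, start=1):
    print("call {}: {:.4f} s".format(i, seconds))
```

The stand-in has no warm-up behaviour, so its latencies are roughly equal; the difference only appears when timing the real endpoint.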

Delete the endpoint

A running endpoint incurs costs, so we delete the endpoint to release its resources once we are done with this example.

[ ]:
session.delete_endpoint(predictor.endpoint_name)