Compile and Deploy an MXNet model on Inf1 instances




Amazon SageMaker now supports Inf1 instances for high-performance and cost-effective inference. Inf1 instances are ideal for large-scale machine learning inference applications such as image recognition, speech recognition, natural language processing, personalization, and fraud detection. In this example, we train a classification model on the MNIST dataset using MXNet, compile it with Amazon SageMaker Neo, deploy the compiled model to a SageMaker endpoint backed by Inf1 instances, and use the Neo Deep Learning Runtime to make real-time, low-latency inferences.

Inf1 instances

Inf1 instances are built from the ground up to support machine learning inference applications. They feature up to 16 AWS Inferentia chips, high-performance machine learning inference chips designed and built by AWS, coupled with custom 2nd-generation Intel® Xeon® Scalable processors and up to 100 Gbps of networking to enable high-throughput inference. With 1 to 16 Inferentia chips per instance, Inf1 instances scale to up to 2,000 tera operations per second (TOPS) and deliver extremely low latency for real-time inference applications. The large on-chip memory of the Inferentia chips allows machine learning models to be cached directly on the chip, eliminating the need to access outside memory resources during inference and enabling low latency without impacting bandwidth.

Set up the environment

We first need to upgrade the SageMaker Python SDK to v2.33.0 or greater and restart the kernel.

[ ]:
!~/anaconda3/envs/mxnet_p36/bin/pip install --upgrade sagemaker
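
After the upgrade, confirm that the kernel picked up the new version; a quick check (restart the kernel first so the upgraded package is the one that gets imported):

[ ]:
import sagemaker

# Should print 2.33.0 or greater after the kernel restart.
print(sagemaker.__version__)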
[ ]:
import sagemaker
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()

# S3 bucket for saving code and model artifacts.
# Feel free to specify a different bucket here if you wish.
bucket = sagemaker_session.default_bucket()

# Location to save your custom code in tar.gz format.
custom_code_upload_location = "s3://{}/customcode/mxnet".format(bucket)

# Location where results of model training are saved.
model_artifacts_location = "s3://{}/artifacts".format(bucket)

# IAM execution role that gives SageMaker access to resources in your AWS account.
# We can use the SageMaker Python SDK to get the role from our notebook environment.
role = get_execution_role()

Construct a script for training and hosting

The mnist.py script provides all the code we need for training and hosting a SageMaker model.

[ ]:
!cat mnist.py
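
The cell above prints the full script. For orientation, the sketch below shows the general shape such a script-mode entry point takes, assuming the standard SageMaker MXNet toolkit conventions; the argument names, file names, and loading code here are illustrative, not copied from mnist.py:

[ ]:
# Illustrative outline only -- see the `!cat mnist.py` output above for the
# real script. Hyperparameters arrive as CLI flags; data and model paths
# arrive through SM_CHANNEL_* / SM_MODEL_DIR environment variables.
import argparse
import os


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--learning-rate", type=float, default=0.1)
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--test", default=os.environ.get("SM_CHANNEL_TEST"))
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR"))
    return parser.parse_args()


def model_fn(model_dir):
    """Hosting hook: load and return the trained network from model_dir.
    The artifact file names below are hypothetical placeholders."""
    import mxnet as mx

    return mx.gluon.nn.SymbolBlock.imports(
        os.path.join(model_dir, "model-symbol.json"),
        ["data"],
        os.path.join(model_dir, "model-0000.params"),
    )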
[ ]:
from sagemaker.mxnet import MXNet

mnist_estimator = MXNet(
    entry_point="mnist.py",
    role=role,
    output_path=model_artifacts_location,
    code_location=custom_code_upload_location,
    instance_count=1,
    instance_type="ml.m4.xlarge",
    framework_version="1.8",
    py_version="py37",
    hyperparameters={"learning-rate": 0.1},
)

The ``fit`` method creates a training job on an ml.m4.xlarge instance. The logs below show the instance performing training and evaluation and incrementing the number of training steps.

At the end of training, the training job generates a saved model for compilation.

[ ]:
%%time
import boto3

region = boto3.Session().region_name
train_data_location = "s3://sagemaker-sample-data-{}/mxnet/mnist/train".format(region)
test_data_location = "s3://sagemaker-sample-data-{}/mxnet/mnist/test".format(region)

mnist_estimator.fit({"train": train_data_location, "test": test_data_location})

Deploy the trained model on an Inf1 instance for real-time inference

Once training is complete, we compile the model with Amazon SageMaker Neo to optimize performance for our desired deployment target. Amazon SageMaker Neo enables you to train machine learning models once and run them anywhere in the cloud and at the edge. To compile the trained model for Inf1 instances, we use the MXNetEstimator.compile_model method and select 'ml_inf1' as our deployment target. The compiled model is then deployed on an endpoint backed by Inf1 instances in Amazon SageMaker.

The input_shape is the definition of the model’s input tensor, and output_path is the S3 location where the compiled model will be stored. Important: if the following command results in a permission error, scroll up and locate the value of the execution role returned by ``get_execution_role()``. The role must have access to the S3 bucket specified in ``output_path``.
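
Before compiling, you can sanity-check that the bucket is reachable from this notebook. This minimal check only verifies basic access with the notebook’s credentials, not every permission the execution role itself needs:

[ ]:
import boto3

# Raises botocore.exceptions.ClientError if the bucket is unreachable
# or access is denied.
boto3.client("s3").head_bucket(Bucket=bucket)
print("Role:", role)
print("Bucket:", bucket)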

[ ]:
output_path = "/".join(mnist_estimator.output_path.split("/")[:-1])

# Pin the estimator's framework version to match the framework_version
# passed to compile_model below.
mnist_estimator.framework_version = "1.5.1"

optimized_estimator = mnist_estimator.compile_model(
    target_instance_family="ml_inf1",
    input_shape={"data": [1, 1, 28, 28]},
    role=role,
    framework="mxnet",
    framework_version="1.5.1",
    output_path=output_path,
)

Now that we have the compiled model, we deploy it on an Amazon SageMaker endpoint. Inf1 instances in Amazon SageMaker are available in four sizes: ml.inf1.xlarge, ml.inf1.2xlarge, ml.inf1.6xlarge, and ml.inf1.24xlarge. In this example, we use 'ml.inf1.xlarge' to deploy the model.

[ ]:
from sagemaker.serializers import NumpySerializer

npy_serializer = NumpySerializer()
optimized_predictor = optimized_estimator.deploy(
    initial_instance_count=1, instance_type="ml.inf1.xlarge", serializer=npy_serializer
)

Once the endpoint is ready, you can send requests to it and receive inference results in real time with low latency.

[ ]:
import numpy as np

# Sample input shipped with this notebook; its shape matches the
# input_shape the model was compiled with ([1, 1, 28, 28]).
numpy_ndarray = np.load("input.npy")
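
If you are running this notebook elsewhere and input.npy is missing, any float32 array matching the compiled input_shape of [1, 1, 28, 28] will exercise the endpoint, though predictions on random noise are of course meaningless:

[ ]:
# Fallback only: fabricate an input with the compiled shape when input.npy
# is not available. Use the real file to get a meaningful prediction.
if "numpy_ndarray" not in globals():
    numpy_ndarray = np.random.rand(1, 1, 28, 28).astype("float32")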
[ ]:
response = optimized_predictor.predict(data=numpy_ndarray)
print("Raw prediction result:")
print(response)

labeled_predictions = list(zip(range(10), response))
print("Labeled predictions: ")
print(labeled_predictions)

labeled_predictions.sort(key=lambda label_and_prob: 1.0 - label_and_prob[1])
print("Most likely answer: {}".format(labeled_predictions[0]))

Delete the endpoint if you no longer need it.

[ ]:
print("Endpoint name: " + optimized_predictor.endpoint_name)
optimized_predictor.delete_endpoint()
