Training and hosting SageMaker Models using the Apache MXNet Module API

The SageMaker Python SDK makes it easy to train and deploy Apache MXNet models. In this example, we train a simple neural network using the Apache MXNet Module API and the MNIST dataset. The MNIST dataset is widely used for handwritten digit classification, and consists of 70,000 labeled 28x28 pixel grayscale images of hand-written digits. The dataset is split into 60,000 training images and 10,000 test images. There are 10 classes (one for each of the 10 digits). The task at hand is to train a model using the 60,000 training images and subsequently test its classification accuracy on the 10,000 test images.

Setup

First, we define a few variables that we will need later in this example.

[1]:
from sagemaker import get_execution_role
from sagemaker.session import Session

# S3 bucket for saving code and model artifacts.
# Feel free to specify a different bucket here if you wish.
bucket = Session().default_bucket()

# Bucket location where your custom code will be saved in the tar.gz format.
custom_code_upload_location = "s3://{}/mxnet-mnist-example/code".format(bucket)

# Bucket location where results of model training are saved.
model_artifacts_location = "s3://{}/mxnet-mnist-example/artifacts".format(bucket)

# IAM execution role that gives SageMaker access to resources in your AWS account.
# We can use the SageMaker Python SDK to get the role from our notebook environment.
role = get_execution_role()

The training script

The mnist.py script provides all the code we need for training and hosting a SageMaker model. It also checkpoints the model at the end of every epoch, saving the model graph, params, and optimizer state to /opt/ml/checkpoints; if that folder does not exist, checkpointing is skipped. The script is adapted from the Apache MXNet MNIST tutorial.

[2]:
!pygmentize mnist.py
import argparse
import gzip
import json
import logging
import os
import struct

import mxnet as mx
import numpy as np


def load_data(path):
    with gzip.open(find_file(path, "labels.gz")) as flbl:
        struct.unpack(">II", flbl.read(8))
        labels = np.frombuffer(flbl.read(), dtype=np.int8)
    with gzip.open(find_file(path, "images.gz")) as fimg:
        _, _, rows, cols = struct.unpack(">IIII", fimg.read(16))
        images = np.frombuffer(fimg.read(), dtype=np.uint8).reshape(len(labels), rows, cols)
        images = images.reshape(images.shape[0], 1, 28, 28).astype(np.float32) / 255
    return labels, images


def find_file(root_path, file_name):
    for root, dirs, files in os.walk(root_path):
        if file_name in files:
            return os.path.join(root, file_name)


def build_graph():
    data = mx.sym.var("data")
    data = mx.sym.flatten(data=data)
    fc1 = mx.sym.FullyConnected(data=data, num_hidden=128)
    act1 = mx.sym.Activation(data=fc1, act_type="relu")
    fc2 = mx.sym.FullyConnected(data=act1, num_hidden=64)
    act2 = mx.sym.Activation(data=fc2, act_type="relu")
    fc3 = mx.sym.FullyConnected(data=act2, num_hidden=10)
    return mx.sym.SoftmaxOutput(data=fc3, name="softmax")


def get_training_context(num_gpus):
    if num_gpus:
        return [mx.gpu(i) for i in range(num_gpus)]
    else:
        return mx.cpu()


def train(
    batch_size,
    epochs,
    learning_rate,
    num_gpus,
    training_channel,
    testing_channel,
    hosts,
    current_host,
    model_dir,
):
    checkpoints_dir = "/opt/ml/checkpoints"
    checkpoints_enabled = os.path.exists(checkpoints_dir)

    (train_labels, train_images) = load_data(training_channel)
    (test_labels, test_images) = load_data(testing_channel)
    # Data parallel training - shard the data so each host
    # only trains on a subset of the total data.
    shard_size = len(train_images) // len(hosts)
    for i, host in enumerate(hosts):
        if host == current_host:
            start = shard_size * i
            end = start + shard_size
            break

    train_iter = mx.io.NDArrayIter(
        train_images[start:end], train_labels[start:end], batch_size, shuffle=True
    )
    val_iter = mx.io.NDArrayIter(test_images, test_labels, batch_size)

    logging.getLogger().setLevel(logging.DEBUG)

    kvstore = "local" if len(hosts) == 1 else "dist_sync"

    mlp_model = mx.mod.Module(symbol=build_graph(), context=get_training_context(num_gpus))

    checkpoint_callback = None
    if checkpoints_enabled:
        # Create a checkpoint callback that checkpoints the model params and
        # the optimizer state at the given path after every epoch.
        checkpoint_callback = mx.callback.module_checkpoint(
            mlp_model, os.path.join(checkpoints_dir, "mnist"), period=1, save_optimizer_states=True
        )
    mlp_model.fit(
        train_iter,
        eval_data=val_iter,
        kvstore=kvstore,
        optimizer="sgd",
        optimizer_params={"learning_rate": learning_rate},
        eval_metric="acc",
        epoch_end_callback=checkpoint_callback,
        batch_end_callback=mx.callback.Speedometer(batch_size, 100),
        num_epoch=epochs,
    )

    if current_host == hosts[0]:
        save(model_dir, mlp_model)


def save(model_dir, model):
    model.symbol.save(os.path.join(model_dir, "model-symbol.json"))
    model.save_params(os.path.join(model_dir, "model-0000.params"))

    signature = [
        {"name": data_desc.name, "shape": [dim for dim in data_desc.shape]}
        for data_desc in model.data_shapes
    ]
    with open(os.path.join(model_dir, "model-shapes.json"), "w") as f:
        json.dump(signature, f)


def parse_args():
    parser = argparse.ArgumentParser()

    parser.add_argument("--batch-size", type=int, default=100)
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--learning-rate", type=float, default=0.1)

    parser.add_argument("--model-dir", type=str, default=os.environ["SM_MODEL_DIR"])
    parser.add_argument("--train", type=str, default=os.environ["SM_CHANNEL_TRAIN"])
    parser.add_argument("--test", type=str, default=os.environ["SM_CHANNEL_TEST"])

    parser.add_argument("--current-host", type=str, default=os.environ["SM_CURRENT_HOST"])
    parser.add_argument("--hosts", type=list, default=json.loads(os.environ["SM_HOSTS"]))

    return parser.parse_args()


### NOTE: this function cannot use MXNet
def neo_preprocess(payload, content_type):
    import io
    import logging

    import numpy as np

    logging.info("Invoking user-defined pre-processing function")

    if content_type != "application/vnd+python.numpy+binary":
        raise RuntimeError("Content type must be application/vnd+python.numpy+binary")

    f = io.BytesIO(payload)
    return np.load(f)


### NOTE: this function cannot use MXNet
def neo_postprocess(result):
    import json
    import logging

    import numpy as np

    logging.info("Invoking user-defined post-processing function")

    # Softmax (assumes batch size 1)
    result = np.squeeze(result)
    result_exp = np.exp(result - np.max(result))
    result = result_exp / np.sum(result_exp)

    response_body = json.dumps(result.tolist())
    content_type = "application/json"

    return response_body, content_type


if __name__ == "__main__":
    args = parse_args()
    num_gpus = int(os.environ["SM_NUM_GPUS"])

    train(
        args.batch_size,
        args.epochs,
        args.learning_rate,
        num_gpus,
        args.train,
        args.test,
        args.hosts,
        args.current_host,
        args.model_dir,
    )
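
If you want to smoke-test the script outside of SageMaker, you can set the environment variables it reads and run it the same way the training container does. This is only a hedged local convenience check, not part of the SageMaker workflow; the paths below are hypothetical, and each channel directory must contain the images.gz and labels.gz files that mnist.py locates via find_file().

[ ]:
# Sketch only: run mnist.py locally for one epoch with the SM_* variables it expects.
import json
import os
import subprocess

env = dict(
    os.environ,
    SM_MODEL_DIR="/tmp/model",            # hypothetical output directory for save()
    SM_CHANNEL_TRAIN="/tmp/data/train",   # hypothetical local copies of the channels
    SM_CHANNEL_TEST="/tmp/data/test",
    SM_CURRENT_HOST="algo-1",
    SM_HOSTS=json.dumps(["algo-1"]),
    SM_NUM_GPUS="0",
)
os.makedirs("/tmp/model", exist_ok=True)
subprocess.run(["python", "mnist.py", "--epochs", "1"], env=env, check=True)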

SageMaker’s MXNet estimator class

The SageMaker MXNet estimator allows us to run single machine or distributed training in SageMaker, using CPU or GPU-based instances.

When we create the estimator, we pass in the filename of our training script, the name of our IAM execution role, and the S3 locations we defined in the setup section. We also provide a few other parameters: instance_count and instance_type determine the number and type of SageMaker instances that will be used for the training job, and the hyperparameters parameter is a dict of values that is passed to your training script as command-line arguments – you can see how these values are accessed in the mnist.py script above.

For this example, we will choose one ml.m4.xlarge instance.

[3]:
from sagemaker.mxnet import MXNet

mnist_estimator = MXNet(
    entry_point="mnist.py",
    role=role,
    output_path=model_artifacts_location,
    code_location=custom_code_upload_location,
    instance_count=1,
    instance_type="ml.m4.xlarge",
    framework_version="1.4.1",
    py_version="py3",
    #distribution={"parameter_server": {"enabled": True}},
    hyperparameters={"learning-rate": 0.1},
)
distributions has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
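
The estimator above trains on a single instance. As a hedged sketch, if you wanted distributed training with MXNet's parameter server instead, you could pass a distribution configuration and more than one instance; mnist.py already shards the training data across hosts and switches the kvstore to "dist_sync" when there is more than one host.

[ ]:
# Sketch only: a distributed variant of the estimator above (not executed in this notebook).
distributed_estimator = MXNet(
    entry_point="mnist.py",
    role=role,
    output_path=model_artifacts_location,
    code_location=custom_code_upload_location,
    instance_count=2,  # two hosts; mnist.py shards the data and uses a "dist_sync" kvstore
    instance_type="ml.m4.xlarge",
    framework_version="1.4.1",
    py_version="py3",
    distribution={"parameter_server": {"enabled": True}},
    hyperparameters={"learning-rate": 0.1},
)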

Running the Training Job

After we’ve constructed our MXNet estimator object, we can fit it using data stored in S3. Below we run SageMaker training on two input channels: train and test.

During training, SageMaker makes the data stored in S3 available on the local filesystem of the training instance, and the mnist.py script simply loads the train and test data from disk.
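
This example uses the public SageMaker sample-data bucket (see the cell below). If you would rather train on your own copy of the data, a hedged sketch of uploading it to the bucket defined earlier follows; the local paths are hypothetical, and each channel directory just needs the images.gz and labels.gz files that mnist.py looks for.

[ ]:
# Sketch only: upload local copies of the MNIST channels to S3 (paths are hypothetical).
from sagemaker.session import Session

session = Session()
my_train_data = session.upload_data(
    path="data/train", bucket=bucket, key_prefix="mxnet-mnist-example/data/train"
)
my_test_data = session.upload_data(
    path="data/test", bucket=bucket, key_prefix="mxnet-mnist-example/data/test"
)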

[4]:
%%time
import boto3

region = boto3.Session().region_name
train_data_location = "s3://sagemaker-sample-data-{}/mxnet/mnist/train".format(region)
test_data_location = "s3://sagemaker-sample-data-{}/mxnet/mnist/test".format(region)

mnist_estimator.fit({"train": train_data_location, "test": test_data_location})
2021-06-09 21:55:39 Starting - Starting the training job...
2021-06-09 21:55:41 Starting - Launching requested ML instances......
2021-06-09 21:56:50 Starting - Preparing the instances for training......
2021-06-09 21:57:59 Downloading - Downloading input data...
2021-06-09 21:58:24 Training - Downloading the training image..2021-06-09 21:58:47,583 sagemaker-containers INFO     Imported framework sagemaker_mxnet_container.training
2021-06-09 21:58:47,587 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2021-06-09 21:58:47,603 sagemaker_mxnet_container.training INFO     MXNet training environment: {'SM_HOSTS': '["algo-1"]', 'SM_NETWORK_INTERFACE_NAME': 'eth0', 'SM_HPS': '{"learning-rate":0.1}', 'SM_USER_ENTRY_POINT': 'mnist.py', 'SM_FRAMEWORK_PARAMS': '{"sagemaker_parameter_server_enabled":true}', 'SM_RESOURCE_CONFIG': '{"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"}', 'SM_INPUT_DATA_CONFIG': '{"test":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"},"train":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}}', 'SM_OUTPUT_DATA_DIR': '/opt/ml/output/data', 'SM_CHANNELS': '["test","train"]', 'SM_CURRENT_HOST': 'algo-1', 'SM_MODULE_NAME': 'mnist', 'SM_LOG_LEVEL': '20', 'SM_FRAMEWORK_MODULE': 'sagemaker_mxnet_container.training:main', 'SM_INPUT_DIR': '/opt/ml/input', 'SM_INPUT_CONFIG_DIR': '/opt/ml/input/config', 'SM_OUTPUT_DIR': '/opt/ml/output', 'SM_NUM_CPUS': '4', 'SM_NUM_GPUS': '0', 'SM_MODEL_DIR': '/opt/ml/model', 'SM_MODULE_DIR': 's3://sagemaker-us-west-2-688520471316/mxnet-mnist-example/code/mxnet-training-2021-06-09-21-55-39-493/source/sourcedir.tar.gz', 'SM_TRAINING_ENV': '{"additional_framework_parameters":{"sagemaker_parameter_server_enabled":true},"channel_input_dirs":{"test":"/opt/ml/input/data/test","train":"/opt/ml/input/data/train"},"current_host":"algo-1","framework_module":"sagemaker_mxnet_container.training:main","hosts":["algo-1"],"hyperparameters":{"learning-rate":0.1},"input_config_dir":"/opt/ml/input/config","input_data_config":{"test":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"},"train":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}},"input_dir":"/opt/ml/input","is_master":true,"job_name":"mxnet-training-2021-06-09-21-55-39-493","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-us-west-2-688520471316/mxnet-mnist-example/code/mxnet-training-2021-06-09-21-55-39-493/source/sourcedir.tar.gz","module_name":"mnist","network_interface_name":"eth0","num_cpus":4,"num_gpus":0,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"},"user_entry_point":"mnist.py"}', 'SM_USER_ARGS': '["--learning-rate","0.1"]', 'SM_OUTPUT_INTERMEDIATE_DIR': '/opt/ml/output/intermediate', 'SM_CHANNEL_TEST': '/opt/ml/input/data/test', 'SM_CHANNEL_TRAIN': '/opt/ml/input/data/train', 'SM_HP_LEARNING-RATE': '0.1'}
2021-06-09 21:58:49,033 sagemaker_mxnet_container.training INFO     Starting distributed training task

2021-06-09 21:58:46 Training - Training image download completed. Training in progress.2021-06-09 21:59:50,941 sagemaker-containers INFO     Module mnist does not provide a setup.py. 
Generating setup.py
2021-06-09 21:59:50,942 sagemaker-containers INFO     Generating setup.cfg
2021-06-09 21:59:50,942 sagemaker-containers INFO     Generating MANIFEST.in
2021-06-09 21:59:50,942 sagemaker-containers INFO     Installing module with the following command:
/usr/local/bin/python3.6 -m pip install -U . 
Processing /opt/ml/code
Installing collected packages: mnist
  Running setup.py install for mnist: started
    Running setup.py install for mnist: finished with status 'done'
Successfully installed mnist-1.0.0
WARNING: You are using pip version 19.1.1, however version 21.1.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
2021-06-09 21:59:52,869 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2021-06-09 21:59:52,887 sagemaker-containers INFO     Invoking user script

Training Env:

{
    "additional_framework_parameters": {
        "sagemaker_parameter_server_enabled": true
    },
    "channel_input_dirs": {
        "test": "/opt/ml/input/data/test",
        "train": "/opt/ml/input/data/train"
    },
    "current_host": "algo-1",
    "framework_module": "sagemaker_mxnet_container.training:main",
    "hosts": [
        "algo-1"
    ],
    "hyperparameters": {
        "learning-rate": 0.1
    },
    "input_config_dir": "/opt/ml/input/config",
    "input_data_config": {
        "test": {
            "TrainingInputMode": "File",
            "S3DistributionType": "FullyReplicated",
            "RecordWrapperType": "None"
        },
        "train": {
            "TrainingInputMode": "File",
            "S3DistributionType": "FullyReplicated",
            "RecordWrapperType": "None"
        }
    },
    "input_dir": "/opt/ml/input",
    "is_master": true,
    "job_name": "mxnet-training-2021-06-09-21-55-39-493",
    "log_level": 20,
    "master_hostname": "algo-1",
    "model_dir": "/opt/ml/model",
    "module_dir": "s3://sagemaker-us-west-2-688520471316/mxnet-mnist-example/code/mxnet-training-2021-06-09-21-55-39-493/source/sourcedir.tar.gz",
    "module_name": "mnist",
    "network_interface_name": "eth0",
    "num_cpus": 4,
    "num_gpus": 0,
    "output_data_dir": "/opt/ml/output/data",
    "output_dir": "/opt/ml/output",
    "output_intermediate_dir": "/opt/ml/output/intermediate",
    "resource_config": {
        "current_host": "algo-1",
        "hosts": [
            "algo-1"
        ],
        "network_interface_name": "eth0"
    },
    "user_entry_point": "mnist.py"
}

Environment variables:

SM_HOSTS=["algo-1"]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={"learning-rate":0.1}
SM_USER_ENTRY_POINT=mnist.py
SM_FRAMEWORK_PARAMS={"sagemaker_parameter_server_enabled":true}
SM_RESOURCE_CONFIG={"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"}
SM_INPUT_DATA_CONFIG={"test":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"},"train":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=["test","train"]
SM_CURRENT_HOST=algo-1
SM_MODULE_NAME=mnist
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=sagemaker_mxnet_container.training:main
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=4
SM_NUM_GPUS=0
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://sagemaker-us-west-2-688520471316/mxnet-mnist-example/code/mxnet-training-2021-06-09-21-55-39-493/source/sourcedir.tar.gz
SM_TRAINING_ENV={"additional_framework_parameters":{"sagemaker_parameter_server_enabled":true},"channel_input_dirs":{"test":"/opt/ml/input/data/test","train":"/opt/ml/input/data/train"},"current_host":"algo-1","framework_module":"sagemaker_mxnet_container.training:main","hosts":["algo-1"],"hyperparameters":{"learning-rate":0.1},"input_config_dir":"/opt/ml/input/config","input_data_config":{"test":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"},"train":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}},"input_dir":"/opt/ml/input","is_master":true,"job_name":"mxnet-training-2021-06-09-21-55-39-493","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-us-west-2-688520471316/mxnet-mnist-example/code/mxnet-training-2021-06-09-21-55-39-493/source/sourcedir.tar.gz","module_name":"mnist","network_interface_name":"eth0","num_cpus":4,"num_gpus":0,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"},"user_entry_point":"mnist.py"}
SM_USER_ARGS=["--learning-rate","0.1"]
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
SM_CHANNEL_TEST=/opt/ml/input/data/test
SM_CHANNEL_TRAIN=/opt/ml/input/data/train
SM_HP_LEARNING-RATE=0.1
PYTHONPATH=/usr/local/bin:/usr/local/lib/python36.zip:/usr/local/lib/python3.6:/usr/local/lib/python3.6/lib-dynload:/usr/local/lib/python3.6/site-packages

Invoking script with the following command:

/usr/local/bin/python3.6 -m mnist --learning-rate 0.1


INFO:root:Epoch[0] Batch [0-100]#011Speed: 47287.58 samples/sec#011accuracy=0.110990
INFO:root:Epoch[0] Batch [100-200]#011Speed: 54047.38 samples/sec#011accuracy=0.110000
INFO:root:Epoch[0] Batch [200-300]#011Speed: 55089.63 samples/sec#011accuracy=0.110500
INFO:root:Epoch[0] Batch [300-400]#011Speed: 51203.88 samples/sec#011accuracy=0.111900
INFO:root:Epoch[0] Batch [400-500]#011Speed: 53620.52 samples/sec#011accuracy=0.114500
INFO:root:Epoch[0] Train-accuracy=0.131200
INFO:root:Epoch[0] Time cost=1.200
INFO:root:Epoch[0] Validation-accuracy=0.363200
INFO:root:Epoch[1] Batch [0-100]#011Speed: 46955.54 samples/sec#011accuracy=0.474455
INFO:root:Epoch[1] Batch [100-200]#011Speed: 48841.06 samples/sec#011accuracy=0.662100
INFO:root:Epoch[1] Batch [200-300]#011Speed: 54751.28 samples/sec#011accuracy=0.766200
INFO:root:Epoch[1] Batch [300-400]#011Speed: 51805.90 samples/sec#011accuracy=0.800000
INFO:root:Epoch[1] Batch [400-500]#011Speed: 53854.95 samples/sec#011accuracy=0.828500
INFO:root:Epoch[1] Train-accuracy=0.728033
INFO:root:Epoch[1] Time cost=1.141
INFO:root:Epoch[1] Validation-accuracy=0.841800
INFO:root:Epoch[2] Batch [0-100]#011Speed: 33069.71 samples/sec#011accuracy=0.853960
INFO:root:Epoch[2] Batch [100-200]#011Speed: 39941.72 samples/sec#011accuracy=0.876500
INFO:root:Epoch[2] Batch [200-300]#011Speed: 39199.43 samples/sec#011accuracy=0.885800
INFO:root:Epoch[2] Batch [300-400]#011Speed: 51825.04 samples/sec#011accuracy=0.896700
INFO:root:Epoch[2] Batch [400-500]#011Speed: 54412.47 samples/sec#011accuracy=0.904300
INFO:root:Epoch[2] Train-accuracy=0.887317
INFO:root:Epoch[2] Time cost=1.366
INFO:root:Epoch[2] Validation-accuracy=0.910200
INFO:root:Epoch[3] Batch [0-100]#011Speed: 44400.05 samples/sec#011accuracy=0.920594
INFO:root:Epoch[3] Batch [100-200]#011Speed: 53323.57 samples/sec#011accuracy=0.926800
INFO:root:Epoch[3] Batch [200-300]#011Speed: 58114.34 samples/sec#011accuracy=0.930900
INFO:root:Epoch[3] Batch [300-400]#011Speed: 58136.58 samples/sec#011accuracy=0.928700
INFO:root:Epoch[3] Batch [400-500]#011Speed: 53841.67 samples/sec#011accuracy=0.933100
INFO:root:Epoch[3] Train-accuracy=0.929700
INFO:root:Epoch[3] Time cost=1.247
INFO:root:Epoch[3] Validation-accuracy=0.938600
INFO:root:Epoch[4] Batch [0-100]#011Speed: 48411.49 samples/sec#011accuracy=0.943168
INFO:root:Epoch[4] Batch [100-200]#011Speed: 53567.72 samples/sec#011accuracy=0.944800
INFO:root:Epoch[4] Batch [200-300]#011Speed: 50246.77 samples/sec#011accuracy=0.941500
INFO:root:Epoch[4] Batch [300-400]#011Speed: 44958.83 samples/sec#011accuracy=0.947900
INFO:root:Epoch[4] Batch [400-500]#011Speed: 41663.27 samples/sec#011accuracy=0.948200
INFO:root:Epoch[4] Train-accuracy=0.946000
INFO:root:Epoch[4] Time cost=1.477
INFO:root:Epoch[4] Validation-accuracy=0.943700
INFO:root:Epoch[5] Batch [0-100]#011Speed: 42967.82 samples/sec#011accuracy=0.951980
INFO:root:Epoch[5] Batch [100-200]#011Speed: 55124.16 samples/sec#011accuracy=0.958500
INFO:root:Epoch[5] Batch [200-300]#011Speed: 41605.54 samples/sec#011accuracy=0.957700
INFO:root:Epoch[5] Batch [300-400]#011Speed: 39872.05 samples/sec#011accuracy=0.958000
INFO:root:Epoch[5] Batch [400-500]#011Speed: 40017.75 samples/sec#011accuracy=0.957300
INFO:root:Epoch[5] Train-accuracy=0.957117
INFO:root:Epoch[5] Time cost=1.405
INFO:root:Epoch[5] Validation-accuracy=0.959900
INFO:root:Epoch[6] Batch [0-100]#011Speed: 26208.98 samples/sec#011accuracy=0.963168
INFO:root:Epoch[6] Batch [100-200]#011Speed: 37885.71 samples/sec#011accuracy=0.960500
INFO:root:Epoch[6] Batch [200-300]#011Speed: 45689.64 samples/sec#011accuracy=0.963600
INFO:root:Epoch[6] Batch [300-400]#011Speed: 56166.17 samples/sec#011accuracy=0.967000
INFO:root:Epoch[6] Batch [400-500]#011Speed: 51141.88 samples/sec#011accuracy=0.966300
INFO:root:Epoch[6] Train-accuracy=0.964417
INFO:root:Epoch[6] Time cost=1.438
INFO:root:Epoch[6] Validation-accuracy=0.961900
INFO:root:Epoch[7] Batch [0-100]#011Speed: 47852.92 samples/sec#011accuracy=0.970693
INFO:root:Epoch[7] Batch [100-200]#011Speed: 53753.14 samples/sec#011accuracy=0.966600
INFO:root:Epoch[7] Batch [200-300]#011Speed: 52497.57 samples/sec#011accuracy=0.968000
INFO:root:Epoch[7] Batch [300-400]#011Speed: 42965.93 samples/sec#011accuracy=0.966300
INFO:root:Epoch[7] Batch [400-500]#011Speed: 53491.00 samples/sec#011accuracy=0.970200
INFO:root:Epoch[7] Train-accuracy=0.968967
INFO:root:Epoch[7] Time cost=1.200
INFO:root:Epoch[7] Validation-accuracy=0.964300
INFO:root:Epoch[8] Batch [0-100]#011Speed: 39102.81 samples/sec#011accuracy=0.975347
INFO:root:Epoch[8] Batch [100-200]#011Speed: 49170.00 samples/sec#011accuracy=0.969900
INFO:root:Epoch[8] Batch [200-300]#011Speed: 58197.32 samples/sec#011accuracy=0.973700
INFO:root:Epoch[8] Batch [300-400]#011Speed: 49250.02 samples/sec#011accuracy=0.970600
INFO:root:Epoch[8] Batch [400-500]#011Speed: 47072.32 samples/sec#011accuracy=0.972700
INFO:root:Epoch[8] Train-accuracy=0.972783
INFO:root:Epoch[8] Time cost=1.240
INFO:root:Epoch[8] Validation-accuracy=0.968300
INFO:root:Epoch[9] Batch [0-100]#011Speed: 44996.50 samples/sec#011accuracy=0.978812
INFO:root:Epoch[9] Batch [100-200]#011Speed: 53172.89 samples/sec#011accuracy=0.972600
INFO:root:Epoch[9] Batch [200-300]#011Speed: 58175.85 samples/sec#011accuracy=0.976500
INFO:root:Epoch[9] Batch [300-400]#011Speed: 58447.10 samples/sec#011accuracy=0.975300
INFO:root:Epoch[9] Batch [400-500]#011Speed: 58023.90 samples/sec#011accuracy=0.976400
INFO:root:Epoch[9] Train-accuracy=0.975617
INFO:root:Epoch[9] Time cost=1.139
INFO:root:Epoch[9] Validation-accuracy=0.965500
2021-06-09 22:00:12,939 sagemaker-containers INFO     Reporting training SUCCESS

2021-06-09 22:00:21 Uploading - Uploading generated training model
2021-06-09 22:00:21 Completed - Training job completed
Training seconds: 142
Billable seconds: 142
CPU times: user 599 ms, sys: 40.2 ms, total: 640 ms
Wall time: 5min 14s
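
Once the job completes, the trained model artifact (model.tar.gz) is uploaded under the output_path we configured earlier. You can locate it directly from the estimator:

[ ]:
# S3 URI of the model artifact produced by the training job above.
print(mnist_estimator.latest_training_job.name)
print(mnist_estimator.model_data)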

Optimize your model with the Neo API

The Neo API allows us to optimize our model for a specific hardware type. When calling the compile_model() function, we specify the target instance family (M4, matching the ml.m4.xlarge instance we deploy to below) as well as the S3 location where the compiled model will be stored.

Important: If the following command results in a permission error, scroll up and locate the value of the execution role returned by get_execution_role(). The role must have access to the S3 bucket specified in output_path.

[9]:
output_path = "/".join(mnist_estimator.output_path.split("/")[:-1])
neo_optimize = True
compiled_model = mnist_estimator.compile_model(
    target_instance_family="ml_m4",
    input_shape={"data": [1, 784], "softmax_label": [1]},
    role=role,
    output_path=output_path,
    framework="mxnet",
    framework_version="1.8.0",
)
?..................................................!
Defaulting to the only supported framework/algorithm version: 1.7. Ignoring framework/algorithm version: 1.8.0.

Creating an inference Endpoint

After training, we deploy the compiled model to build a predictor. This creates a SageMaker Endpoint – a hosted prediction service that we can use to perform inference.

The arguments to the deploy function allow us to set the number and type of instances that will be used for the Endpoint. These do not need to be the same as the values we used for the training job. For example, you can train a model on a set of GPU-based instances, and then deploy the Endpoint to a fleet of CPU-based instances. Here we will deploy the model to a single ml.m4.xlarge instance.

[10]:
import io
import numpy as np


def numpy_bytes_serializer(data):
    f = io.BytesIO()
    np.save(f, data)
    f.seek(0)
    return f.read()

serializer = None
if neo_optimize is True:
    serializer = numpy_bytes_serializer

predictor = compiled_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m4.xlarge",
    serializer=serializer,
)
------!
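
With the endpoint in service, we can send it a test input. A hedged sketch follows, assuming requests are routed through the neo_preprocess()/neo_postprocess() handlers defined in mnist.py, which expect a serialized NumPy array with the content type shown below; here we send one random flattened 28x28 "image", but in practice you would send real test images.

[ ]:
# Sketch only: invoke the endpoint through the SageMaker runtime API.
import boto3
import numpy as np

runtime = boto3.client("sagemaker-runtime")
sample = np.random.rand(1, 784).astype(np.float32)  # hypothetical input image

response = runtime.invoke_endpoint(
    EndpointName=predictor.endpoint_name,
    ContentType="application/vnd+python.numpy+binary",  # checked by neo_preprocess()
    Body=numpy_bytes_serializer(sample),
)
print(response["Body"].read())  # neo_postprocess() returns JSON class probabilities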

After you have finished with this example, remember to delete the prediction endpoint to release the instance(s) associated with it.

[13]:
print("Endpoint name: " + predictor.endpoint_name)
Endpoint name: mxnet-training-ml-m4-2021-06-09-23-15-20-517
[16]:
predictor.delete_endpoint()