Evaluating ML models from AWS Marketplace for person counting use case


It is important to understand the bias as well as the limitations of an ML model before using it in production. To understand the ML model behavior, you need to perform a deep evaluation activity, which includes analyzing different metrics and understanding the ML model performance under different edge conditions.

Note: This notebook needs to be run on an ml.c5.xlarge instance with at least 50 GB of disk space.

This sample notebook shows how an ML model can be evaluated for a specific use case. Here are the use case and the set of ML models that you will use in this notebook:

  * Use case - Person counting.

  * Requirement - An aerial camera needs to count people at a location to ensure that the location is not overcrowded.

  * Goal - Understand the performance of the ML model.

  * ML model(s) to be evaluated:

    - GluonCV YOLOv3 Object Detector

    - GluonCV SSD Object Detector

Disclaimer: GluonCV models are open-source and can be used outside of AWS Marketplace.

Why the person counting use case? In computer vision, person counting is becoming prevalent. Real-world examples include security camera footage analysis in high-security locations such as airports, warehouses, and construction sites; aerial footage of large public events; and monitoring systems that raise an alarm whenever a location exceeds its allotted capacity.

Table of Contents:

  1. Pre-requisites

  2. Subscribe to the models

  3. Download and Analyze dataset

  4. Analyzing annotations

  5. Deploy SageMaker endpoints

  6. Perform inferences

  7. ML Model evaluation

    1. Generic data evaluation

    2. Few people in the frame evaluation

    3. Medium crowd evaluation

    4. Edge cases

      1. Large crowd

      2. People wearing costumes

      3. People with pets

      4. People wearing masks

      5. People far away from camera

      6. People facing away from camera

  8. ML model evaluation summary

  9. Cleanup


[ ]:
%matplotlib inline

import json
import sagemaker
from src.model_package_arns import ModelPackageArnProvider
from sagemaker import get_execution_role
from sagemaker import ModelPackage
import boto3
import matplotlib.pyplot as plt
from matplotlib import patches
from PIL import Image
import numpy as np
from io import BytesIO
import pandas as pd
from urllib.parse import urlparse

# The session remembers our connection parameters to Amazon SageMaker. We'll use it to perform all of our Amazon SageMaker operations.
role = get_execution_role()
sagemaker_session = sagemaker.Session()
runtime = sagemaker_session.boto_session.client("sagemaker-runtime")
s3_client = sagemaker_session.boto_session.client("s3")

Subscribe to the models

Before you can deploy the model, your account needs to be subscribed to it. This section covers instructions for populating necessary parameters and for subscribing to the model package, if the subscription does not already exist.

  1. Open the Model Package listing page for the two model packages we will be using:

  2. Read the Product Overview section and Highlights section of the listing to understand the value proposition of the model package.

  3. View the Usage Information and then the Additional Resources sections. These sections contain the following:

    1. Input content-type

    2. Sample input file (optional)

    3. Sample Jupyter notebook

    4. Output format

    5. Any additional information.

  4. Click Continue to Subscribe to read the End User License Agreement (EULA), and click Accept offer if you agree.

  5. Click Continue to configuration. Once you choose a region, you will see a Product ARN displayed. This is the model package ARN that you need to specify while creating a deployable model using Boto3. However, for this notebook, the model ARNs have been specified in the src/model_package_arns.py file, and you need not specify them explicitly. The configuration page also provides a “View in Amazon SageMaker” button to navigate to Amazon SageMaker and deploy via the Amazon SageMaker Console.

[ ]:
# Both models recommend the same instance type and content type, so a single variable for each is enough:
instance_type = "ml.m4.xlarge"
content_type = "image/jpeg"

# Type in the region selected during the model configurations
region = "us-west-2"

# For some inferences we will be using Batch Transform, which requires you to assign an S3 bucket for input storage.
# Create a bucket first, and then provide the name below.
bucket = "mlmp-person-counting-bucket-349872034"

Congratulations! You have identified the necessary information to create an endpoint for performing real-time inference.

Download and Analyze dataset

In this notebook, we will use the public COCO dataset to measure the performance.

COCO is a large image dataset designed for object detection, segmentation, person keypoint detection, stuff segmentation, and caption generation. The COCO API package provides Matlab, Python, and Lua APIs that assist in loading, parsing, and visualizing the annotations in COCO. Please visit http://cocodataset.org/ for more information on COCO, including the data, paper, and tutorials. The exact format of the annotations is also described on the COCO website.

We will use the 2017 Train images. The following cells will download, unzip, and place the images in coco/images/ location and annotations in coco/annotations/ location.

Note - The following cell downloads several gigabytes of data and can take up to 20 minutes to complete.

[ ]:
wget -q -c http://images.cocodataset.org/zips/train2017.zip -P coco/images/
wget -q http://images.cocodataset.org/annotations/annotations_trainval2017.zip -P coco/annotations/
wget -q http://images.cocodataset.org/annotations/stuff_annotations_trainval2017.zip -P coco/annotations/

unzip -q coco/images/train2017.zip
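# Both archives contain a top-level annotations/ folder; extract into coco/ so the files land in coco/annotations/
unzip -q coco/annotations/annotations_trainval2017.zip -d coco/
unzip -q coco/annotations/stuff_annotations_trainval2017.zip -d coco/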

rm coco/images/train2017.zip
mv train2017 coco/images/

Analyze data and annotations

Now that you have downloaded the dataset containing people, you can load the annotations corresponding to the images.

[ ]:
# Load metadata json into metadata variable.
with open("./coco/annotations/instances_train2017.json", "r") as f:
    metadata = json.load(f)

# Each annotation describes one labeled region of an image and the category assigned to it
annotations = metadata["annotations"]

# images contain metadata about an image (url, size, file name, etc.)
images = metadata["images"]

dataset_and_label = {}

# Aggregate the count of people found in a picture
for annotation in annotations:
    image_id = annotation["image_id"]
    if image_id not in dataset_and_label:
        dataset_and_label[image_id] = 0
    # category_id == 1 marks a person; each such annotation is one person in the picture.
    if annotation["category_id"] == 1:
        dataset_and_label[image_id] += 1

# image_ids contains the list of image ids
image_ids = list(dataset_and_label.keys())
# counts contains the information about number of people found in an image.
counts = list(dataset_and_label.values())

image_directory = "./coco/images/train2017/"

df = pd.DataFrame(list(zip(image_ids, counts)), columns=["image_id", "people_count"])
df["file_paths"] = image_directory + df["image_id"].astype("str").str.zfill(12) + ".jpg"
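
The aggregation above can be tried on a few hand-written records before loading the full annotation file. The following is a minimal sketch, not part of the original notebook: `toy_annotations` and `count_people_per_image` are hypothetical names, and the records only mimic the two fields that the loop reads (`image_id` and `category_id`).

```python
from collections import Counter

# Toy stand-in for metadata["annotations"]; only the two fields used above are present.
toy_annotations = [
    {"image_id": 9, "category_id": 1},
    {"image_id": 9, "category_id": 1},
    {"image_id": 9, "category_id": 18},
    {"image_id": 25, "category_id": 25},
]


def count_people_per_image(annotations, person_category_id=1):
    """Return {image_id: person_count}, keeping images that contain no people."""
    counts = Counter()
    for annotation in annotations:
        # Touch the key so an image without people still appears with a count of 0.
        counts[annotation["image_id"]] += 0
        if annotation["category_id"] == person_category_id:
            counts[annotation["image_id"]] += 1
    return dict(counts)


people_per_image = count_people_per_image(toy_annotations)
print(people_per_image)  # {9: 2, 25: 0}

# File names are the image id left-padded with zeros to 12 digits.
print(str(9).zfill(12) + ".jpg")  # 000000000009.jpg
```

Keeping the zero-count images matters for the later evaluation, because a model that reports a person in an empty scene should count as an error.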
[ ]:
# Let's see some sample rows from the dataframe
df.head()
[ ]:
# The following function displays the image with the given id
def show_image(image_id):
    file_name = df[df["image_id"] == image_id]["file_paths"].values[0]
    with Image.open(file_name) as im:
        plt.imshow(np.asarray(im))
        plt.axis("off")
        plt.show()

[ ]:
# You can see that the picture has 14 people. The annotations show us the same.
print("The image id " + str(image_ids[2]) + " has " + str(counts[2]) + " people.")

Next, you can see general statistics about the dataframe, including the distribution of number of people in all the images. This insight will be helpful when comparing the two models later. As you can see from the histogram below, there aren’t many images with a lot of people (with max being 15).

[ ]:
def plot_count_summary(df):
    print("The most number of people in an image is: ", df["people_count"].max())
    print(
        "There are on average " + str(round(df["people_count"].mean(), 2)) + " people in an image."
    )

    # Bin edges run one past the maximum so the largest count gets its own bar
    df.hist(column="people_count", bins=list(range(0, df["people_count"].max() + 2)))


plot_count_summary(df)

Deploy SageMaker endpoints

Having subscribed to the model packages in the previous steps, we can now deploy both models for real-time inference and batch transform.


  1. The endpoint deployment will take about 15 minutes to finish.

  2. We use 4 instances per endpoint. Because of the sample size, batch transform takes over 70 minutes with 1 instance, whereas 4 instances take no more than 35 minutes.

[ ]:
instance_count = 4
yolo_name = "yolov3-endpoint"
gluoncvssd_name = "gluoncvssd-endpoint"

gluoncvssd_model_package_arn = ModelPackageArnProvider.get_ssd_model_package_arn(region)
yolov3_model_package_arn = ModelPackageArnProvider.get_yolov3_model_package_arn(region)
[ ]:
def deploy_model(num_instances, model_arn, instance_type, model_name):
    # Create a deployable model from the subscribed model package
    model = ModelPackage(
        role=role, model_package_arn=model_arn, sagemaker_session=sagemaker_session
    )
    # Real-time endpoint
    model.deploy(num_instances, instance_type, endpoint_name=model_name)
    # Batch transform handle for the same model
    transformer = model.transformer(num_instances, instance_type, max_concurrent_transforms=2)

    return model, transformer

# Deploy Gluon SSD object detector ML model
ssd_model, ssd_batch = deploy_model(
    instance_count, gluoncvssd_model_package_arn, instance_type, gluoncvssd_name
)

# Deploy Gluon YoloV3 object detector ML model
yolov3_model, yolov3_batch = deploy_model(
    instance_count, yolov3_model_package_arn, instance_type, yolo_name
)

Perform inferences

Since there are more than 100k images in the COCO dataset, we will sample 5000 of them for endpoint inference analysis.

[ ]:
df_sample = df.sample(n=5000)
df_sample["file_name_id"] = df_sample["image_id"].astype("str").str.zfill(12)

In a 5000 sample from the dataset, we would expect the mean and the distribution to be similar, assuming sampling was done randomly.
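
To illustrate why a random sample of 5000 should track the full dataset, here is a small sketch using only the standard library. The `population` list is synthetic and only stands in for the `people_count` column; the numbers it produces are not results from COCO.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is repeatable

# Synthetic per-image people counts standing in for df["people_count"].
population = [random.choice([0, 0, 0, 1, 1, 2, 3, 5, 8, 14]) for _ in range(100_000)]

# Draw 5000 items without replacement, as df.sample(n=5000) does.
sample = random.sample(population, 5000)

population_mean = mean(population)
sample_mean = mean(sample)
print(round(population_mean, 2), round(sample_mean, 2))
```

The two means should differ only slightly. In the notebook itself, passing `random_state` to `df.sample` would make the draw reproducible across runs.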

[ ]:
[ ]:
# This function loops through the array and counts number of person occurrences
# and collects rectangular coordinates needed for drawing bounding boxes
def count_people(data, bounding_box="no"):
    counter = 0
    coordinates = []
    for item in data:
        if item["id"] == "person" and item["score"] >= 0.2:
            counter += 1
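            if bounding_box == "yes":
                # Store each box as [x, y, width, height] for drawing
                coordinates.append(
                    [
                        item["left"],
                        item["top"],
                        item["right"] - item["left"],
                        item["bottom"] - item["top"],
                    ]
                )
    return counter, coordinates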

# This function performs real time inference on payload and returns people count
def invoke_DL_endpoint_and_count_people(image_path, runtime, endpoint_name, bounding_box="no"):
    with open(image_path, "rb") as image_file:
        img = image_file.read()

    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType=content_type,
        CustomAttributes='{"threshold": 0.2}',
        Body=img,
    )
    result = json.loads(response["Body"].read().decode("utf-8"))

    return count_people(result, bounding_box)

# This function performs batch transform on payload and returns the output path
def batch_transform(data, transformer, content_type):
    transformer.transform(data=data, data_type="S3Prefix", content_type=content_type)

    output = transformer.output_path

    return output

# This function returns people count on batch transform output
def batch_count_people(file_name_id, output_path, s3_client):
    parsed_url = urlparse(output_path)
    bucket_name = parsed_url.netloc

    file_key = "{}/{}.out".format(parsed_url.path[1:], file_name_id + ".jpg")

    # Read from the bucket named in the output path rather than assuming the default bucket
    response = s3_client.get_object(Bucket=bucket_name, Key=file_key)
    response_bytes = json.loads(response["Body"].read().decode("utf-8"))

    return count_people(response_bytes)
[ ]:

To make inferences from the sample dataset, we will use batch transform. Here are the steps:

  1. Upload the data to S3. (This will take about 10 minutes to finish)

  2. Use the transformer and call the transform function.

  3. The output is written to another S3 location, whose path we will retrieve.

  4. Using the path, we can determine how many people appear in each image and add the result to our dataframe.
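
Step 4 relies on the fact that each batch transform result is stored under the output path with the input file name plus an `.out` suffix. A minimal sketch of that mapping follows; `batch_output_location` is a hypothetical helper and the bucket and job names are made up for illustration.

```python
from urllib.parse import urlparse


def batch_output_location(output_path, file_name_id, extension=".jpg"):
    """Split an s3:// output path and build the key of one result file."""
    parsed = urlparse(output_path)
    bucket_name = parsed.netloc
    prefix = parsed.path.strip("/")
    file_name = "{}{}.out".format(file_name_id, extension)
    key = "{}/{}".format(prefix, file_name) if prefix else file_name
    return bucket_name, key


# Hypothetical output path; the real one comes from transformer.output_path.
bucket_name, key = batch_output_location(
    "s3://example-bucket/example-transform-job", "000000000009"
)
print(bucket_name)  # example-bucket
print(key)  # example-transform-job/000000000009.jpg.out
```

Returning the bucket together with the key keeps the lookup correct even if the output path points somewhere other than the session's default bucket.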

[ ]:
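# Upload all the sampled images to the S3 bucket you created earlier.
# key_prefix="data" (the upload_data default) matches the s3://<bucket>/data input used for batch transform.
sample_image_list = df_sample.file_paths.tolist()
input_list = []
for item in sample_image_list:
    input_list.append(sagemaker_session.upload_data(item, bucket=bucket, key_prefix="data"))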
[ ]:
ssd_output_path = batch_transform("s3://{}/data".format(bucket), ssd_batch, content_type)
[ ]:
df_sample["ssd"] = df_sample["file_name_id"].apply(
    batch_count_people, output_path=ssd_output_path, s3_client=s3_client
)
[ ]:
yolov3_output_path = batch_transform("s3://{}/data".format(bucket), yolov3_batch, content_type)
[ ]:
df_sample["yolov3"] = df_sample["file_name_id"].apply(
    batch_count_people, output_path=yolov3_output_path, s3_client=s3_client
)

Note the difference in batch transform job duration between GluonCV SSD (20 minutes) and YoloV3 (35 minutes).

We see that the columns have been added on our 5000-sample dataframe to show the number of people counted using both endpoints.

[ ]:

ML Model evaluation

Generic data evaluation

Using the sample dataset from COCO, we can look at the number of images for which each endpoint predicted the wrong number of people, and also calculate the accuracy:

[ ]:
df_ssd_bad = df_sample[df_sample["people_count"] != df_sample["ssd"]]
df_yolov3_bad = df_sample[df_sample["people_count"] != df_sample["yolov3"]]

print("Number of incorrectly predicted images using GluonCV SSD: ", df_ssd_bad.shape[0])
print("Number of incorrectly predicted images using GluonCV YoloV3: ", df_yolov3_bad.shape[0])
print(
    "People counting accuracy of YoloV3 endpoint: ",
    1 - (df_yolov3_bad.shape[0] / df_sample.shape[0]),
)
print(
    "People counting accuracy of GluonCV SSD endpoint: ",
    1 - (df_ssd_bad.shape[0] / df_sample.shape[0]),
)

Another metric we can look at is Mean Absolute Error (MAE).

[ ]:
def get_MAE(df_in, true_col, pred_col):
    total_err = 0
    for index, row in df_in.iterrows():
        total_err += abs(row[pred_col] - row[true_col])

    return total_err / len(df_in.index)
[ ]:
print("Mean Absolute Error for the GluonCV SSD: ", get_MAE(df_sample, "people_count", "ssd"))
print("Mean Absolute Error for the GluonCV YoloV3: ", get_MAE(df_sample, "people_count", "yolov3"))
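
For reference, the two metrics used in this section can be written in a few lines of plain Python. This is a sketch on made-up counts, not output from either endpoint; `exact_match_accuracy` and `mean_absolute_error` are illustrative names.

```python
def exact_match_accuracy(true_counts, predicted_counts):
    """Fraction of images whose predicted count equals the annotated count."""
    matches = sum(1 for t, p in zip(true_counts, predicted_counts) if t == p)
    return matches / len(true_counts)


def mean_absolute_error(true_counts, predicted_counts):
    """Average absolute difference between predicted and annotated counts."""
    total = sum(abs(p - t) for t, p in zip(true_counts, predicted_counts))
    return total / len(true_counts)


# Made-up counts for four images.
true_counts = [2, 0, 5, 14]
predicted_counts = [2, 1, 3, 14]

accuracy = exact_match_accuracy(true_counts, predicted_counts)
mae = mean_absolute_error(true_counts, predicted_counts)
print(accuracy)  # 0.5
print(mae)  # 0.75
```

Accuracy treats a miss by one person the same as a miss by ten, while MAE reflects the size of the miss, which is why the notebook reports both. On a dataframe, `(df[pred_col] - df[true_col]).abs().mean()` gives the same MAE without an explicit loop.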

Few people in the frame evaluation:

How do the models perform on crowds of 2 to 5 people?

[ ]:
df_2_5 = df_sample[(df_sample["people_count"] >= 2) & (df_sample["people_count"] <= 5)]
df_ssd_bad_2_5 = df_2_5[df_2_5["people_count"] != df_2_5["ssd"]]
df_yolov3_bad_2_5 = df_2_5[df_2_5["people_count"] != df_2_5["yolov3"]]

print("Number of incorrectly predicted images using GluonCV SSD: ", df_ssd_bad_2_5.shape[0])
print("Number of incorrectly predicted images using GluonCV YoloV3: ", df_yolov3_bad_2_5.shape[0])
print(
    "People counting accuracy of YoloV3 endpoint: ",
    1 - (df_yolov3_bad_2_5.shape[0] / df_2_5.shape[0]),
)
print(
    "People counting accuracy of GluonCV SSD endpoint: ",
    1 - (df_ssd_bad_2_5.shape[0] / df_2_5.shape[0]),
)

Let’s calculate the MAE for the 2-5 people range.

[ ]:
print("Mean Absolute Error for the GluonCV SSD: ", get_MAE(df_2_5, "people_count", "ssd"))
print("Mean Absolute Error for the GluonCV YoloV3: ", get_MAE(df_2_5, "people_count", "yolov3"))

Medium crowd evaluation:

How do the models perform on crowds of 10 to 14 people? (The COCO dataset has a max of about 14 people in a photo, with a few that have 15)

[ ]:
df_med = df_sample[df_sample["people_count"] >= 10]
[ ]:
df_ssd_bad_med = df_med[df_med["people_count"] != df_med["ssd"]]
df_yolov3_bad_med = df_med[df_med["people_count"] != df_med["yolov3"]]

print("Number of incorrectly predicted images using GluonCV SSD: ", df_ssd_bad_med.shape[0])
print("Number of incorrectly predicted images using GluonCV YoloV3: ", df_yolov3_bad_med.shape[0])
print(
    "People counting accuracy of YoloV3 endpoint: ",
    1 - (df_yolov3_bad_med.shape[0] / df_med.shape[0]),
)
print(
    "People counting accuracy of GluonCV SSD endpoint: ",
    1 - (df_ssd_bad_med.shape[0] / df_med.shape[0]),
)
[ ]:
print("Mean Absolute Error for the GluonCV SSD in crowds: ", get_MAE(df_med, "people_count", "ssd"))
print(
    "Mean Absolute Error for the GluonCV YoloV3 in crowds: ",
    get_MAE(df_med, "people_count", "yolov3"),
)

CONCLUSION: The bulk model evaluation shows that YoloV3 performs significantly better than SSD on images with fewer people (< 20), as seen in its higher accuracy and lower mean absolute error. Let’s see what happens when we look at some images with a larger crowd (over 30).

Since the COCO dataset does not contain any images with a crowd of more than 20, we will have to use other image resources to dive into the edge case.

Edge Cases

The following are individual images from Pexels (pexels.com) and Flickr (flickr.com) that are used to look at special edge cases. These edge cases give us a more specific view of how the models behave than an aggregate metric such as MAE or accuracy. Here are some examples of these special edge cases:

  1. Large crowd

  2. People in special costumes

  3. People with pets

  4. People wearing masks

  5. Harder to identify photos (people’s back showing, smaller representations, etc.)

If you wish to view a larger version of an image, its link is provided with each use case.

As a side-by-side comparison, the following functions will be used to display:

  - The original image

  - The image with bounding boxes from GluonCV SSD inferences (in red)

  - The image with bounding boxes from YoloV3 inferences (in blue)
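
The boxes in these comparisons come from the `left`, `top`, `right`, and `bottom` fields of each detection, filtered by label and score. A minimal sketch of that filtering and conversion on a hand-written response is shown here; `people_boxes` is a hypothetical helper and the detections are invented, assuming the same response fields that `count_people` reads.

```python
def people_boxes(detections, threshold=0.2):
    """Return (x, y, width, height) for each 'person' detection at or above the threshold."""
    boxes = []
    for item in detections:
        if item["id"] == "person" and item["score"] >= threshold:
            boxes.append(
                (
                    item["left"],
                    item["top"],
                    item["right"] - item["left"],
                    item["bottom"] - item["top"],
                )
            )
    return boxes


# Invented detections in the same shape as the endpoint response.
toy_detections = [
    {"id": "person", "score": 0.9, "left": 10, "top": 20, "right": 60, "bottom": 120},
    {"id": "person", "score": 0.1, "left": 0, "top": 0, "right": 5, "bottom": 5},
    {"id": "dog", "score": 0.8, "left": 70, "top": 80, "right": 90, "bottom": 100},
]

boxes = people_boxes(toy_detections)
print(boxes)  # [(10, 20, 50, 100)]
print(len(boxes))  # the people count for this image: 1
```

The (x, y, width, height) form is what a matplotlib rectangle patch expects, which is why the corner coordinates are converted before drawing.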

[ ]:
# This function helps us display bounding boxes of each recognized person from the inferences
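def draw_bounding_boxes(image, title="None", coordinates=(), box_color="b"):
    plt.imshow(image)
    plt.axis("off")
    if title != "None":
        # Each item is [x, y, width, height], as collected by count_people
        for item in coordinates:
            plt.gca().add_patch(
                patches.Rectangle(
                    (item[0], item[1]),
                    item[2],
                    item[3],
                    linewidth=0.5,
                    edgecolor=box_color,
                    facecolor="none",
                )
            )
    plt.title(title, fontsize=6)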

# This function will invoke real time inference of both endpoints and display the following:
# regular image, image w/ GluonCV SSD bounding boxes, and image w/ YoloV3 bounding boxes
# last two are displayed only if bounding_box='yes'
def compare_models_against_ground_truth(image_path, bounding_box="no"):
    image = Image.open(image_path)

    gluon_count, gluon_coordinates = invoke_DL_endpoint_and_count_people(
        image_path, runtime, gluoncvssd_name, bounding_box
    )
    yolo_count, yolo_coordinates = invoke_DL_endpoint_and_count_people(
        image_path, runtime, yolo_name, bounding_box
    )

    fig = plt.figure(figsize=(8, 6), dpi=300)
    # Display the original image without boxes
    fig.add_subplot(1, 3, 1)
    draw_bounding_boxes(image, "Original Image")

    if bounding_box == "yes":
        # Display bounding boxes for GluonCV SSD
        fig.add_subplot(1, 3, 2)
        draw_bounding_boxes(image, "GluonCV SSD Bounding Boxes", gluon_coordinates, "r")

        # Display bounding boxes for YoloV3
        fig.add_subplot(1, 3, 3)
        draw_bounding_boxes(image, "YoloV3 Bounding Boxes", yolo_coordinates, "b")

    print("Count from GluonCV SSD: " + str(gluon_count) + " people")
    print("Count from GluonCV YOLOv3: " + str(yolo_count) + " people")

a. Large crowd

In this section, you will evaluate how the ML model performs when there are a large number of people in the picture. The following picture was taken outdoors during the day.

Image 1: https://www.flickr.com/photos/ajay_g/9134506074/

We chose this image to see how the endpoints perform using an image where most of the crowd is far from the camera.

[ ]:
compare_models_against_ground_truth("img/Crowd_01.jpg", "yes")

Manual count: 42 people

OBSERVATION: As we can see, GluonCV SSD performed better by the numbers; however, as indicated by the bounding boxes, GluonCV SSD tends to overcount by including duplicates (i.e. there is a bounding box for upper body of the bride AND the full body of the bride).
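
One way to check for this kind of duplicate is to flag pairs of boxes where one lies almost entirely inside the other. The sketch below is not part of the original notebook, and the 0.9 threshold is an arbitrary assumption; it takes boxes in (x, y, width, height) form, and the three boxes are invented for illustration.

```python
def containment(box_a, box_b):
    """Share of the smaller box's area that is covered by the overlap of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = min(ax + aw, bx + bw) - max(ax, bx)
    inter_h = min(ay + ah, by + bh) - max(ay, by)
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    smaller_area = min(aw * ah, bw * bh)
    return (inter_w * inter_h) / smaller_area


def likely_duplicates(boxes, threshold=0.9):
    """Return index pairs where one box is (almost) fully contained in the other."""
    pairs = []
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if containment(boxes[i], boxes[j]) >= threshold:
                pairs.append((i, j))
    return pairs


# Invented boxes: a full-body box, an upper-body box inside it, and a separate person.
toy_boxes = [(100, 50, 80, 200), (110, 50, 60, 90), (300, 60, 70, 190)]
duplicate_pairs = likely_duplicates(toy_boxes)
print(duplicate_pairs)  # [(0, 1)]
```

A flagged pair is only a hint: a child standing in front of an adult would also produce nested boxes, so the pairs are best reviewed against the image rather than removed automatically.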

Image 2: https://www.flickr.com/photos/weltbild-schweiz/5201818898/

We chose this image to see how the endpoints perform using an image that is not well lit and at an angle.

[ ]:
compare_models_against_ground_truth("img/Crowd_02.jpg", "yes")

Manual Count: 175

OBSERVATION: Similar to Image 1, GluonCV SSD did a much better job recognizing despite the darker images; however, it also recognized several with larger bounding boxes that included a group of people rather than individuals. Overall, both failed to recognize many individuals in the middle of the crowd where the lighting was at the dimmest.

Image 3: https://www.pexels.com/photo/people-inside-terminal-983959/

We chose this image to see how the endpoints perform using an image that is frontal view rather than at an angle.

[ ]:
compare_models_against_ground_truth("img/Crowd_03.jpg", "yes")

Manual Count: 209 people

OBSERVATION: Similar to other images, GluonCV SSD did a much better job recognizing than YOLOv3; notice the difference in how far GluonCV SSD is able to recognize versus YOLOv3. Despite bigger bounding boxes again appearing on GluonCV SSD, we can definitively say that GluonCV SSD performed a lot better than YOLOv3 in terms of identifying more people amongst the crowds and further away from where the image was taken.

In a mass public area (such as train stations, airports, malls, etc.), it may make more sense to deploy a surveillance system with GluonCV SSD rather than YOLOv3 for the purposes of people counting.

Let’s explore some other edge cases that may give us other interesting insights.

b. People wearing costumes

In this section, you will evaluate how the ML model performs when humans in the picture are wearing costumes.

Image 1: https://www.pexels.com/photo/ghosts-holding-a-carved-pumpkin-5435309/

We chose this image to see if we can confuse the algorithms to misrecognize or fail to recognize people.

[ ]:
compare_models_against_ground_truth("img/Halloween_Party_01.jpg", "yes")

Manual Count: 3 people

OBSERVATION: When a costume covers the entire body, we can see that both models fail to recognize the individuals inside the costume.

Image 2: https://www.flickr.com/photos/presidioofmonterey/31751061088

We chose this image to add different depths and angles of people in addition to them wearing different costumes.

[ ]:
compare_models_against_ground_truth("img/Halloween_Party_02.jpg", "yes")

Manual Count: 14 people

OBSERVATION: Both models performed well on this particular image, especially being able to recognize different face paints and depth (i.e. the DJ in the booth).

c. People with pets

In this section, you will evaluate how the ML model performs when both pets and humans are in the picture.

Image 1: https://www.pexels.com/photo/man-in-maroon-t-shirt-playing-with-his-large-short-coated-black-and-brown-dog-1172060/

We chose this image to see how the endpoints perform in an image of mostly empty area.

[ ]:
compare_models_against_ground_truth("img/Pet_Image_01.jpg", "yes")

Manual Count: 1 person

OBSERVATION: Both models recognized correctly.

Image 2: https://www.flickr.com/photos/ajay_g/9076102094/

We chose this image to see how the endpoints perform in an image of a busy street - lots of pets AND people.

[ ]:
compare_models_against_ground_truth("img/Pet_Image_02.jpg", "yes")

Manual Count: 20 people

OBSERVATION: GluonCV SSD was able to successfully recognize more people at the back of the street, behind the person on the scooter, making it more accurate than YOLOv3. Both models performed well, especially as they both were able to recognize the person on the very left, whose body was cut off on the edge of the image.

d. People wearing masks

In this section, you will evaluate how the ML model performs when people are wearing masks.

Image 1: https://www.flickr.com/photos/sbanfield/50915718208/

[ ]:
compare_models_against_ground_truth("img/Mask_01.jpg", "yes")

Manual Count: 3 people

OBSERVATION: Masks did not affect the models’ ability to recognize people. Both successfully identified the masked people in the image.

Image 2: https://www.pexels.com/photo/people-wearing-face-mask-for-protection-3957986/

We chose this image to see how the endpoints perform when there are manufactured images of humans (i.e. drawings)

[ ]:
compare_models_against_ground_truth("img/Mask_02.jpg", "yes")

Manual Count: 4 people

OBSERVATION: Both models were susceptible to depictions of humans: the Mona Lisa in a picture was incorrectly recognized by both models! Other than that, GluonCV SSD identified one individual as 2 different individuals. This is in line with the pattern of duplicates on GluonCV SSD.

Image 3: https://www.flickr.com/photos/gauthierdelecroix/50595262743/

We chose this image to see how the endpoints perform when there are reflections of humans (against a window or mirror)

[ ]:
compare_models_against_ground_truth("img/Mask_03.jpg", "yes")

Manual Count: 3 people

OBSERVATION: Both endpoints will identify a reflected person (against a window or a mirror) as an additional person. This is a limitation shown in other edge cases as well.

e. People far away from camera

In this section, you will evaluate how the ML model performs when people are very far away from the camera.

Image 1: https://www.pexels.com/photo/group-of-tourists-walking-on-snowy-hilly-terrain-6805855/

We chose this image to see how the endpoints perform when looking at an image with people very far away.

[ ]:
compare_models_against_ground_truth("img/Snow_01.jpg", "yes")

Manual Count: 5 people

OBSERVATION: Note how well YOLOv3 performs on this particular image. It can identify the 3 rightmost individuals correctly, and can also identify the 2 leftmost individuals (albeit as a single individual). It clearly performed better than GluonCV SSD.

Image 2: https://www.pexels.com/photo/photo-of-camels-on-dessert-3889891/

We chose this image to see how deserts and people riding on camels may affect the performance.

[ ]:
compare_models_against_ground_truth("img/Desert_01.jpg", "yes")

Manual Count: 7 people

OBSERVATION: As seen in this image (and the previous one), YOLOv3 is now able to recognize more people in the images where people are very far away and sparse. In fact, YOLOv3 is the model that’s overcounting in this specific example.

f. People facing away from camera

In this section, you will see how the ML model performs when the humans in the picture are looking away from the camera.

Image 1: https://www.pexels.com/photo/women-standing-near-river-1140854/

We chose this image for colors and the fact that everyone is raising their hands and how that may affect the performance.

[ ]:
compare_models_against_ground_truth("img/Facing_Away_01.jpg", "yes")

Manual Count: 6 people

OBSERVATION: The raised hands had a minor effect, as YOLOv3 counted one of the hands as another person. However, for the most part both models were able to correctly recognize all individuals in the image.

Image 2: https://www.pexels.com/photo/group-of-children-walking-near-body-of-water-silhouette-photography-939700/

We chose this image to see how the reflections may have an effect.

[ ]:
compare_models_against_ground_truth("img/Silhouette_01.jpg", "yes")

Manual Count: 7 people

OBSERVATION: The reflections here did not have an effect on either model, and we can attribute that to the wavy reflections of people on the water (versus cleaner reflections we saw against a window or a mirror).

Image 3: https://www.pexels.com/photo/crowd-in-front-of-people-playing-musical-instrument-during-nighttime-196652/

We chose this image for lighting effects.

[ ]:
compare_models_against_ground_truth("img/Audience_01.jpg", "yes")

Manual Count: 33 people

OBSERVATION: As we saw in the mass crowd examples, neither model was able to pick up all of the individuals in the crowd, particularly anyone very far from where the image was taken.

ML model evaluation summary

See below for a summary of our entire evaluation in a table, comparing the two ML models.

| Use Case | GluonCV YoloV3 | GluonCV SSD |
|---|---|---|
| Sample batch transform time | 35 min | 20 min |
| Sample accuracy | | |
| Sample MAE | | |
| Small crowd (2-5) accuracy | | |
| Small crowd (2-5) MAE | | |
| Medium crowd (10-14) | | |
| Medium crowd (10-14) MAE | | |
| Large crowd (20+) | | |
| People in costumes | | |
| People with pets | | |
| People wearing masks | | |
| Monochromatic pictures | | |
| People far away from camera | slightly better | |
| People facing away from camera | | |

Consider using GluonCV SSD for:

  - Large mass transit areas (e.g. airports, train stations), where there will be a lot of people at every depth of your camera

  - Low-latency inference use cases, as GluonCV SSD inference time is shorter than that of YOLOv3

Consider using YOLOv3 for:

  - Aerial views that capture people who are sparse and far away

  - Images with fewer people

These conclusions are based on our sample images; it is highly recommended that you explore other images to see whether one endpoint is better than the other for your particular use case.


Cleanup

Now that we’ve learned how to use AWS Marketplace ML endpoints to analyze model performance, it is time to clean up. Here are the resources to delete:

  1. Model, endpoint configurations, and the endpoints,

  2. S3 buckets and bucket items, and

  3. This notebook instance
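
The first item can be scripted. The sketch below is one possible approach, not the notebook's original cleanup cell: `delete_endpoint_resources` is a hypothetical helper that expects a boto3 SageMaker client and looks up the endpoint configuration and model names before deleting them. Models created only for batch transform may have different names and are not covered here.

```python
def delete_endpoint_resources(sm_client, endpoint_name):
    """Delete an endpoint, its endpoint configuration, and the models that configuration references."""
    endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint["EndpointConfigName"]

    config = sm_client.describe_endpoint_config(EndpointConfigName=config_name)
    model_names = [variant["ModelName"] for variant in config["ProductionVariants"]]

    # Delete in dependency order: endpoint first, then its configuration, then the models.
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    for model_name in model_names:
        sm_client.delete_model(ModelName=model_name)

    return config_name, model_names


# Possible usage inside the notebook (assumes the variables defined earlier):
# sm_client = sagemaker_session.boto_session.client("sagemaker")
# for name in (yolo_name, gluoncvssd_name):
#     delete_endpoint_resources(sm_client, name)
```

Passing the client in as an argument keeps the helper easy to try against a stub before pointing it at a real account.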

[ ]:



To delete the S3 buckets and the items from the batch transform:

  1. Sign in to the S3 console.

  2. In the Buckets list, select the bucket and select Empty.

  3. Type permanently delete and select Empty to empty the bucket.

  4. Once completed, select the bucket again and this time select Delete.

  5. Type the name of the bucket and select Delete bucket to delete the bucket.

Note: Deleting a bucket cannot be undone; because bucket names are globally unique, another AWS user can claim the name once your bucket is deleted.
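
The console steps for emptying and deleting a bucket can also be done from code. This is a minimal sketch under the assumption that the bucket is not versioned; `empty_and_delete_bucket` is a hypothetical helper that expects a boto3 S3 resource object.

```python
def empty_and_delete_bucket(s3_resource, bucket_name):
    """Remove every object in the bucket, then delete the bucket itself."""
    bucket = s3_resource.Bucket(bucket_name)
    # A bucket must be empty before it can be deleted.
    bucket.objects.all().delete()
    bucket.delete()
    return bucket_name


# Possible usage inside the notebook (assumes the variables defined earlier):
# s3_resource = sagemaker_session.boto_session.resource("s3")
# empty_and_delete_bucket(s3_resource, bucket)
```

Because the deletion is irreversible, it is worth printing the bucket name and confirming it before calling the helper.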

Finally, if the AWS Marketplace subscription was created just for the experiment and you would like to unsubscribe from the product, follow the steps below. Before you cancel the subscription, ensure that you do not have any deployable model created from the model package or using the algorithm. Note - You can find this information by looking at the container name associated with the model.

Steps to unsubscribe from the product on AWS Marketplace:

  1. Navigate to the Machine Learning tab on the Your Software subscriptions page.

  2. Locate the listing whose subscription you want to cancel, and then click Cancel Subscription.

[ ]: