Amazon SageMaker Feature Store: How to securely store an image dataset in your Feature Store with a KMS key?

This notebook demonstrates how to securely store a dataset of images into your Feature Store using a KMS key. This is demonstrated using the MNIST dataset.

The example in this notebook starts by retrieving the dataset from an Amazon S3 bucket (you can substitute your own S3 bucket that stores your image dataset) and then prepares it for ingestion into an online or offline feature store. We use a Key Management Service (KMS) key for server-side encryption so that your data is stored securely in your feature store. Finally, we query the ingested dataset from your feature store and demonstrate how to retrieve your image dataset.

This notebook uses a KMS key for server-side encryption of your Feature Store. For more information on server-side encryption, see Feature Store: Encrypt Data in your Online or Offline Feature Store using KMS key.

If you would like to encrypt your data on the client side prior to ingestion, see Amazon SageMaker Feature Store: Client-side Encryption using AWS Encryption SDK for a demonstration.

Overview

Set up
Load in your image data set
Create Feature Groups and ingest your encrypted data into them
Query your data in your feature store using Amazon Athena
Plot your image data set

Prerequisites

This notebook uses the Python SDK library for Feature Store and the Python 3 (Data Science) kernel. To encrypt your data with a KMS key for server-side encryption, you need an active KMS key. If you do not have one, you can create it by following the KMS Policy Template steps, or by visiting the KMS section in the console and following the prompts for creating a KMS key. This notebook is compatible with SageMaker Studio, Jupyter, and JupyterLab.

Library Dependencies:

sagemaker>=2.0.0
numpy
pandas
boto3

Data

This notebook uses the MNIST dataset.

[ ]:

from time import gmtime, strftime
from sagemaker.feature_store.feature_group import FeatureGroup

import sagemaker
import boto3
import pandas as pd
import numpy as np
import pickle
import gzip
import time
import ast
import matplotlib.pyplot as plt
import os.path

Set up

[ ]:

sagemaker_session = sagemaker.Session()
s3_bucket_name = sagemaker_session.default_bucket()  # This is the bucket for your offline store.
public_s3_bucket_name = f"sagemaker-example-files-prod-{sagemaker_session.boto_region_name}"  # This is the name of the public S3 bucket.
prefix = "sagemaker-featurestore-demo"
role = sagemaker.get_execution_role()
region = sagemaker_session.boto_region_name

Download MNIST

We are using the MNIST data set, which is stored in a publicly available S3 bucket. Below is a method that downloads a file to your current working directory. We use it to download the MNIST data set from the public S3 bucket.

[ ]:

def download_file_from_s3(bucket, path, filename):
    """
    Download filename to your current directory.
    Parameters:
        bucket: S3 bucket name
        path: path to file
        filename: the name of the file you are downloading
    Returns:
        None
    """
    if not os.path.exists(filename):
        s3 = boto3.client("s3", region_name="us-east-1")
        s3.download_file(Bucket=bucket, Key=path, Filename=filename)


download_file_from_s3(
    public_s3_bucket_name, path="datasets/image/MNIST/mnist.pkl.gz", filename="mnist.pkl.gz"
)

Additional - Helper Method

The following optional helper loads jpg or jpeg images from an S3 bucket directly into a numpy array. Pass the bucket name, s3_bucket_name, and the prefix path, prefix_path, to load_images_into_array. It relies on the Pillow library for Image.open. Note: This notebook does not use this helper.

[ ]:

from PIL import Image  # Pillow; needed for Image.open below


def load_images_into_array(s3_bucket_name, prefix_path):
    """
    Return a numpy array of images.
    Parameters:
        s3_bucket_name: S3 bucket name
        prefix_path: path to images in your S3 bucket
    Returns:
        Numpy array.
    """
    s3 = boto3.resource("s3")
    bucket = s3.Bucket(s3_bucket_name)

    def s3_get_image_paths(bucket, prefix_path, img_exts=("jpg", "jpeg")):
        """
        Return a list of paths of images.
        Parameters:
            bucket: boto3 S3 Bucket resource
            prefix_path: path to images in your S3 bucket
            img_exts: image extensions
        Returns:
            A list of paths to images.
        """
        img_path_lst = []
        for obj_summary in bucket.objects.filter(Prefix=prefix_path):
            if obj_summary.key.endswith(tuple(img_exts)):
                img_path_lst.append(obj_summary.key)
        return img_path_lst

    img_path_lst = s3_get_image_paths(bucket, prefix_path)

    lst = []
    for img_path in img_path_lst:
        # Avoid shadowing the built-in name `object`.
        s3_object = bucket.Object(img_path)
        response = s3_object.get()
        file_stream = response["Body"]
        lst.append(np.array(Image.open(file_stream)))
    return np.array(lst)


# Below demonstrates how to use this method.
# img_lst = load_images_into_array(s3_bucket_name, prefix_path=image_path)

Unzip and load in dataset

[ ]:

with gzip.open("mnist.pkl.gz", "rb") as f:
    train_set, validation_set, test_set = pickle.load(f, encoding="latin1")
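
The archive unpacks into three (images, labels) pairs. As a minimal sketch, the same gzip-plus-pickle round trip can be tried on a tiny stand-in file. The file name tiny.pkl.gz and its contents are made up for illustration. The encoding="latin1" argument matters when a pickle was written by Python 2, which is presumably why the cell above passes it.

```python
import gzip
import os
import pickle
import tempfile

# Hypothetical stand-in for mnist.pkl.gz: three (images, labels) pairs.
tiny_splits = (
    ([[0.0, 1.0]], [5]),  # train
    ([[0.5, 0.5]], [0]),  # validation
    ([[1.0, 0.0]], [4]),  # test
)

tmp_dir = tempfile.mkdtemp()
tiny_path = os.path.join(tmp_dir, "tiny.pkl.gz")

# Write the compressed pickle.
with gzip.open(tiny_path, "wb") as f:
    pickle.dump(tiny_splits, f)

# Read it back the same way the notebook reads mnist.pkl.gz.
with gzip.open(tiny_path, "rb") as f:
    tiny_train, tiny_validation, tiny_test = pickle.load(f, encoding="latin1")

tiny_train_x, tiny_train_y = tiny_train
print(len(tiny_train_x), tiny_train_y)
```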

[ ]:

train_x, train_y = train_set
# Reshape the image so it can be plotted
train_x = train_x.reshape(train_x.shape[0], 28, 28)

In the following example, we plot a single image.

[ ]:

plt.imshow(train_x[0])
plt.show()

Create a data frame of our images.

We represent each image as a flattened array and also store its original shape. Both are columns of the data frame that will be ingested into your feature store.

Important: At this time, Feature Store only supports flattened images with a maximum length of 350k.

[ ]:

def create_data_frame(img_lst, col_names=("img", "shape")):
    """
    Return a Pandas data frame where each row corresponds to an
    image represented as an array, the original shape of that image and an id.
    Parameters:
        img_lst: a list of images.
        col_names: names of the columns in your data frame
    Returns:
        Pandas data frame.

    """
    img_col = []
    img_shape_col = []
    ids = []

    for index, img in enumerate(img_lst):
        img_flat = img.reshape(-1)
        # threshold=img_flat.size keeps numpy from abbreviating arrays of
        # more than 1000 elements with "...".
        img_as_str = str(
            np.array2string(
                img_flat,
                precision=2,
                separator=",",
                suppress_small=True,
                threshold=img_flat.size,
            )
        ).encode("utf-8")
        img_shape = list(img.shape)
        img_col.append(img_as_str)
        img_shape_col.append(img_shape)
        ids.append(index)

    return pd.DataFrame({"id": ids, col_names[0]: img_col, col_names[1]: img_shape_col})


df = create_data_frame(train_x[:5])
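
The flatten-and-record-shape idea behind create_data_frame can be seen in isolation with plain Python lists. This is only a sketch: flatten_image, restore_image, and fits_length_limit are hypothetical helper names, and the notebook does not say what unit "350k" refers to, so the check below assumes it means characters of the serialized pixel string.

```python
MAX_SERIALIZED_LENGTH = 350_000  # Assumed unit: characters of the serialized string.


def flatten_image(image_rows):
    """Return (flat pixel list, [rows, cols]) for a rectangular 2-D image."""
    shape = [len(image_rows), len(image_rows[0])]
    flat = [pixel for row in image_rows for pixel in row]
    return flat, shape


def restore_image(flat, shape):
    """Rebuild the 2-D image from its flat pixel list and stored shape."""
    rows, cols = shape
    if len(flat) != rows * cols:
        raise ValueError("flat length does not match the stored shape")
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


def fits_length_limit(flat, max_length=MAX_SERIALIZED_LENGTH):
    """Check the serialized pixel string against the assumed length limit."""
    serialized = ",".join(f"{pixel:.2f}" for pixel in flat)
    return len(serialized) <= max_length


image = [[0.0, 0.25, 0.5], [0.75, 1.0, 0.0]]
flat, shape = flatten_image(image)
restored = restore_image(flat, shape)
```

Storing the shape next to the flat values is what makes the reshape step possible after the data comes back from the feature store.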

[ ]:

df.head()

[ ]:

df.dtypes

[ ]:

def cast_object_to_string(data_frame):
    """
    Cast all columns of data_frame of type object to type string and return it.
    Parameters:
        data_frame: A pandas Dataframe
    Returns:
        Data frame
    """
    for label in data_frame.columns:
        if data_frame.dtypes[label] == object:
            data_frame[label] = data_frame[label].astype("str").astype("string")
    return data_frame

[ ]:

# Cast columns of df of type object to string.
df = cast_object_to_string(df)

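# Remove the escaped newlines (a literal backslash followed by "n") that the
# bytes-to-string cast leaves behind, so the value can be ingested later.
# regex=False makes pandas treat the pattern as literal text.
df["img"] = df["img"].str.replace("\\n ", "", regex=False)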
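
The cast to string above turns each bytes value into its printed form, which is why the img strings carry a b'...' wrapper and escaped newlines written as a backslash followed by "n". The following sketch reproduces that on a short made-up value, using only the standard library; the sample string merely imitates the multi-line layout that np.array2string produces for long arrays.

```python
import ast

# Made-up sample that imitates a multi-line array string.
raw = "[0.  ,0.5 ,\n 1.  ]"

as_bytes = raw.encode("utf-8")
as_text = str(as_bytes)  # printed form of the bytes: b'...' with an escaped newline

# Remove the escaped newline (literal backslash + "n" + space).
cleaned = as_text.replace("\\n ", "")

# Drop the b'...' wrapper, then parse the list literal.
payload = cleaned.strip("b").strip("'")
values = ast.literal_eval(payload)
print(values)
```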

[ ]:

df.head()

Create your Feature Group and Ingest your data into it

Below we start by appending the EventTime feature to your data to timestamp entries, then we load the feature definitions and instantiate the Feature Group object. Finally, we ingest the data into your feature store.

[ ]:

feature_group_name = "mnist-feature-group-" + strftime("%d-%H-%M-%S", gmtime())

Instantiate a FeatureGroup object for your data.

[ ]:

feature_group = FeatureGroup(name=feature_group_name, sagemaker_session=sagemaker_session)

[ ]:

record_identifier_feature_name = "id"

Append the EventTime feature to your data frame. This feature is required; it timestamps each record.

[ ]:

current_time_sec = int(round(time.time()))
event_time_feature_name = "EventTime"
# append EventTime feature
df[event_time_feature_name] = pd.Series([current_time_sec] * len(df), dtype="float64")
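
The cell above stores the event time as seconds since the Unix epoch in a float column. To our understanding, Feature Store also accepts event times as ISO-8601 strings in UTC, but treat that as an assumption to verify against the Developer Guide. A short sketch of producing both forms with the standard library (the variable names are illustrative only):

```python
from datetime import datetime, timezone

now = datetime.now(timezone.utc)

# Fractional form: whole seconds since the Unix epoch, stored as a float,
# matching what the notebook does.
event_time_fractional = float(int(now.timestamp()))

# String form: ISO-8601 in UTC with a trailing "Z".
event_time_iso = now.strftime("%Y-%m-%dT%H:%M:%SZ")

print(event_time_fractional, event_time_iso)
```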

Load the feature definitions of your data into your feature group.

[ ]:

feature_group.load_feature_definitions(data_frame=df)

Create your feature group.

Important: The cell below passes a variable named kms_key, which this notebook does not define. Assign your KMS key ARN to kms_key before running the cell to enable server-side encryption (SSE). The cell demonstrates how to enable SSE for an offline store. To use an online store, set enable_online_store to True. To enable SSE for an online store, set online_store_kms_key_id to your KMS key.

[ ]:

feature_group.create(
    s3_uri=f"s3://{s3_bucket_name}/{prefix}",
    record_identifier_name=record_identifier_feature_name,
    event_time_feature_name="EventTime",
    role_arn=role,
    enable_online_store=False,
    offline_store_kms_key_id=kms_key,  # Substitute kms_key with your kms key.
)
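
The encryption-related arguments described above can be gathered in one place before calling feature_group.create(). The sketch below is not part of the SDK: build_encryption_kwargs and looks_like_kms_key_arn are hypothetical helpers, the ARN is a placeholder, and the regular expression is only a loose shape check rather than an official validator.

```python
import re

# Placeholder ARN for illustration; replace with the ARN of your own KMS key.
example_kms_key_arn = (
    "arn:aws:kms:us-west-2:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab"
)

# Loose shape check for a KMS key ARN (an assumption, not an official validator).
_KMS_KEY_ARN = re.compile(r"^arn:aws[a-z-]*:kms:[a-z0-9-]+:\d{12}:key/[A-Za-z0-9-]+$")


def looks_like_kms_key_arn(value):
    """Return True if value has the general shape of a KMS key ARN."""
    return bool(_KMS_KEY_ARN.match(value))


def build_encryption_kwargs(kms_key_arn, use_online_store=False):
    """Collect the encryption-related keyword arguments for feature_group.create()."""
    if not looks_like_kms_key_arn(kms_key_arn):
        raise ValueError(f"Not a KMS key ARN: {kms_key_arn!r}")
    kwargs = {
        "enable_online_store": use_online_store,
        "offline_store_kms_key_id": kms_key_arn,
    }
    if use_online_store:
        # The online store takes its own key argument.
        kwargs["online_store_kms_key_id"] = kms_key_arn
    return kwargs


offline_only = build_encryption_kwargs(example_kms_key_arn)
with_online = build_encryption_kwargs(example_kms_key_arn, use_online_store=True)
```

The resulting dictionary could then be unpacked into the call, for example feature_group.create(..., **with_online), alongside the other arguments shown in the cell.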

[ ]:

feature_group.describe()

Wait until your feature group has been created before ingesting data into it.

[ ]:

def check_feature_group_status(feature_group):
    """
    Wait until the feature group leaves the "Creating" status and
    report the outcome.
    Parameters:
        feature_group: FeatureGroup
    Returns:
        None
    Raises:
        RuntimeError: if creation ends in any status other than "Created"
    """
    status = feature_group.describe().get("FeatureGroupStatus")
    while status == "Creating":
        print("Waiting for Feature Group to be Created")
        time.sleep(5)
        status = feature_group.describe().get("FeatureGroupStatus")
    if status != "Created":
        raise RuntimeError(f"Failed to create FeatureGroup {feature_group.name}: {status}")
    print(f"FeatureGroup {feature_group.name} successfully created.")


check_feature_group_status(feature_group)
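
A polling loop like the one above runs indefinitely if the status never changes. One possible variation, sketched here with a hypothetical wait_for_status helper, adds a timeout and takes the status lookup as a callable so that it can be tried without AWS access. The fake status sequence stands in for feature_group.describe().get("FeatureGroupStatus").

```python
import time


def wait_for_status(get_status, pending="Creating", poll_seconds=5,
                    timeout_seconds=600, sleep=time.sleep):
    """Poll get_status() until it stops returning `pending`, or time out."""
    waited = 0
    status = get_status()
    while status == pending:
        if waited >= timeout_seconds:
            raise TimeoutError(f"Still {pending!r} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds
        status = get_status()
    return status


# Stand-in for repeated describe() calls; no real waiting takes place here.
fake_statuses = iter(["Creating", "Creating", "Created"])
final_status = wait_for_status(lambda: next(fake_statuses), sleep=lambda seconds: None)
print(final_status)
```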

Ingest your data into your feature group.

[ ]:

feature_group.ingest(data_frame=df, max_workers=5, wait=True)

[ ]:

time.sleep(30)

[ ]:

s3_client = sagemaker_session.boto_session.client("s3", region_name=region)

feature_group_s3_uri = (
    feature_group.describe()
    .get("OfflineStoreConfig")
    .get("S3StorageConfig")
    .get("ResolvedOutputS3Uri")
)

feature_group_s3_prefix = feature_group_s3_uri.replace(f"s3://{s3_bucket_name}/", "")
offline_store_contents = None
while offline_store_contents is None:
    objects_in_bucket = s3_client.list_objects(
        Bucket=s3_bucket_name, Prefix=feature_group_s3_prefix
    )
    if "Contents" in objects_in_bucket and len(objects_in_bucket["Contents"]) > 1:
        offline_store_contents = objects_in_bucket["Contents"]
    else:
        print("Waiting for data in offline store...\n")
        time.sleep(60)

print("Data available.")
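
The cell above derives the key prefix by removing the "s3://bucket/" part of the URI with a string replace. As an alternative sketch, the URI can be split with the standard library, which also yields the bucket name. The split_s3_uri helper and the example URI are hypothetical; the real ResolvedOutputS3Uri will look different.

```python
from urllib.parse import urlparse


def split_s3_uri(s3_uri):
    """Split "s3://bucket/key/prefix" into (bucket, key_prefix)."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {s3_uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Made-up URI for illustration only.
example_uri = "s3://example-bucket/sagemaker-featurestore-demo/offline-store/data"
example_bucket, example_key_prefix = split_s3_uri(example_uri)
print(example_bucket, example_key_prefix)
```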

Use Amazon Athena to Query your Encrypted Data in your Feature Store

Using Amazon Athena, we query the image data set that we stored in our feature store to demonstrate how to extract your data set of images.

[ ]:

query = feature_group.athena_query()
table = query.table_name
query_table = 'SELECT * FROM "' + table + '"'
print("Running " + query_table)
# Run the Athena query
query.run(
    query_string=query_table,
    output_location="s3://" + s3_bucket_name + "/" + prefix + "/query_results/",
)
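
SELECT * returns every column of the offline store table, which, as far as we know, includes bookkeeping columns added by Feature Store in addition to your features. If only the feature columns are wanted, the query string can name them explicitly. The build_select helper and the table name below are hypothetical; the column names come from this notebook's data frame.

```python
def quote_identifier(name):
    """Wrap an identifier in double quotes, doubling any embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_select(table_name, columns):
    """Build a SELECT statement with double-quoted identifiers."""
    quoted_columns = ", ".join(quote_identifier(name) for name in columns)
    return f"SELECT {quoted_columns} FROM {quote_identifier(table_name)}"


# Hypothetical table name; in the notebook it comes from query.table_name.
example_query = build_select("mnist_feature_group_example", ["id", "img", "shape"])
print(example_query)
```

The resulting string could be passed as query_string to query.run() in place of query_table.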

[ ]:

# Block until the Athena query finishes instead of sleeping for a fixed time.
query.wait()

[ ]:

dataset = query.as_dataframe()

[ ]:

print(dataset.dtypes)

Below is the data queried from your feature store.

[ ]:

dataset

[ ]:

def parse_show_image(df):
    """
    Return a numpy array of your images, each reshaped to its original shape.
    Parameters:
        df: dataframe of your data
    Returns:
        Numpy array
    """
    images = []
    for img_str, shape_str in zip(df["img"], df["shape"]):
        # Drop the b'...' wrapper and any escaped newlines left by the bytes cast.
        img_str = img_str.strip("b").strip("'").replace("\\n", "")
        flat = np.array(ast.literal_eval(img_str))
        shape = ast.literal_eval(shape_str)
        images.append(flat.reshape(shape[0], shape[1]))
    return np.array(images)


images = parse_show_image(dataset)

[ ]:

# Below shows the shape of your image data set.
images.shape

Plot the images to demonstrate that you can view the images stored in your feature store.

[ ]:

for img in images:
    plt.imshow(img)
    plt.show()

Clean up resources

Remove the Feature Group that was created.

[ ]:

feature_group.delete()

Next steps

In this notebook we covered how to securely store data sets of images in a feature store using a KMS key.

If you are interested in understanding more on how server-side encryption is done with Feature Store, see Feature Store: Encrypt Data in your Online or Offline Feature Store using KMS key.

If you are interested in understanding how to do client-side encryption to encrypt your image data set prior to storing it in your feature store, see Amazon SageMaker Feature Store: Client-side Encryption using AWS Encryption SDK. For more information on the AWS Encryption library, see AWS Encryption SDK library.

For detailed information about Feature Store, see the Developer Guide.
