Amazon SageMaker Feature Store: Client-side Encryption using AWS Encryption SDK

This notebook demonstrates client-side encryption with SageMaker Feature Store, using the AWS Encryption SDK library to encrypt your data prior to ingesting it into your Online or Offline Feature Store. We first demonstrate how to encrypt your data with the AWS Encryption SDK library, and then show how to use Amazon Athena to query a subset of encrypted feature columns for model training.

Currently, Feature Store supports encryption at rest and encryption in transit. With this notebook, we showcase an additional layer of security where your data is encrypted before it is stored in your Feature Store. This notebook also covers the scenario where you want to query a subset of encrypted data using Amazon Athena for model training. This becomes particularly useful when you want to store encrypted data sets in a single Feature Store and perform model training using only a subset of encrypted columns, preserving the privacy of the remaining columns.

If you are interested in server side encryption with Feature Store, see Feature Store: Encrypt Data in your Online or Offline Feature Store using KMS key.

For more information on the AWS Encryption library, see AWS Encryption SDK library.

For detailed information about Feature Store, see the Developer Guide.

Overview

  1. Set up

  2. Load in and encrypt your data using AWS Encryption library (aws-encryption-sdk)

  3. Create Feature Group and ingest your encrypted data into it

  4. Query your encrypted data in your feature store using Amazon Athena

  5. Decrypt the data you queried

Prerequisites

This notebook uses the Python SDK library for Feature Store, the AWS Encryption SDK library (aws-encryption-sdk), and the Python 3 (Data Science) kernel. To use the aws-encryption-sdk library, you need an active KMS key that you created. If you do not have a KMS key, you can create one by following the KMS Policy Template steps, or you can visit the KMS section in the console and follow the button prompts for creating a KMS key. This notebook works with SageMaker Studio, Jupyter, and JupyterLab.
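Before running the notebook, it can be useful to sanity-check that the KMS key ARN you plan to supply has the expected shape. The sketch below is an illustration only (the regex is an assumption covering the common `aws` partition; adjust it for `aws-cn` or `aws-us-gov`):

```python
import re

# Standard KMS key ARN shape: arn:aws:kms:<region>:<account-id>:key/<key-id>
KMS_ARN_PATTERN = re.compile(r"^arn:aws:kms:[a-z0-9-]+:\d{12}:key/[a-f0-9-]{36}$")


def looks_like_kms_key_arn(arn: str) -> bool:
    """Return True if `arn` matches the usual KMS key ARN format."""
    return bool(KMS_ARN_PATTERN.match(arn))
```

This does not verify that the key exists or that you have permission to use it; it only catches copy-paste mistakes early.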

Library Dependencies:

  • sagemaker>=2.0.0

  • numpy

  • pandas

  • aws-encryption-sdk

Data

This notebook uses a synthetic data set with the following features: customer_id, ssn (social security number), credit_score, and age. It simulates a simplified data set containing some important features that would be needed during the credit card approval process.
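If you want to follow along without the bundled CSV, a small stand-in frame with the same columns can be built by hand (the column names match those used later in this notebook; the values are made up):

```python
import pandas as pd

# Hypothetical stand-in for data/credit_card_approval_synthetic.csv
sample = pd.DataFrame(
    {
        "customer_id": [1001, 1002, 1003],
        "SSN": [123456789, 987654321, 555443333],
        "credit_score": [690, 740, 615],
        "age": [34, 52, 27],
    }
)
print(sample.dtypes)
```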

[ ]:
import sagemaker
import pandas as pd
import numpy as np
[ ]:
%pip install -q 'aws-encryption-sdk'

Set up

[ ]:
sagemaker_session = sagemaker.Session()
s3_bucket_name = sagemaker_session.default_bucket()
prefix = "sagemaker-featurestore-demo"
role = sagemaker.get_execution_role()
region = sagemaker_session.boto_region_name

Instantiate an Encryption SDK client and provide your KMS key ARN to the StrictAwsKmsMasterKeyProvider object. This is needed for data encryption and decryption by the AWS Encryption SDK library. You will need to substitute your KMS key ARN for kms_key.

[ ]:
import aws_encryption_sdk
from aws_encryption_sdk.identifiers import CommitmentPolicy

client = aws_encryption_sdk.EncryptionSDKClient(
    commitment_policy=CommitmentPolicy.REQUIRE_ENCRYPT_REQUIRE_DECRYPT
)

kms_key = "<your-kms-key-arn>"  ## Substitute your KMS key ARN here

kms_key_provider = aws_encryption_sdk.StrictAwsKmsMasterKeyProvider(
    key_ids=[kms_key]
)

Load in your data.

[ ]:
credit_card_data = pd.read_csv("data/credit_card_approval_synthetic.csv")
[ ]:
credit_card_data.head()
[ ]:
credit_card_data.dtypes

Client-Side Encryption Methods

Below are some methods that use the AWS Encryption SDK library for data encryption and decryption. Note that the encrypted output is of type bytes, which we convert to an integer prior to storing it in Feature Store, and convert back to bytes prior to decrypting. This is because Feature Store doesn't support the bytes format directly.

[ ]:
def encrypt_data_frame(df, columns):
    """
    Input:
    df: A pandas Dataframe
    columns: A list of column names.

    Encrypt the provided columns in df. This method assumes that column names provided in columns exist in df,
    and uses the AWS Encryption SDK library.
    """
    for col in columns:
        buffer = []
        for entry in np.array(df[col]):
            entry = str(entry)
            encrypted_entry, encryptor_header = client.encrypt(
                source=entry, key_provider=kms_key_provider
            )
            buffer.append(encrypted_entry)
        df[col] = buffer


def decrypt_data_frame(df, columns):
    """
    Input:
    df: A pandas Dataframe
    columns: A list of column names.

    Decrypt the provided columns in df. This method assumes that column names provided in columns exist in df,
    and uses the AWS Encryption SDK library.
    """
    for col in columns:
        buffer = []
        for entry in np.array(df[col]):
            decrypted_entry, decryptor_header = client.decrypt(
                source=entry, key_provider=kms_key_provider
            )
            buffer.append(float(decrypted_entry))
        df[col] = np.array(buffer)


def bytes_to_int(df, columns):
    """
    Input:
    df: A pandas Dataframe
    columns: A list of column names.

    Convert the provided columns in df of type bytes to integers. This method assumes that column names provided
    in columns exist in df and that the columns passed in are of type bytes.
    """
    for col in columns:
        buffer = []
        for entry in np.array(df[col]):
            buffer.append(int.from_bytes(entry, "little"))
        df[col] = buffer


def int_to_bytes(df, columns):
    """
    Input:
    df: A pandas Dataframe
    columns: A list of column names.

    Convert the provided columns in df of type integers to bytes. This method assumes that column names provided
    in columns exist in df and that the columns passed in are of type integers.
    """
    for col in columns:
        buffer = []
        for entry in np.array(df[col]):
            current = int(entry)
            current_bit_length = current.bit_length() + 1  # include the sign bit
            current_byte_length = (current_bit_length + 7) // 8
            buffer.append(current.to_bytes(current_byte_length, "little"))
        df[col] = pd.Series(buffer)
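The two conversion helpers are inverses of each other for the integers produced by `int.from_bytes`. A quick stand-alone round trip (no SDK calls; the byte string is a made-up stand-in for a ciphertext) illustrates the little-endian packing:

```python
# Round-trip a fake "ciphertext" through the same conversions the helpers use.
ciphertext = b"\x01\x02\x03\x04"  # stand-in for an encrypted entry

as_int = int.from_bytes(ciphertext, "little")       # store as integer
bit_length = as_int.bit_length() + 1                # reserve the sign bit
byte_length = (bit_length + 7) // 8
recovered = as_int.to_bytes(byte_length, "little")  # back to bytes

assert recovered == ciphertext
```

One caveat to be aware of: a byte string whose last byte is `0x00` loses that trailing byte in the round trip, since the integer representation drops leading zeros.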
[ ]:
## Encrypt the credit card data. Note that we treat `customer_id` as a primary key; since its encryption is unique, we can encrypt it.
encrypt_data_frame(credit_card_data, ["customer_id", "age", "SSN", "credit_score"])
[ ]:
credit_card_data
[ ]:
print(credit_card_data.dtypes)
[ ]:
## Cast encryption of type bytes to an integer so it can be stored in Feature Store.
bytes_to_int(credit_card_data, ["customer_id", "age", "SSN", "credit_score"])
[ ]:
print(credit_card_data.dtypes)
[ ]:
credit_card_data
[ ]:
def cast_object_to_string(data_frame):
    """
    Input:
    data_frame: A pandas Dataframe

    Cast all columns of data_frame of type object to type string.
    """
    for label in data_frame.columns:
        if data_frame.dtypes[label] == object:
            data_frame[label] = data_frame[label].astype("str").astype("string")
    return data_frame


credit_card_data = cast_object_to_string(credit_card_data)
[ ]:
print(credit_card_data.dtypes)
[ ]:
credit_card_data

Create your Feature Group and Ingest your encrypted data into it

Below we append the EventTime feature to your data to timestamp entries, load the feature definitions, and instantiate the FeatureGroup object. Lastly, we ingest the data into your feature store.

[ ]:
from time import gmtime, strftime, sleep

credit_card_feature_group_name = "credit-card-feature-group-" + strftime("%d-%H-%M-%S", gmtime())

Instantiate a FeatureGroup object for credit_card_data.

[ ]:
from sagemaker.feature_store.feature_group import FeatureGroup

credit_card_feature_group = FeatureGroup(
    name=credit_card_feature_group_name, sagemaker_session=sagemaker_session
)
[ ]:
import time

current_time_sec = int(round(time.time()))

## Recall that customer_id is encrypted and therefore unique, so it can be used as a record identifier.
record_identifier_feature_name = "customer_id"

Append the EventTime feature to your data frame. This feature is required and timestamps each data point.

[ ]:
credit_card_data["EventTime"] = pd.Series(
    [current_time_sec] * len(credit_card_data), dtype="float64"
)
[ ]:
credit_card_data.head()
[ ]:
print(credit_card_data.dtypes)
[ ]:
credit_card_feature_group.load_feature_definitions(data_frame=credit_card_data)
[ ]:
credit_card_feature_group.create(
    s3_uri=f"s3://{s3_bucket_name}/{prefix}",
    record_identifier_name=record_identifier_feature_name,
    event_time_feature_name="EventTime",
    role_arn=role,
    enable_online_store=False,
)
[ ]:
time.sleep(60)

Ingest your data into your feature group.

[ ]:
credit_card_feature_group.ingest(data_frame=credit_card_data, max_workers=3, wait=True)
[ ]:
time.sleep(30)

Continually check your offline store until your data is available in it.

[ ]:
s3_client = sagemaker_session.boto_session.client("s3", region_name=region)

credit_card_feature_group_s3_uri = (
    credit_card_feature_group.describe()
    .get("OfflineStoreConfig")
    .get("S3StorageConfig")
    .get("ResolvedOutputS3Uri")
)

credit_card_feature_group_s3_prefix = credit_card_feature_group_s3_uri.replace(
    f"s3://{s3_bucket_name}/", ""
)
offline_store_contents = None
while offline_store_contents is None:
    objects_in_bucket = s3_client.list_objects(
        Bucket=s3_bucket_name, Prefix=credit_card_feature_group_s3_prefix
    )
    if "Contents" in objects_in_bucket and len(objects_in_bucket["Contents"]) > 1:
        offline_store_contents = objects_in_bucket["Contents"]
    else:
        print("Waiting for data in offline store...\n")
        time.sleep(60)

print("Data available.")

Use Amazon Athena to Query your Encrypted Data in your Feature Store

Using Amazon Athena, we query the columns customer_id, age, and credit_score from the offline feature store that holds your encrypted data.

[ ]:
credit_card_query = credit_card_feature_group.athena_query()

credit_card_table = credit_card_query.table_name

query_credit_card_table = 'SELECT customer_id, age, credit_score FROM "' + credit_card_table + '"'

print("Running " + query_credit_card_table)

# Run the Athena query
credit_card_query.run(
    query_string=query_credit_card_table,
    output_location="s3://" + s3_bucket_name + "/" + prefix + "/query_results/",
)
[ ]:
time.sleep(60)
[ ]:
credit_card_dataset = credit_card_query.as_dataframe()
[ ]:
print(credit_card_dataset.dtypes)
[ ]:
credit_card_dataset
[ ]:
int_to_bytes(credit_card_dataset, ["customer_id", "age", "credit_score"])
[ ]:
credit_card_dataset
[ ]:
decrypt_data_frame(credit_card_dataset, ["customer_id", "age", "credit_score"])

In this notebook, we queried a subset of encrypted features. From here, you can train a model on this new dataset while preserving the privacy of the other columns, e.g., ssn.
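As a minimal sketch of that next step, a linear model can be fit on the decrypted columns with plain NumPy least squares. The frame and the binary approval labels below are made up for illustration, since this notebook's dataset carries no target column:

```python
import numpy as np
import pandas as pd

# Hypothetical decrypted query result; in this notebook it is credit_card_dataset.
df = pd.DataFrame(
    {
        "age": [34.0, 52.0, 27.0, 45.0],
        "credit_score": [690.0, 740.0, 615.0, 700.0],
    }
)
# Made-up binary approval labels, for illustration only.
y = np.array([1.0, 1.0, 0.0, 1.0])

# Least-squares fit of a linear approval score on the decrypted features.
X = np.column_stack([np.ones(len(df)), df["age"], df["credit_score"]])
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
predictions = X @ coeffs
print(predictions.round(2))
```

The point is simply that, once decrypted, the queried subset behaves like any other training frame; the ssn column never leaves the feature store in plaintext.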

[ ]:
credit_card_dataset

Clean Up Resources

Remove the Feature Group that was created.

[ ]:
credit_card_feature_group.delete()

Next Steps

In this notebook we covered client-side encryption with Feature Store. If you are interested in understanding how server-side encryption is done with Feature Store, see Feature Store: Encrypt Data in your Online or Offline Feature Store using KMS key.

For more information on the AWS Encryption library, see AWS Encryption SDK library.

For detailed information about Feature Store, see the Developer Guide.
