Amazon SageMaker Feature Store: Client-side Encryption using AWS Encryption SDK
This notebook demonstrates how client-side encryption with SageMaker Feature Store is done using the AWS Encryption SDK library to encrypt your data prior to ingesting it into your Online or Offline Feature Store. We first demonstrate how to encrypt your data using the AWS Encryption SDK library, and then show how to use Amazon Athena to query for a subset of encrypted columns of features for model training.
Currently, Feature Store supports encryption at rest and encryption in transit. With this notebook, we showcase an additional layer of security where your data is encrypted and then stored in your Feature Store. This notebook also covers the scenario where you want to query a subset of encrypted data using Amazon Athena for model training. This becomes particularly useful when you want to store encrypted data sets in a single Feature Store and perform model training using only a subset of encrypted columns, while preserving the privacy of the remaining columns.
If you are interested in server side encryption with Feature Store, see Feature Store: Encrypt Data in your Online or Offline Feature Store using KMS key.
For more information on the AWS Encryption library, see AWS Encryption SDK library.
For detailed information about Feature Store, see the Developer Guide.
Overview
Set up
Load in and encrypt your data using the AWS Encryption library (aws-encryption-sdk)
Create Feature Group and ingest your encrypted data into it
Query your encrypted data in your feature store using Amazon Athena
Decrypt the data you queried
Prerequisites
This notebook uses the Python SDK library for Feature Store, the AWS Encryption SDK library (aws-encryption-sdk), and the Python 3 (Data Science) kernel. To use the aws-encryption-sdk library you will need an active KMS key that you created. If you do not have a KMS key, you can create one by following the KMS Policy Template steps, or you can visit the KMS section in the console and follow the prompts for creating a KMS key. This notebook works with SageMaker Studio, Jupyter, and JupyterLab.
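If you prefer to create the KMS key programmatically rather than through the console, a minimal sketch using boto3 follows (optional; the key description is illustrative and the key is created with the default key policy):
[ ]:
import boto3

## Optional: create a symmetric KMS key for this notebook (illustrative description).
kms_client = boto3.client("kms")
response = kms_client.create_key(Description="Feature Store client-side encryption demo key")
kms_key = response["KeyMetadata"]["Arn"]
print(kms_key)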
Library Dependencies:
sagemaker>=2.0.0
numpy
pandas
aws-encryption-sdk
Data
This notebook uses a synthetic data set with the following features: customer_id, ssn (social security number), credit_score, and age. It aims to simulate a simplified data set containing some of the important features needed during the credit card approval process.
[ ]:
import sagemaker
import pandas as pd
import numpy as np
[ ]:
pip install -q 'aws-encryption-sdk'
Set up
[ ]:
sagemaker_session = sagemaker.Session()
s3_bucket_name = sagemaker_session.default_bucket()
prefix = "sagemaker-featurestore-demo"
role = sagemaker.get_execution_role()
region = sagemaker_session.boto_region_name
Instantiate an encryption SDK client and provide your KMS key ARN to the StrictAwsKmsMasterKeyProvider object. This is needed for data encryption and decryption by the AWS Encryption SDK library. You will need to substitute your KMS key ARN for kms_key.
[ ]:
import aws_encryption_sdk
from aws_encryption_sdk.identifiers import CommitmentPolicy
client = aws_encryption_sdk.EncryptionSDKClient(
commitment_policy=CommitmentPolicy.REQUIRE_ENCRYPT_REQUIRE_DECRYPT
)
kms_key_provider = aws_encryption_sdk.StrictAwsKmsMasterKeyProvider(
key_ids=[kms_key] ## Add your KMS key here
)
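As a quick, optional sanity check, you can round-trip a small test payload through the client and key provider you just configured:
[ ]:
## Optional sanity check: round-trip a small test payload through the SDK.
test_ciphertext, _ = client.encrypt(source=b"hello feature store", key_provider=kms_key_provider)
test_plaintext, _ = client.decrypt(source=test_ciphertext, key_provider=kms_key_provider)
print(test_plaintext)  ## b'hello feature store'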
Load in your data.
[ ]:
credit_card_data = pd.read_csv("data/credit_card_approval_synthetic.csv")
[ ]:
credit_card_data.head()
[ ]:
credit_card_data.dtypes
Client-Side Encryption Methods
Below are some helper methods that use the AWS Encryption SDK library for data encryption and decryption. Note that the encrypted values are of type bytes, which we convert to integers prior to storing them in Feature Store and convert back prior to decrypting. We do this conversion because Feature Store does not support the bytes format directly.
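For a concrete picture of that conversion, here is a small worked example (illustrative values only) of the bytes-to-integer round trip that the helpers below perform:
[ ]:
## Illustrative only: the bytes -> int -> bytes round trip used by the helpers below.
sample = b"\x2a\x00\x01"  # stand-in for an encrypted payload
as_int = int.from_bytes(sample, "little")  # 65578, an arbitrarily large Python int
byte_length = (as_int.bit_length() + 1 + 7) // 8  # one extra bit for the sign, as in int_to_bytes
print(as_int, as_int.to_bytes(byte_length, "little") == sample)  # 65578 True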
[ ]:
def encrypt_data_frame(df, columns):
    """
    Input:
    df: A pandas Dataframe
    columns: A list of column names.

    Encrypt the provided columns in df. This method assumes that column names provided in columns exist in df,
    and uses the AWS Encryption SDK library.
    """
    for col in columns:
        buffer = []
        for entry in np.array(df[col]):
            entry = str(entry)
            encrypted_entry, encryptor_header = client.encrypt(
                source=entry, key_provider=kms_key_provider
            )
            buffer.append(encrypted_entry)
        df[col] = buffer


def decrypt_data_frame(df, columns):
    """
    Input:
    df: A pandas Dataframe
    columns: A list of column names.

    Decrypt the provided columns in df. This method assumes that column names provided in columns exist in df,
    and uses the AWS Encryption SDK library.
    """
    for col in columns:
        buffer = []
        for entry in np.array(df[col]):
            decrypted_entry, decryptor_header = client.decrypt(
                source=entry, key_provider=kms_key_provider
            )
            buffer.append(float(decrypted_entry))
        df[col] = np.array(buffer)
def bytes_to_int(df, columns):
    """
    Input:
    df: A pandas Dataframe
    columns: A list of column names.

    Convert the provided columns in df of type bytes to integers. This method assumes that column names provided
    in columns exist in df and that the columns passed in are of type bytes.
    """
    for col in columns:
        buffer = []
        for entry in np.array(df[col]):
            buffer.append(int.from_bytes(entry, "little"))
        df[col] = buffer
def int_to_bytes(df, columns):
    """
    Input:
    df: A pandas Dataframe
    columns: A list of column names.

    Convert the provided columns in df of type integers to bytes. This method assumes that column names provided
    in columns exist in df and that the columns passed in are of type integers.
    """
    for col in columns:
        buffer = []
        for entry in np.array(df[col]):
            current = int(entry)
            current_bit_length = current.bit_length() + 1  # add one bit for the sign
            current_byte_length = (current_bit_length + 7) // 8
            buffer.append(current.to_bytes(current_byte_length, "little"))
        df[col] = pd.Series(buffer)
[ ]:
## Encrypt the credit card data. Note that we treat `customer_id` as a primary key, and since its encryption is unique we can encrypt it as well.
encrypt_data_frame(credit_card_data, ["customer_id", "age", "SSN", "credit_score"])
[ ]:
credit_card_data
[ ]:
print(credit_card_data.dtypes)
[ ]:
## Cast the encrypted values of type bytes to integers so they can be stored in Feature Store.
bytes_to_int(credit_card_data, ["customer_id", "age", "SSN", "credit_score"])
[ ]:
print(credit_card_data.dtypes)
[ ]:
credit_card_data
[ ]:
def cast_object_to_string(data_frame):
    """
    Input:
    data_frame: A pandas Dataframe

    Cast all columns of data_frame of type object to type string.
    """
    for label in data_frame.columns:
        if data_frame.dtypes[label] == object:
            data_frame[label] = data_frame[label].astype("str").astype("string")
    return data_frame

credit_card_data = cast_object_to_string(credit_card_data)
[ ]:
print(credit_card_data.dtypes)
[ ]:
credit_card_data
Create your Feature Group and Ingest your encrypted data into it
Below we start by appending the EventTime feature to your data to timestamp entries, then we load the feature definitions and instantiate the FeatureGroup object. Lastly, we ingest the data into your feature store.
[ ]:
from time import gmtime, strftime, sleep
credit_card_feature_group_name = "credit-card-feature-group-" + strftime("%d-%H-%M-%S", gmtime())
Instantiate a FeatureGroup object for credit_card_data.
[ ]:
from sagemaker.feature_store.feature_group import FeatureGroup
credit_card_feature_group = FeatureGroup(
name=credit_card_feature_group_name, sagemaker_session=sagemaker_session
)
[ ]:
import time
current_time_sec = int(round(time.time()))
## Recall that customer_id is a primary key whose encryption is unique, so it can be used as the record identifier.
record_identifier_feature_name = "customer_id"
Append the EventTime feature to your data frame. This feature is required and timestamps each data point.
[ ]:
credit_card_data["EventTime"] = pd.Series(
[current_time_sec] * len(credit_card_data), dtype="float64"
)
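EventTime can also be provided as an ISO-8601 string instead of a fractional Unix timestamp. A minimal sketch of that alternative is below, left commented out because the rest of this notebook uses the fractional form:
[ ]:
## Alternative (not used below): EventTime as an ISO-8601 string.
# credit_card_data["EventTime"] = strftime("%Y-%m-%dT%H:%M:%SZ", gmtime())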
[ ]:
credit_card_data.head()
[ ]:
print(credit_card_data.dtypes)
[ ]:
credit_card_feature_group.load_feature_definitions(data_frame=credit_card_data)
[ ]:
credit_card_feature_group.create(
s3_uri=f"s3://{s3_bucket_name}/{prefix}",
record_identifier_name=record_identifier_feature_name,
event_time_feature_name="EventTime",
role_arn=role,
enable_online_store=False,
)
[ ]:
time.sleep(60)
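As an alternative to a fixed sleep, you can poll the feature group status until creation completes:
[ ]:
## Optional: poll until the feature group finishes creating.
status = credit_card_feature_group.describe().get("FeatureGroupStatus")
while status == "Creating":
    print("Waiting for feature group creation...")
    time.sleep(5)
    status = credit_card_feature_group.describe().get("FeatureGroupStatus")
print(f"Feature group status: {status}")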
Ingest your data into your feature group.
[ ]:
credit_card_feature_group.ingest(data_frame=credit_card_data, max_workers=3, wait=True)
[ ]:
time.sleep(30)
Continually check your offline store until your data is available in it.
[ ]:
s3_client = sagemaker_session.boto_session.client("s3", region_name=region)
credit_card_feature_group_s3_uri = (
credit_card_feature_group.describe()
.get("OfflineStoreConfig")
.get("S3StorageConfig")
.get("ResolvedOutputS3Uri")
)
credit_card_feature_group_s3_prefix = credit_card_feature_group_s3_uri.replace(
f"s3://{s3_bucket_name}/", ""
)
offline_store_contents = None
while offline_store_contents is None:
    objects_in_bucket = s3_client.list_objects(
        Bucket=s3_bucket_name, Prefix=credit_card_feature_group_s3_prefix
    )
    if "Contents" in objects_in_bucket and len(objects_in_bucket["Contents"]) > 1:
        offline_store_contents = objects_in_bucket["Contents"]
    else:
        print("Waiting for data in offline store...\n")
        time.sleep(60)

print("Data available.")
Use Amazon Athena to Query your Encrypted Data in your Feature Store
Using Amazon Athena, we query the columns customer_id, age, and credit_score from your offline feature store, where your encrypted data is stored.
[ ]:
credit_card_query = credit_card_feature_group.athena_query()
credit_card_table = credit_card_query.table_name
query_credit_card_table = 'SELECT customer_id, age, credit_score FROM "' + credit_card_table + '"'
print("Running " + query_credit_card_table)
# Run the Athena query
credit_card_query.run(
query_string=query_credit_card_table,
output_location="s3://" + s3_bucket_name + "/" + prefix + "/query_results/",
)
[ ]:
time.sleep(60)
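As an alternative to the fixed sleep above, the query object returned by athena_query() can be waited on directly (assuming a SageMaker SDK version that provides AthenaQuery.wait()):
[ ]:
## Optional: block until the Athena query finishes instead of sleeping.
credit_card_query.wait()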
[ ]:
credit_card_dataset = credit_card_query.as_dataframe()
[ ]:
print(credit_card_dataset.dtypes)
[ ]:
credit_card_dataset
[ ]:
int_to_bytes(credit_card_dataset, ["customer_id", "age", "credit_score"])
[ ]:
credit_card_dataset
[ ]:
decrypt_data_frame(credit_card_dataset, ["customer_id", "age", "credit_score"])
In this notebook, we queried a subset of encrypted features. From here, you can train a model on this new dataset while preserving the privacy of the other columns, e.g., ssn.
[ ]:
credit_card_dataset
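As an illustration of what training on this decrypted subset could look like, here is a minimal, hypothetical sketch using scikit-learn. Treating credit_score as the target and age as the only feature is an assumption made purely for demonstration:
[ ]:
## Hypothetical sketch only: fit a simple model on the decrypted subset.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X = credit_card_dataset[["age"]]  # assumed feature column
y = credit_card_dataset["credit_score"]  # assumed target column
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
print("R^2 on held-out data:", model.score(X_test, y_test))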
Clean Up Resources
Remove the Feature Group that was created.
[ ]:
credit_card_feature_group.delete()
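Note that deleting the feature group does not remove the data already written to the offline store in S3. If you also want to clean up those objects, a sketch using the s3_client, s3_bucket_name, and credit_card_feature_group_s3_prefix variables defined earlier is shown below; verify the prefix before running it:
[ ]:
## Optional: also remove the offline store objects from S3 (verify the prefix first).
objects_to_delete = s3_client.list_objects_v2(
    Bucket=s3_bucket_name, Prefix=credit_card_feature_group_s3_prefix
)
for obj in objects_to_delete.get("Contents", []):
    s3_client.delete_object(Bucket=s3_bucket_name, Key=obj["Key"])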
Next Steps
In this notebook we covered client-side encryption with Feature Store. If you are interested in understanding how server-side encryption is done with Feature Store, see Feature Store: Encrypt Data in your Online or Offline Feature Store using KMS key.
For more information on the AWS Encryption library, see AWS Encryption SDK library.
For detailed information about Feature Store, see the Developer Guide.