Amazon SageMaker JumpStart - Text Embedding & Sentence Similarity

1. Set Up
2. Select a model
3. Deploy an endpoint & Query endpoint
4. Getting Nearest Neighbor On Your Own Dataset
5. Getting the Accuracy of deployed model on the Amazon_SageMaker_FAQs dataset
6. Run Batch Transform

1. Set Up

[ ]:

%pip install --upgrade sagemaker --quiet

To train and host on Amazon SageMaker, we need to set up and authenticate the use of AWS services. Here, we use the execution role associated with the current notebook instance as the AWS account role with SageMaker access. It has the necessary permissions, including access to your data in S3.

[ ]:

import sagemaker, boto3, json
from sagemaker.session import Session

# Create one SageMaker session and reuse it throughout the notebook
sess = Session()
sagemaker_session = sess
aws_role = sess.get_caller_identity_arn()
aws_region = boto3.Session().region_name

2. Select a pre-trained model

[ ]:

model_id = "huggingface-sentencesimilarity-gte-small"

[ ]:

import IPython
from ipywidgets import Dropdown
from sagemaker.jumpstart.notebook_utils import list_jumpstart_models
from sagemaker.jumpstart.filters import And


filter_value = And("task == sentencesimilarity", "framework == huggingface")
ss_models = list_jumpstart_models(filter=filter_value)

dropdown = Dropdown(
    value=model_id,
    options=ss_models,
    description="SageMaker Pre-Trained Sentence Similarity Models:",
    style={"description_width": "initial"},
    layout={"width": "max-content"},
)
display(IPython.display.Markdown("## Select a pre-trained model from the dropdown below"))
display(dropdown)

3. Deploy an Endpoint & Query Endpoint

Using SageMaker, we can perform inference on the pre-trained model.

[ ]:

# Deploying the model
from sagemaker.jumpstart.model import JumpStartModel
from sagemaker.serializers import JSONSerializer

# The model is deployed on an ml.g5.2xlarge instance. For all parameters supported by the JumpStartModel
# class, see https://sagemaker.readthedocs.io/en/stable/api/inference/model.html#sagemaker.jumpstart.model.JumpStartModel
my_model = JumpStartModel(model_id=dropdown.value)
predictor = my_model.deploy()

3.1 Query Endpoint to Get Embeddings

You can query the endpoint with a batch of input texts in a JSON payload. Here, we send a single request to the endpoint, and the parsed response is a list of embedding vectors, one per input text.

[ ]:

text1 = "How cute your dog is!"
text2 = "Your dog is so cute."
text3 = "The mitochondria is the powerhouse of the cell."

payload = [text1, text2, text3]

predictor.predict(json.dumps(payload).encode("utf-8"))
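
Once the embedding vectors are returned, sentence similarity can be measured by comparing them. The following is a minimal sketch, assuming cosine similarity as the comparison measure. The three-dimensional `toy_embeddings` are made-up stand-ins for the much longer vectors the endpoint returns, and `cosine_similarity` is an illustrative helper, not part of the SageMaker SDK.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors (illustrative values only, not real model output).
toy_embeddings = [
    [0.9, 0.1, 0.0],  # stands in for text1
    [0.8, 0.2, 0.1],  # stands in for text2
    [0.0, 0.1, 0.9],  # stands in for text3
]

sim_12 = cosine_similarity(toy_embeddings[0], toy_embeddings[1])
sim_13 = cosine_similarity(toy_embeddings[0], toy_embeddings[2])
print(f"text1 vs text2: {sim_12:.3f}")
print(f"text1 vs text3: {sim_13:.3f}")
```

With real embeddings, the two sentences about the dog would be expected to score closer to each other than either does to the sentence about mitochondria.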

3.2 Query endpoint for Getting Nearest Neighbor

The deployed model can identify the nearest neighbors of input queries within a corpus. Given queries and a corpus, the model returns a list. For each query, each result contains the corpus_id, which is the position of the matching entry in the input corpus list, and a score indicating how close that entry is to the query. Keep in mind that requests to the SageMaker invoke endpoint are restricted to payloads of approximately 5 MB, and the request timeout is 1 minute. If your corpus exceeds these limits, use the approach outlined in the “4. Getting Nearest Neighbor On Your Own Dataset” section.

- corpus: the list of inputs from which to find the nearest neighbors.
- queries: the list of inputs for which to find the nearest neighbors in the corpus.
- top_k: the number of nearest neighbors to return from the corpus.
- mode: set to “nn_corpus” to get the nearest neighbors of the input queries within the corpus.
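
Because of the payload limit, it can help to check the size of a request before sending it. The sketch below measures the JSON-encoded payload in bytes. The constant `MAX_PAYLOAD_BYTES` and both helper functions are hypothetical names, and the 5 MB figure is simply the approximate limit quoted in this section, not a value read from the service.

```python
import json

# Approximate limit quoted in this section; treat it as an assumption.
MAX_PAYLOAD_BYTES = 5 * 1024 * 1024


def payload_size_bytes(payload):
    """Return the size of the JSON-encoded payload in bytes (UTF-8)."""
    return len(json.dumps(payload).encode("utf-8"))


def fits_in_single_request(payload, limit=MAX_PAYLOAD_BYTES):
    """Return True if the encoded payload is within the given byte limit."""
    return payload_size_bytes(payload) <= limit


example_payload = {
    "corpus": ["Amazon SageMaker is a fully managed service."],
    "queries": ["What is Amazon SageMaker?"],
    "top_k": 1,
    "mode": "nn_corpus",
}

print(payload_size_bytes(example_payload), fits_in_single_request(example_payload))
```

If the check fails, the corpus is a candidate for the training-job approach in section 4 rather than a single request.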

[ ]:

from sagemaker.serializers import JSONSerializer

predictor.serializer = JSONSerializer()
predictor.content_type = "application/json"

corpus = [
    "Amazon SageMaker is a fully managed service to prepare data and build, train, and deploy machine learning (ML) models for any use case with fully managed infrastructure, tools, and workflows.",
    "Amazon SageMaker stores code in ML storage volumes, secured by security groups and optionally encrypted at rest.",
    "Amazon SageMaker provides a full end-to-end workflow, but you can continue to use your existing tools with SageMaker. You can easily transfer the results of each stage in and out of SageMaker as your business requirements dictate.",
]
queries = [
    "What is Amazon SageMaker?",
    "How does Amazon SageMaker secure my code?",
    "What if I have my own notebook, training, or hosting environment?",
]

payload_nearest_neighbour = {"corpus": corpus, "queries": queries, "top_k": 3, "mode": "nn_corpus"}

query_response = predictor.predict(payload_nearest_neighbour)
print(query_response)
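
To make the response easier to read, each corpus_id can be mapped back to the corpus sentence it points to. This is a sketch under an assumed response shape: one list per query, each holding dictionaries with `corpus_id` and `score` keys, as described above. `sample_response` and `sample_corpus` are made-up data, and `attach_corpus_text` is a hypothetical helper.

```python
def attach_corpus_text(response, corpus):
    """Pair each hit with the corpus sentence its corpus_id refers to."""
    results = []
    for hits in response:
        results.append(
            [{"text": corpus[int(hit["corpus_id"])], "score": hit["score"]} for hit in hits]
        )
    return results


# Made-up data in the assumed response shape (not real endpoint output).
sample_corpus = ["first sentence", "second sentence"]
sample_response = [
    [{"corpus_id": 1, "score": 0.91}, {"corpus_id": 0, "score": 0.40}],
]

for query_hits in attach_corpus_text(sample_response, sample_corpus):
    for hit in query_hits:
        print(f"{hit['score']:.2f}  {hit['text']}")
```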

Clean up the endpoint

[ ]:

# Delete the SageMaker endpoint
predictor.delete_model()
predictor.delete_endpoint()

4. Getting Nearest Neighbor On Your Own Dataset

To find the nearest neighbors in your own dataset, provide the dataset to a training job in the format described below. The training job generates embeddings for your dataset and saves them along with the model. During inference, these embeddings are used to find the nearest neighbors of an input sentence. Once the embeddings are available, the search is carried out using Sentence Transformers and its util function. The nearest neighbor is determined by the cosine similarity between the embedding of the input sentence and the sentence embeddings precomputed during the training job.
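
The idea of searching precomputed embeddings can be sketched in plain Python. This is not the Sentence Transformers util function used by the model, only a stand-in that illustrates the same principle: stored vectors are normalized once, so cosine similarity reduces to a dot product, and the top_k highest scores are returned. All names and the toy vectors are illustrative assumptions.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def nearest_neighbors(query_vec, stored, top_k=1):
    """Return [(id, score)] for the top_k stored vectors, best match first.

    stored maps an id to a unit-length embedding, so the dot product with a
    normalized query equals the cosine similarity.
    """
    q = normalize(query_vec)
    scored = ((sum(a * b for a, b in zip(q, emb)), item_id) for item_id, emb in stored.items())
    best = heapq.nlargest(top_k, scored)
    return [(item_id, score) for score, item_id in best]


# Toy "precomputed" embeddings keyed by row id (illustrative values only).
stored_embeddings = {
    0: normalize([1.0, 0.0, 0.2]),
    1: normalize([0.1, 1.0, 0.0]),
    2: normalize([0.0, 0.2, 1.0]),
}

print(nearest_neighbors([0.9, 0.1, 0.1], stored_embeddings, top_k=2))
```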

Required Data Format for the training job

Input: A directory containing a ‘data.csv’ file.
- Each row of the first column of ‘data.csv’ should contain a unique ID.
- Each row of the second column should contain the corresponding text.
Output: A model prepackaged with the input data embeddings, which can be deployed for inference to get the nearest neighbor embedding ID for an input sentence.

Below is an example of ‘data.csv’ file showing values in its first two columns. Note that the file should not have any header.

1	“Amazon SageMaker is a fully managed service to prepare data and build, train, and deploy machine learning (ML) models for any use case with fully managed infrastructure, tools, and workflows.”
2	“For a list of the supported Amazon SageMaker AWS Regions, please visit the AWS Regional Services page. Also, for more information, see Regional endpoints in the AWS general reference guide.”
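
Before uploading, the file can be checked against these requirements. The following sketch uses the standard csv module to confirm that every row has at least two columns, that IDs are unique, and that the text is not empty. It does not try to detect a stray header row. `validate_rows` and the sample strings are hypothetical, and the check reads from an in-memory string rather than a real ‘data.csv’.

```python
import csv
import io


def validate_rows(csv_text):
    """Return a list of problems found; an empty list means the rows look valid."""
    problems = []
    seen_ids = set()
    for line_no, row in enumerate(csv.reader(io.StringIO(csv_text)), start=1):
        if len(row) < 2:
            problems.append(f"line {line_no}: expected at least 2 columns, got {len(row)}")
            continue
        row_id, text = row[0], row[1]
        if row_id in seen_ids:
            problems.append(f"line {line_no}: duplicate id {row_id!r}")
        seen_ids.add(row_id)
        if not text.strip():
            problems.append(f"line {line_no}: empty text")
    return problems


# Sample content (illustrative only).
good = '0,"Amazon SageMaker is a fully managed service."\n1,"See the AWS Regional Services page."\n'
bad = '0,"First answer"\n0,"Second answer"\n2,""\n'

print(validate_rows(good))
print(validate_rows(bad))
```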

4.1. Getting Dataset

In this section, we fetch and prepare the Amazon_SageMaker_FAQs dataset so that it can be used to find the nearest neighbor to an input question.

[ ]:

# Getting the Data for Training
!aws s3 cp s3://jumpstart-cache-prod-us-west-2/training-datasets/Amazon_SageMaker_FAQs/Amazon_SageMaker_FAQs.csv Amazon_SageMaker_FAQs.csv

[ ]:

# Preparing the Data in the required format

import pandas as pd

data = pd.read_csv("Amazon_SageMaker_FAQs.csv", names=["Questions", "Answers"])
data["id"] = data.index

data_req = data[["id", "Answers"]]

data_req.to_csv("data.csv", index=False, header=False)

[ ]:

# Uploading the data in required format to s3 Bucket
output_bucket = sess.default_bucket()
output_prefix = "jumpstart-example-ss-training"

s3_output_location = f"s3://{output_bucket}/{output_prefix}/output"
training_dataset_s3_path = f"s3://{output_bucket}/{output_prefix}/data/data.csv"

!aws s3 cp data.csv {training_dataset_s3_path}

4.2. Set Training parameters

There are two kinds of parameters that need to be set for training.

The first set is the parameters for the training job:
- Training data path: the S3 folder in which the input data is stored.
- Output path: the S3 folder in which the training output is stored.
- Training instance type: the type of machine on which to run the training. Typically, GPU instances are used for this kind of training.

The second set is the algorithm-specific training hyperparameters.

[ ]:

from sagemaker import hyperparameters

# Retrieve the default hyper-parameters for fine-tuning the model
hyperparameters = hyperparameters.retrieve_default(model_id=dropdown.value, model_version="*")

# [Optional] Override default hyperparameters with custom values.
# max_seq_length is the maximum sequence length of the input processed by the embedding model;
# its default value of None uses the model's own default max_seq_length.
# Here we override only batch_size.
hyperparameters["batch_size"] = "64"
print(hyperparameters)

4.3. Getting the Embeddings for the Input Data

We start by creating the estimator object with all the required assets and then launch the training job.

[ ]:

from sagemaker.jumpstart.estimator import JumpStartEstimator

estimator = JumpStartEstimator(
    model_id=dropdown.value, hyperparameters=hyperparameters, output_path=s3_output_location
)

[ ]:

# Launch a SageMaker Training job by passing s3 path of the data
estimator.fit({"training": f"s3://{output_bucket}/{output_prefix}/data"}, logs=True)

4.4. Deploy & run Inference on the model

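The trained model can now be deployed and used for inference. It supports two inference modes: returning embeddings for the input text, and returning the nearest neighbors from the dataset provided to the training job. Deployment follows the same steps as in section 3, Deploy an Endpoint & Query Endpoint.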

[ ]:

# Use the estimator from the previous step to deploy to a SageMaker endpoint
predictor = estimator.deploy()

4.5 Query endpoint

You can query the endpoint with a batch of input texts in a JSON payload. Here, we send a single request to the endpoint, and the parsed response is a list of embedding vectors.

[ ]:

payload = ["Is R supported with Amazon SageMaker?"]

response = predictor.predict(json.dumps(payload).encode("utf-8"))

print(response)

Query Endpoint to Get Nearest Neighbor

You can also query the endpoint with a JSON payload containing a batch of input texts to find their nearest neighbors in the dataset provided during the training job.

- queries: the list of inputs for which to find the closest match in the training data.
- top_k: the number of closest matches to return from the training data.
- mode: set to “nn_train_data” to get the nearest neighbors of the input queries within the provided dataset.

[ ]:

from sagemaker.serializers import JSONSerializer

newline = "\n"
predictor.serializer = JSONSerializer()
predictor.content_type = "application/json"

payload_nearest_neighbour = {
    "queries": ["Is R supported with Amazon SageMaker?"],
    "top_k": 1,
    "mode": "nn_train_data",
}

response = predictor.predict(payload_nearest_neighbour)

print("The nearest neighbour for the input question is - ", response)

question = payload_nearest_neighbour["queries"][0]
answer = data["Answers"].iloc[int(response[0][0]["id"])]
# Relating the Input Question with the Answer
print(f"The input Question is: {question}{newline}" f"The Corresponding Answer is: {answer}")

5. Getting the Accuracy of deployed model on the Amazon_SageMaker_FAQs dataset

We query the endpoint with each question in the Amazon_SageMaker_FAQs dataset and check whether the sentence similarity model returns the correct corresponding answer.

[ ]:

total_correct_answers = 0
for i in range(len(data)):
    question = data["Questions"].iloc[i]
    payload_nearest_neighbour = {"queries": [question], "top_k": 1, "mode": "nn_train_data"}
    response = predictor.predict(payload_nearest_neighbour)

    response_id = response[0][0]["id"]

    if int(response_id) == i:
        total_correct_answers += 1

print(
    f"The accuracy of the model on the Amazon_SageMaker_FAQs dataset is: {total_correct_answers*100/len(data)}"
)
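
The accuracy calculation above can also be separated from the endpoint calls, which makes it easy to check on its own. This is a small sketch with a hypothetical helper, `top1_accuracy`, applied to made-up ID lists. It assumes, as in the loop above, that a prediction is correct when the returned ID equals the row index of the question.

```python
def top1_accuracy(predicted_ids, expected_ids):
    """Return the percentage of positions where the predicted id equals the expected id."""
    if len(predicted_ids) != len(expected_ids):
        raise ValueError("predicted_ids and expected_ids must have the same length")
    if not expected_ids:
        return 0.0
    correct = sum(int(p) == int(e) for p, e in zip(predicted_ids, expected_ids))
    return 100.0 * correct / len(expected_ids)


# Made-up ids: the third prediction is wrong, the others are right.
predicted = ["0", "1", "3", "3"]
expected = [0, 1, 2, 3]

print(f"accuracy: {top1_accuracy(predicted, expected):.1f}%")
```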

[ ]:

# Delete the SageMaker endpoint and the attached resources
predictor.delete_model()
predictor.delete_endpoint()

6. Run Batch Transform to Get Embeddings On Large Datasets

Using SageMaker, we can perform batch inference on the model for large datasets. In this example, that means producing an embedding for each input sentence. When you start a batch transform job, Amazon SageMaker launches the compute resources needed to process the data, including CPU or GPU instances depending on the selected instance type, and manages the instances, storage, and networking resources for the duration of the job. Once the batch transform job has completed, SageMaker automatically cleans up these resources: the instances and storage used during the job are terminated and removed, freeing up resources and minimizing costs.

Batch Transform is useful in the following scenarios:
- Preprocess datasets to remove noise or bias that interferes with training or inference from your dataset.
- Get inferences from large datasets.
- Run inference when you don’t need a persistent endpoint.
- Associate input records with inferences to assist the interpretation of results.

The input format for the batch transform job is a JSON Lines (jsonl) file with one entry per line:
- {"id":1,"text_inputs":"How cute your dog is!"}
- {"id":2,"text_inputs":"The mitochondria is the powerhouse of the cell."}

The output format is:
- {"id":1, "embedding":[0.025507507845759392, 0.009654928930103779, -0.01139055471867323, ...]}
- {"id":2, "embedding":[-0.018594933673739433, -0.011756304651498795, -0.006888044998049736, ...]}
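
The one-record-per-line layout can be illustrated with a short round trip. The sketch below writes records in the input format to an in-memory buffer and reads them back. `write_jsonl` and `read_jsonl` are hypothetical helpers, and the buffer stands in for a real file.

```python
import io
import json


def write_jsonl(records, fileobj):
    """Write one JSON object per line."""
    for record in records:
        fileobj.write(json.dumps(record) + "\n")


def read_jsonl(fileobj):
    """Parse one JSON object per non-blank line."""
    return [json.loads(line) for line in fileobj if line.strip()]


inputs = [
    {"id": 1, "text_inputs": "How cute your dog is!"},
    {"id": 2, "text_inputs": "The mitochondria is the powerhouse of the cell."},
]

buffer = io.StringIO()
write_jsonl(inputs, buffer)
buffer.seek(0)
parsed = read_jsonl(buffer)
print(parsed == inputs)
```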

6.1. Prepare data for Batch Transform

[ ]:

s3_input_data_path = f"s3://{output_bucket}/{output_prefix}/batch_input/"
s3_output_data_path = f"s3://{output_bucket}/{output_prefix}/batch_output/"

[ ]:

import json
import boto3
import os

data = pd.read_csv("Amazon_SageMaker_FAQs.csv", names=["Questions", "Answers"])

# Provide the test data and the ground truth file name
test_data_file_name = "test.jsonl"

test_data = []

# We will go over each data entry and create the data in the input required format as described above
for i in range(len(data)):
    answer = data.loc[i, "Answers"]
    payload = {"id": i, "text_inputs": answer}
    test_data.append(payload)

with open(test_data_file_name, "w") as outfile:
    for entry in test_data:
        outfile.write(f"{json.dumps(entry)}\n")

# Uploading the data
s3 = boto3.client("s3")
s3.upload_file(test_data_file_name, output_bucket, f"{output_prefix}/batch_input/test.jsonl")

6.2. Run Batch Transform

[ ]:

# Creating the batch transformer object. If you have a large dataset you can
# divide it into smaller chunks and use more instances for faster inference
my_model = JumpStartModel(model_id=dropdown.value, model_version="1.*")

batch_transformer = my_model.transformer(
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    output_path=s3_output_data_path,
    assemble_with="Line",
    accept="text/csv",
    max_payload=1,
)

# Making the predictions on the input data
batch_transformer.transform(
    s3_input_data_path, content_type="application/jsonlines", split_type="Line"
)

batch_transformer.wait()
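
As the comment in the cell above suggests, a large dataset can be divided into smaller chunks. One possible way to do this, sketched below, is to split the records into fixed-size groups and give each group its own file name. The helper names and the `test_part_NNNN.jsonl` naming scheme are assumptions for illustration. Each chunk would still need to be written as JSON Lines and uploaded under the batch input prefix.

```python
def chunk_records(records, chunk_size):
    """Split a list of records into consecutive chunks of at most chunk_size items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [records[i : i + chunk_size] for i in range(0, len(records), chunk_size)]


def chunk_file_names(num_chunks, prefix="test_part"):
    """Return one hypothetical file name per chunk."""
    return [f"{prefix}_{i:04d}.jsonl" for i in range(num_chunks)]


# Made-up records in the batch transform input format.
records = [{"id": i, "text_inputs": f"sentence {i}"} for i in range(10)]

chunks = chunk_records(records, chunk_size=4)
names = chunk_file_names(len(chunks))

for name, chunk in zip(names, chunks):
    print(name, [record["id"] for record in chunk])
```

Whether several instances actually speed up the job depends on how the input objects are distributed across them, so the chunk size is worth tuning against your own data.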

[ ]:

# Downloading the Generated Embeddings

# Downloading the predictions
s3.download_file(
    output_bucket, output_prefix + "/batch_output/" + "test.jsonl.out", "predict.jsonl"
)

with open("predict.jsonl", "r") as json_file:
    json_list = list(json_file)

[ ]:

# Creating the predictions list which can be used to extract the embeddings given the id
import ast

predict_dict_list = []
for predict in json_list:
    if len(predict) > 1:
        predict_dict = ast.literal_eval(predict)
        predict_dict_req = {
            "id": predict_dict["id"],
            "embedding": predict_dict["embedding"],
        }
        predict_dict_list.append(predict_dict_req)
