Introduction to JumpStart - Text Embedding

  1. Set Up

  2. Select a model

  3. Retrieve JumpStart Artifacts & Deploy an Endpoint

  4. Query endpoint and parse response

  5. Semantic Textual Similarity

  6. Clean up the endpoint

Note: This notebook was tested on an ml.t3.medium instance in Amazon SageMaker Studio with the Python 3 (Data Science) kernel, and in an Amazon SageMaker Notebook instance with the conda_python3 kernel.

1. Set Up

[ ]:
!pip install sagemaker ipywidgets --upgrade --quiet

Permissions and environment variables

[ ]:
import json

import boto3
import sagemaker
from sagemaker import get_execution_role

aws_role = get_execution_role()
aws_region = boto3.Session().region_name
sess = sagemaker.Session()
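
Note that get_execution_role() only works inside SageMaker (Studio or a Notebook instance). If you run this notebook elsewhere, you would instead pass the ARN of an IAM role with SageMaker permissions; the ARN below is a placeholder, not a real role.

[ ]:
# Only needed outside SageMaker, where get_execution_role() raises an error.
# Replace the placeholder with the ARN of a role that can create SageMaker
# endpoints and read the JumpStart S3 buckets.
# aws_role = "arn:aws:iam::<account-id>:role/<your-sagemaker-execution-role>"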

2. Select a model


Here, we download the JumpStart model_manifest file from the JumpStart S3 bucket, filter out all the Text Embedding models from the manifest, and select a model for inference.

[ ]:
from ipywidgets import Dropdown

# download JumpStart model_manifest file.
boto3.client("s3").download_file(
    f"jumpstart-cache-prod-{aws_region}", "models_manifest.json", "models_manifest.json"
)
with open("models_manifest.json", "rb") as json_file:
    model_list = json.load(json_file)

# filter out all the Text Embedding models from the manifest list.
text_embedding_models = []
for model in model_list:
    model_id = model["model_id"]
    if "-tcembedding-" in model_id and model_id not in text_embedding_models:
        text_embedding_models.append(model_id)

# display the model-ids in a dropdown to select a model for inference.
model_dropdown = Dropdown(
    options=text_embedding_models,
    value="tensorflow-tcembedding-bert-en-uncased-L-10-H-128-A-2-2",
    description="Select a model",
    style={"description_width": "initial"},
    layout={"width": "max-content"},
)

Choose a model for inference

[ ]:
display(model_dropdown)
[ ]:
# model_version="*" fetches the latest version of the model
model_id, model_version = model_dropdown.value, "*"
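
As an alternative to parsing the manifest by hand, newer versions of the SageMaker Python SDK ship notebook utilities for listing JumpStart models. A minimal sketch, assuming your installed SDK version provides sagemaker.jumpstart.notebook_utils:

[ ]:
# Optional alternative: list Text Embedding models through the SDK itself.
# "tcembedding" is the task token that appears in the model ids above.
from sagemaker.jumpstart.notebook_utils import list_jumpstart_models

print(list_jumpstart_models(filter="task == tcembedding"))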

3. Retrieve JumpStart Artifacts & Deploy an Endpoint


We start by retrieving deploy_image_uri, deploy_source_uri, and model_uri for the pre-trained model. To host the pre-trained model, we create an instance of `sagemaker.model.Model <https://sagemaker.readthedocs.io/en/stable/api/inference/model.html>`__ and deploy it. This may take a few minutes.

[ ]:
from sagemaker import image_uris, model_uris, script_uris, hyperparameters
from sagemaker.model import Model
from sagemaker.predictor import Predictor
from sagemaker.utils import name_from_base


endpoint_name = name_from_base(f"jumpstart-example-infer-{model_id}")

inference_instance_type = "ml.p2.xlarge"

# Retrieve the inference Docker container URI. This is the base TensorFlow container image for the default model selected above.
deploy_image_uri = image_uris.retrieve(
    region=None,
    framework=None,  # automatically inferred from model_id
    image_scope="inference",
    model_id=model_id,
    model_version=model_version,
    instance_type=inference_instance_type,
)

# Retrieve the inference script uri. This includes all dependencies and scripts for model loading, inference handling etc.
deploy_source_uri = script_uris.retrieve(
    model_id=model_id, model_version=model_version, script_scope="inference"
)


# Retrieve the model uri. This includes the model and model parameters.
model_uri = model_uris.retrieve(
    model_id=model_id, model_version=model_version, model_scope="inference"
)


# Create the SageMaker model instance
model = Model(
    image_uri=deploy_image_uri,
    source_dir=deploy_source_uri,
    model_data=model_uri,
    entry_point="inference.py",  # entry point file in source_dir and present in deploy_source_uri
    role=aws_role,
    predictor_cls=Predictor,
    name=endpoint_name,
)

# Deploy the Model. Note that we need to pass the Predictor class when we deploy the model
# through the Model class, so that we can run inference through the SageMaker API.
model_predictor = model.deploy(
    initial_instance_count=1,
    instance_type=inference_instance_type,
    predictor_cls=Predictor,
    endpoint_name=endpoint_name,
)
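
Note that deploy() blocks until the endpoint reaches the InService state. If you prefer to verify the status yourself (for example, when deploying asynchronously with wait=False), a minimal check with the low-level SageMaker client looks like this:

[ ]:
# Confirm the endpoint is ready to serve traffic.
sm_client = boto3.client("sagemaker", region_name=aws_region)
status = sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
print(f"Endpoint {endpoint_name} is {status}")  # expect 'InService' once ready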

4. Query endpoint and parse response

[ ]:
def query(model_predictor, text):
    """Query the model predictor."""

    encoded_text = json.dumps(text).encode("utf-8")

    query_response = model_predictor.predict(
        encoded_text,
        {
            "ContentType": "application/x-text",
            "Accept": "application/json",
        },
    )
    return query_response


def parse_response(query_response):
    """Parse the response and return the embedding."""

    model_predictions = json.loads(query_response)
    embedding = model_predictions["embedding"]
    return embedding
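
If you prefer not to go through the Predictor, the same request can be made directly against the endpoint with the low-level runtime client. A sketch using the endpoint_name and content types from above:

[ ]:
# Equivalent query through the SageMaker runtime API.
runtime_client = boto3.client("sagemaker-runtime", region_name=aws_region)
response = runtime_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/x-text",
    Accept="application/json",
    Body=json.dumps("hello world").encode("utf-8"),
)
print(json.loads(response["Body"].read())["embedding"][:5])
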
[ ]:
newline, bold, unbold = "\n", "\033[1m", "\033[0m"

input_text = "astonishing ... ( frames ) profound ethical and philosophical questions in the form of dazzling pop entertainment"


query_response = query(model_predictor, input_text)

embedding = parse_response(query_response)

print(
    f"{bold}Inference{unbold}:{newline}"
    f"{bold}Input text sentence{unbold}: '{input_text}'{newline}"
    f"{bold}The first 5 elements of sentence embedding{unbold}: {embedding[:5]}{newline}"
    f"{bold}Sentence embedding size{unbold}: {len(embedding)}{newline}"
)

5. Semantic Textual Similarity

A common use case of sentence embeddings is to cluster sentences with similar semantic meaning. In the example below we compute the embeddings of sentences in three categories: pets, cities in the U.S., and colors. We see that sentences from the same category have much closer embedding vectors than those from different categories.

Specifically, the code will do the following:

* The endpoint that you have created above outputs an embedding vector for each sentence;
* the distance between any pair of sentences is computed by the cosine similarity of the corresponding embedding vectors;
* a heatmap is created to visualize the distance between any pair of sentences in the embedding space. The darker the color, the larger the cosine similarity (the smaller the distance).

Note: The cosine similarity of two vectors is the inner product of the normalized vectors (scaled to have length 1).
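
As a quick numerical check of that identity, here is a standalone sketch with two made-up vectors:

[ ]:
import numpy as np

# Two made-up vectors, for illustration only.
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.0, 1.0])

# Cosine similarity: inner product divided by the product of the norms...
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# ...equals the inner product of the unit-normalized vectors.
cos_sim_unit = np.dot(a / np.linalg.norm(a), b / np.linalg.norm(b))

assert np.isclose(cos_sim, cos_sim_unit)
print(round(cos_sim, 3))  # 0.598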

[ ]:
from sklearn.preprocessing import normalize
import numpy as np
import seaborn as sns


def plot_similarity_heatmap(text_labels, embeddings, rotation):
    """Take sentences, embeddings, and rotation as input and plot a similarity heatmap.

    Args:
      text_labels: a list of sentences to compare for semantic textual similarity.
      embeddings: a list of embedding vectors, each of which corresponds to a sentence.
      rotation: rotation used for display of the text_labels.
    """
    inner_product = np.inner(embeddings, embeddings)
    sns.set(font_scale=1.1)
    graph = sns.heatmap(
        inner_product,
        xticklabels=text_labels,
        yticklabels=text_labels,
        vmin=np.min(inner_product),
        vmax=1,
        cmap="OrRd",
    )
    graph.set_xticklabels(text_labels, rotation=rotation)
    graph.set_title("Semantic Textual Similarity Between Sentences")


sentences = [
    # Pets
    "Your dog is so cute.",
    "How cute your dog is!",
    "You have such a cute dog!",
    # Cities in the US
    "New York City is the place where I work.",
    "I work in New York City.",
    # Color
    "What color do you like the most?",
    "What is your favourite color?",
]

embeddings = []

for sentence in sentences:
    query_response = query(model_predictor, sentence)
    embedding = parse_response(query_response)
    embeddings.append(embedding)

embeddings = normalize(np.array(embeddings), axis=1)  # normalization before inner product
plot_similarity_heatmap(sentences, embeddings, 90)

6. Clean up the endpoint

[ ]:
# Delete the SageMaker model and endpoint
model_predictor.delete_model()
model_predictor.delete_endpoint()