SageMaker JumpStart Foundation Models - HuggingFace Text2Text Generation Batch Transform and Real-Time Batch Inference


This notebook’s CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.



Note: This notebook was tested on an ml.t3.medium instance in Amazon SageMaker Studio with the Python 3 (Data Science) kernel, and in an Amazon SageMaker notebook instance with the conda_python3 kernel on Python 3.10.10.

  1. Set Up

  2. Select a pre-trained model

  3. Retrieve Artifacts for Model

  4. Specify Batch Transform Job HyperParameters

  5. Prepare Data for Batch Transform

  6. Run Batch Transform Job

  7. Computing the ROUGE Score

  8. Real-Time Batch Inference

  9. Conclusion

1. Set Up

[ ]:
!pip install datasets==2.12.0 --quiet
!pip install evaluate==0.4.0 --quiet
!pip install ipywidgets==8.0.6 --quiet
!pip install rouge_score==0.1.2 --quiet
!pip install sagemaker==2.165.0 --quiet
[ ]:
from platform import python_version

tested_version = "3.10."

version = python_version()
print(f"You are using Python {version}")

if not version.startswith(tested_version):
    print(f"This notebook was tested with {tested_version}")
    print("Some parts might behave unexpectedly with a different Python version")

Permissions and environment variables

[ ]:
import sagemaker, boto3, json
from sagemaker.session import Session

sagemaker_session = Session()
aws_role = sagemaker_session.get_caller_identity_arn()
aws_region = boto3.Session().region_name
sess = sagemaker.Session()

2. Select a pre-trained model


You can continue with the default model, or choose a different model from the dropdown generated upon running the next cell. A complete list of SageMaker pre-trained models can also be accessed at SageMaker pre-trained Models.

[ ]:
model_id = "huggingface-text2text-flan-t5-large"
model_version = "1.*"

[Optional] Select a different SageMaker pre-trained model. Here, we download the model_manifest file from the Built-In Algorithms S3 bucket, filter out all the Text2Text Generation models, and select one for inference.

[ ]:
from ipywidgets import widgets
from sagemaker.jumpstart.notebook_utils import list_jumpstart_models

# Retrieve all Text2Text Generation models made available by SageMaker Built-In Algorithms.
filter_value = "task == text2text"
text2text_generation_models = list_jumpstart_models(filter=filter_value)

# Display the model IDs in a dropdown so you can select a model for inference.
model_dropdown = widgets.Dropdown(
    options=text2text_generation_models,
    value=model_id,
    description="Select a model",
    style={"description_width": "initial"},
    layout={"width": "max-content"},
)

Choose a model for Inference

[ ]:
display(model_dropdown)
[ ]:
# model_version="*" fetches the latest version of the model
model_id, model_version = model_dropdown.value, "1.*"

3. Retrieve Artifacts for Model


Using SageMaker, we can perform inference on the pre-trained model, even without fine-tuning it first on a new dataset. We start by retrieving the deploy_image_uri and model_uri for the pre-trained model.


[ ]:
from sagemaker import image_uris, model_uris
from sagemaker.jumpstart.model import JumpStartModel
from sagemaker.predictor import Predictor

inference_instance_type = "ml.p3.8xlarge"

# Note that larger instances, e.g., "ml.g5.12xlarge" might be required for larger models,
# such as huggingface-text2text-flan-t5-xxl or huggingface-text2text-flan-ul2-bf16
# However, at present ml.g5.* instances are not supported in batch transforms.
# Thus, if using such an instance, please skip Sections 6 and 7 of this notebook.

# Retrieve the inference Docker container URI. This is the base HuggingFace container image for the default model above.
deploy_image_uri = image_uris.retrieve(
    region=None,
    framework=None,  # automatically inferred from model_id
    image_scope="inference",
    model_id=model_id,
    model_version=model_version,
    instance_type=inference_instance_type,
)

# Retrieve the model uri.
model_uri = model_uris.retrieve(
    model_id=model_id, model_version=model_version, model_scope="inference"
)

model = JumpStartModel(
    model_id=model_id,
    image_uri=deploy_image_uri,
    model_data=model_uri,
    role=aws_role,
    predictor_cls=Predictor,
)

4. Specify Batch Transform Job HyperParameters


Batch transform jobs support many advanced parameters while performing inference. They include:

  • max_length: Model generates text until the output length (which includes the input context length) reaches max_length. If specified, it must be a positive integer.

  • num_return_sequences: Number of output sequences returned. If specified, it must be a positive integer.

  • num_beams: Number of beams used in beam search. If specified, it must be an integer greater than or equal to num_return_sequences.

  • no_repeat_ngram_size: The model ensures that no word sequence of length no_repeat_ngram_size is repeated in the output sequence. If specified, it must be a positive integer greater than 1.

  • temperature: Controls the randomness in the output. A higher temperature yields output sequences containing more low-probability words, while a lower temperature yields output sequences with more high-probability words. As temperature -> 0, decoding approaches greedy decoding. If specified, it must be a positive float. (The sampling sketch at the end of this section illustrates temperature, top_k, and top_p.)

  • early_stopping: If True, text generation is finished when all beam hypotheses reach the end-of-sentence token. If specified, it must be a boolean.

  • do_sample: If True, sample the next word according to its likelihood. If specified, it must be a boolean.

  • top_k: In each step of text generation, sample from only the top_k most likely words. If specified, it must be a positive integer.

  • top_p: In each step of text generation, sample from the smallest possible set of words with cumulative probability top_p. If specified, it must be a float between 0 and 1.

  • seed: Fix the randomized state for reproducibility. If specified, it must be an integer.

  • batch_size: Batching can speed things up, so it may be useful to tune this parameter; in some cases it can actually be slower, as explained here. We will use a batch_size of 4, but if you get a CUDA out-of-memory error, decrease it accordingly. If specified, it must be a positive integer.

We may specify any subset of the hyperparameters above for a batch transform job. They are passed to the job as environment variables. If any are passed this way, the hyperparameters on the individual examples in the JSONLINES payload are ignored. If you instead want per-example hyperparameters, set hyper_params_dict to None; the hyperparameters from each individual example in the JSONLINES payload are then used, and inference is run on one example at a time.

[ ]:
# Specify the batch job hyperparameters here. If you want each example to use its own hyperparameters, pass hyper_params_dict as None
hyper_params = {"max_length": 30, "top_k": 50, "top_p": 0.95, "do_sample": True}
hyper_params_dict = {"HYPER_PARAMS": str(hyper_params)}
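
To build intuition for what these sampling parameters do, the following optional cell is a minimal illustrative sketch, not part of the batch transform job: it samples a next word from a made-up vocabulary and made-up logits, applying temperature, top_k, and top_p the way the bullets above describe them.

[ ]:
# Illustrative sketch only: the vocabulary and logits below are made up.
import numpy as np

rng = np.random.default_rng(seed=0)  # `seed` fixes the randomized state

vocab = ["pizza", "oven", "dough", "cheese", "sauce"]
logits = np.array([2.0, 1.0, 0.5, 0.2, -1.0])


def sample_next_word(logits, temperature=1.0, top_k=None, top_p=None):
    # Lower temperature sharpens the distribution toward high-probability
    # words; as temperature -> 0 this approaches greedy decoding.
    probs = np.exp(logits / temperature)
    probs /= probs.sum()

    order = np.argsort(probs)[::-1]  # most likely words first
    if top_k is not None:
        order = order[:top_k]  # keep only the top_k most likely words
    if top_p is not None:
        # Keep the smallest set of words whose cumulative probability
        # reaches at least top_p.
        cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
        order = order[:cutoff]

    kept = probs[order] / probs[order].sum()  # renormalize the kept mass
    return vocab[rng.choice(order, p=kept)]


print(sample_next_word(logits, temperature=0.7, top_k=3, top_p=0.95))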

5. Prepare Data for Batch Transform


If you want to specify different parameters for each test input rather than the same for the whole batch, do so here while creating the dataset.

The input format for the batch transform job is a JSON lines file, with one entry per line:

{"id": 1, "text_inputs": "Translate to German: My name is Arthur", "max_length": 50, ...}
{"id": 2, "text_inputs": "Tell me the steps to make a pizza", "max_length": 50, ...}

The output format is:

{"id": 1, "generated_texts": ["Ich bin Arthur"]}
{"id": 2, "generated_texts": ["Preheat oven to 400 degrees F. Spread pizza sauce on a pizza pan. Bake for"]}

[ ]:
# We will use the cnn_dailymail dataset from HuggingFace
from datasets import load_dataset

cnn_test = load_dataset("cnn_dailymail", "3.0.0", split="test")
# Choosing a smaller dataset for demo purposes. You can use the complete dataset as well.
cnn_test = cnn_test.select(list(range(20)))
[ ]:
# We will use a default s3 bucket for providing the input and output paths for batch transform
output_bucket = sess.default_bucket()
output_prefix = "jumpstart-example-text2text-batch-transform"

s3_input_data_path = f"s3://{output_bucket}/{output_prefix}/batch_input/"
s3_output_data_path = f"s3://{output_bucket}/{output_prefix}/batch_output/"
[ ]:
# You can specify a prompt here
prompt = "Briefly summarize this text: "
[ ]:
import json
import boto3
import os

# Provide the test data and the ground truth file name
test_data_file_name = "articles.jsonl"
test_reference_file_name = "highlights.jsonl"

test_articles = []
test_highlights = []

# We will go over each data entry and create the data in the input required format as described above
for i, test_entry in enumerate(cnn_test):
    article = test_entry["article"]
    highlights = test_entry["highlights"]
    # Create a payload like this if you want to have different hyperparameters for each test input
    # payload = {"id": id,"text_inputs": f"{prompt}{article}", "max_length": 100, "temperature": 0.95}
    # Note that if you specify hyperparameter for each payload individually,
    # you may want to ensure that hyper_params_dict is set to None instead
    payload = {"id": i, "text_inputs": f"{prompt}{article}"}
    test_articles.append(payload)
    test_highlights.append({"id": i, "highlights": highlights})

with open(test_data_file_name, "w") as outfile:
    for entry in test_articles:
        outfile.write(f"{json.dumps(entry)}\n")

with open(test_reference_file_name, "w") as outfile:
    for entry in test_highlights:
        outfile.write(f"{json.dumps(entry)}\n")

# Uploading the data
s3 = boto3.client("s3")
s3.upload_file(test_data_file_name, output_bucket, f"{output_prefix}/batch_input/articles.jsonl")

6. Run Batch Transform Job

When you start a batch transform job, Amazon SageMaker automatically provisions and manages the compute resources required to process the data, including CPU or GPU instances (depending on the selected instance type), storage, and networking. Once the batch transform job has completed, these resources are automatically cleaned up by Amazon SageMaker: the instances and storage used during the job are terminated and removed, freeing up resources and minimizing costs.

[ ]:
# Creating the batch transformer object. If you have a large dataset you can
# divide it into smaller chunks and use more instances for faster inference
batch_transformer = model.transformer(
    instance_count=1,
    instance_type=inference_instance_type,
    output_path=s3_output_data_path,
    assemble_with="Line",
    accept="text/csv",
    max_payload=1,
)
batch_transformer.env = hyper_params_dict

# Making the predictions on the input data
batch_transformer.transform(
    s3_input_data_path, content_type="application/jsonlines", split_type="Line"
)

batch_transformer.wait()

7. Computing the ROUGE Score

ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, is a set of metrics and a software package for evaluating automatic summarization and machine translation in natural language processing. The metrics compare an automatically produced summary or translation against a reference (or a set of references) produced by humans.
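
To make the metric concrete, the following is a minimal illustrative sketch of ROUGE-1 on a made-up prediction/reference pair: unigram precision is the fraction of predicted words that appear in the reference, recall is the fraction of reference words recovered, and the reported score is their F1. The same pair is then cross-checked with the evaluate package used below.

[ ]:
# Illustrative sketch only: the prediction/reference pair is made up.
prediction = "the cat sat on the mat"
reference = "the cat lay on the mat"

pred_tokens = prediction.split()
ref_tokens = reference.split()

# Clipped unigram overlap between prediction and reference.
overlap = sum(min(pred_tokens.count(w), ref_tokens.count(w)) for w in set(pred_tokens))

precision = overlap / len(pred_tokens)  # fraction of predicted words found in the reference
recall = overlap / len(ref_tokens)  # fraction of reference words recovered
f1 = 2 * precision * recall / (precision + recall)
print(f"ROUGE-1: precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")

# Cross-check with the evaluate package (used on the real predictions below).
import evaluate

rouge = evaluate.load("rouge")
print(rouge.compute(predictions=[prediction], references=[reference]))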

[ ]:
import ast
import evaluate
import pandas as pd

# Downloading the predictions
s3.download_file(
    output_bucket, output_prefix + "/batch_output/" + "articles.jsonl.out", "predict.jsonl"
)

with open("predict.jsonl", "r") as json_file:
    json_list = list(json_file)

# Creating the prediction list for the dataframe
predict_dict_list = []
for predict in json_list:
    if len(predict) > 1:
        predict_dict = ast.literal_eval(predict)
        predict_dict_req = {
            "id": predict_dict["id"],
            "prediction": predict_dict["generated_texts"][0],
        }
        predict_dict_list.append(predict_dict_req)

# Creating the predictions dataframe
predict_df = pd.DataFrame(predict_dict_list)

test_highlights_df = pd.DataFrame(test_highlights)
[ ]:
# Combine the predictions dataframe with the reference summaries on id to compute the ROUGE score
df_merge = test_highlights_df.merge(predict_df, on="id", how="left")

rouge = evaluate.load("rouge")
results = rouge.compute(
    predictions=list(df_merge["prediction"]), references=list(df_merge["highlights"])
)
print(results)

# Delete the SageMaker model
batch_transformer.delete_model()

8. Real-Time Batch Inference


We can also run real-time batch inference against an endpoint by providing the inputs as a list. Real-time batch inference is useful in situations where you need to process a continuous stream of data in near real time, but it is not feasible to process each data point individually due to time or resource constraints. Instead, you process the data in small batches, which lets you take advantage of parallel processing while still maintaining low latency. A sketch of chunking a larger input list into batches follows the inference code in Section 8.2.


8.1 Deploying the Model

[ ]:
from sagemaker.utils import name_from_base

endpoint_name = name_from_base(f"jumpstart-example-{model_id}")
# Deploy the Model. Note that we need to pass the Predictor class when we deploy the model through the Model class,
# so that we can run inference through the SageMaker API.
model_predictor = model.deploy(
    initial_instance_count=1,
    instance_type=inference_instance_type,
    endpoint_name=endpoint_name,
)

8.2 Running Inference on the Model

[ ]:
# Provide all the text inputs to the model as a list
text_inputs = [entry["text_inputs"] for entry in test_articles[:10]]
[ ]:
# The information about the different parameters is provided above
payload = {
    "text_inputs": text_inputs,
    "max_length": 30,
    "num_return_sequences": 1,
    "top_k": 50,
    "top_p": 0.95,
    "do_sample": True,
}


def query_endpoint_with_json_payload(encoded_json, endpoint_name):
    client = boto3.client("runtime.sagemaker")
    response = client.invoke_endpoint(
        EndpointName=endpoint_name, ContentType="application/json", Body=encoded_json
    )
    return response


def parse_response_multiple_texts(query_response):
    model_predictions = json.loads(query_response["Body"].read())
    return model_predictions


query_response = query_endpoint_with_json_payload(
    json.dumps(payload).encode("utf-8"), endpoint_name=endpoint_name
)
generated_text_list = parse_response_multiple_texts(query_response)
print(*generated_text_list, sep="\n")
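
If you have more inputs than fit comfortably in a single request, a simple pattern is to sweep the list through the endpoint in fixed-size chunks, reusing the helper functions above. The cell below is an illustrative sketch; the chunk size of 5 is an arbitrary choice, not a service limit.

[ ]:
# Illustrative sketch only: query the endpoint chunk by chunk.
all_inputs = [entry["text_inputs"] for entry in test_articles]
chunk_size = 5  # arbitrary; tune to your payload-size and latency needs

chunk_responses = []
for start in range(0, len(all_inputs), chunk_size):
    chunk_payload = {**payload, "text_inputs": all_inputs[start : start + chunk_size]}
    response = query_endpoint_with_json_payload(
        json.dumps(chunk_payload).encode("utf-8"), endpoint_name=endpoint_name
    )
    # Collect each chunk's parsed response (same structure as above).
    chunk_responses.append(parse_response_multiple_texts(response))

print(f"Processed {len(all_inputs)} inputs in {len(chunk_responses)} chunks")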
[ ]:
# Delete the SageMaker model and endpoint
model_predictor.delete_model()
model_predictor.delete_endpoint()

9. Conclusion


In this notebook, we ran both batch transform and real-time batch inference, and used the ROUGE score to compare the reference summaries in the test data with the model-generated summaries. We found that batch transform is advantageous for obtaining inferences from large datasets without requiring a persistent endpoint, and linking input records with inferences aids in interpreting the results. Real-time batch inference, on the other hand, is beneficial for achieving high throughput.

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2, which is shown at the top of the notebook.

[CI badges: us-east-1, us-east-2, us-west-1, ca-central-1, sa-east-1, eu-west-1, eu-west-2, eu-west-3, eu-central-1, eu-north-1, ap-southeast-1, ap-southeast-2, ap-northeast-1, ap-northeast-2, ap-south-1]
