Introduction to SageMaker JumpStart - Text Generation with Falcon models

Below is the content of the notebook.

Instruction fine-tuning
Domain adaptation fine-tuning

Install latest SageMaker and dependencies.

[ ]:

!pip install sagemaker --quiet --upgrade --force-reinstall
!pip install ipywidgets==7.0.0 --quiet

1. Instruction fine-tuning

Now, we demonstrate how to instruction-tune huggingface-llm-falcon-7b-instruct-bf16 model for a new task. As mentioned in Section 1.3 About the models, Falcon-7b-instruct/Falcon-40B-instruct models are instruction base falcon models fine-tuned on a mixture of chat and instruction datasets.

In this task, given a piece of context, the model is asked to generate questions that are relevant to the text, but ``cannot`` be answered based on provided information. Examples are given in the inference section of this notebook.

[ ]:

model_id, model_version = "huggingface-llm-falcon-7b-instruct-bf16", "*"

1.1. Preparing training data

We will use a subset of SQuAD2.0 for supervised fine-tuning. This dataset contains questions posed by human annotators on a set of Wikipedia articles. In addition to questions with answers, SQuAD2.0 contains about 50k unanswerable questions. Such questions are plausible, but cannot be directly answered from the articles’ content. We only use unanswerable questions for our task.

Citation: @article{rajpurkar2018know, title={Know what you don’t know: Unanswerable questions for SQuAD}, author={Rajpurkar, Pranav and Jia, Robin and Liang, Percy}, journal={arXiv preprint arXiv:1806.03822}, year={2018} }

License: Creative Commons Attribution-ShareAlike License (CC BY-SA 4.0)

[ ]:

import boto3
import sagemaker
import json

# Get current region, role, and default bucket
aws_region = boto3.Session().region_name
aws_role = sagemaker.session.Session().get_caller_identity_arn()
output_bucket = sagemaker.Session().default_bucket()
default_bucket_prefix = sagemaker.Session().default_bucket_prefix

# This will be useful for printing
newline, bold, unbold = "\n", "\033[1m", "\033[0m"

print(f"{bold}aws_region:{unbold} {aws_region}")
print(f"{bold}aws_role:{unbold} {aws_role}")
print(f"{bold}output_bucket:{unbold} {output_bucket}")

[ ]:

from sagemaker.s3 import S3Downloader

# We will use the train split of SQuAD2.0
original_data_file = "train-v2.0.json"

# The data was mirrored in the following bucket
original_data_location = (
    f"s3://sagemaker-example-files-prod-{aws_region}/datasets/text/squad2.0/{original_data_file}"
)
S3Downloader.download(original_data_location, ".")

The training data must be formatted in JSON lines (.jsonl) format, where each line is a dictionary representing a single data sample. All training data must be in a single folder, however it can be saved in multiple jsonl files. The .jsonl file extension is mandatory. The training folder can also contain a template.json file describing the input and output formats.

If no template file is given, the following default template will be used:

{
    "prompt": "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Input:\n{context}`,
    "completion": "{response}",
}

In this case, the data in the JSON lines entries must include instruction, context, and response fields.

Different from using the default prompt template, in this demo we are going to use a custom template (see below).

[ ]:

template = {
    "prompt": "Ask a question which is related to the following text, but cannot be answered based on the text. Text: {context}",
    "completion": "{question}",
}

with open("template.json", "w") as f:
    json.dump(template, f)

Next, we are going to reformat the SQuAD 2.0 dataset. The processed data is saved as task-data.jsonl file. Given the prompt template defined in above cell, each entry in the task-data.jsonl file include ``context`` and ``question`` fields. For demonstration purpose, we limit the number of training examples to be 2000.

[ ]:

local_data_file = "task-data.jsonl"  # any name with .jsonl extension

with open(original_data_file) as f:
    data = json.load(f)


def preprocess_data(local_data_file, data, num_maximum_example):
    num_example_idx = 0
    with open(local_data_file, "w") as f:
        for article in data["data"]:
            for paragraph in article["paragraphs"]:
                # iterate over questions for a given paragraph
                for qas in paragraph["qas"]:
                    if qas["is_impossible"]:
                        # the question is relevant, but cannot be answered
                        example = {"context": paragraph["context"], "question": qas["question"]}
                        json.dump(example, f)
                        f.write("\n")
                        num_example_idx += 1
                        if num_example_idx >= num_maximum_example:
                            return


preprocess_data(local_data_file=local_data_file, data=data, num_maximum_example=10000)

Upload the prompt template (template.json) and training data (task-data.jsonl) into S3 bucket.

[ ]:

from sagemaker.s3 import S3Uploader

# If a default bucket prefix is specified, append it to the s3 path
if default_bucket_prefix:
    training_dataset_s3_path = f"s3://{output_bucket}/{default_bucket_prefix}/train_data324242"
else:
    training_dataset_s3_path = f"s3://{output_bucket}/train_data324242"

S3Uploader.upload(local_data_file, training_dataset_s3_path)
S3Uploader.upload("template.json", training_dataset_s3_path)
print(f"{bold}training data:{unbold} {training_dataset_s3_path}")

1.2. Prepare training parameters

[ ]:

from sagemaker import hyperparameters

my_hyperparameters = hyperparameters.retrieve_default(
    model_id=model_id, model_version=model_version
)
print(my_hyperparameters)

Overwrite the hyperparameters

[ ]:

my_hyperparameters["epoch"] = "2"
my_hyperparameters["per_device_train_batch_size"] = "2"
my_hyperparameters["gradient_accumulation_steps"] = "2"
my_hyperparameters["instruction_tuned"] = "True"
print(my_hyperparameters)

Validate hyperparameters

[ ]:

hyperparameters.validate(
    model_id=model_id, model_version=model_version, hyperparameters=my_hyperparameters
)

1.3. Starting training

Note. The parameter load_best_model_at_end (Whether or not to load the best model found during training at the end of training. When this option is enabled, the best checkpoint will always be saved) is set as “True” by default. During loading the best model checkpoints at the end of training (HuggingFace will load the best model checkpoints before saving it), there is overhead of memory usage which can lead to Out-Of-Memory error.

If setting load_best_model_at_end, we recommend to use ml.g5.48xlarge; if not, we recommend to use ml.g5.12xlarge.

[ ]:

from sagemaker.jumpstart.estimator import JumpStartEstimator

instruction_tuned_estimator = JumpStartEstimator(
    model_id=model_id,
    hyperparameters=my_hyperparameters,
    instance_type="ml.g5.48xlarge",
)
instruction_tuned_estimator.fit({"train": training_dataset_s3_path}, logs=True)

Extract Training performance metrics. Performance metrics such as training loss and validation accuracy/loss can be accessed through cloudwatch while the training. We can also fetch these metrics and analyze them within the notebook.

[ ]:

from sagemaker import TrainingJobAnalytics

training_job_name = instruction_tuned_estimator.latest_training_job.job_name

df = TrainingJobAnalytics(training_job_name=training_job_name).dataframe()
df.head(10)

1.4. Deploying inference endpoints

[ ]:

instruction_tuned_predictor = instruction_tuned_estimator.deploy()

We also deploy the pre-trained model for comparision in Section 2.5.

[ ]:

%%time
from sagemaker.jumpstart.model import JumpStartModel

my_model = JumpStartModel(model_id=model_id)
predictor = my_model.deploy()

1.5. Running inference queries and compare model performances

We examine three examples as listed in variable test_paragraphs. The prompt as defined in variable prompt asks the model to ask a question based on the context and make sure the question cannot be answered from the context.

We compare the performance of pre-trained Falcon instruct 7b (huggingface-llm-falcon-7b-instruct-bf16) that we deployed in Section 1 and fine-tuned Falcon instruct 7b.

[ ]:

prompt = "Ask a question which is related to the following text, but cannot be answered based on the text. Text: {context}"

# Sources: Wikipedia, AWS Documentation
test_paragraphs = [
    """
Adelaide is the capital city of South Australia, the state's largest city and the fifth-most populous city in Australia. "Adelaide" may refer to either Greater Adelaide (including the Adelaide Hills) or the Adelaide city centre. The demonym Adelaidean is used to denote the city and the residents of Adelaide. The Traditional Owners of the Adelaide region are the Kaurna people. The area of the city centre and surrounding parklands is called Tarndanya in the Kaurna language.
Adelaide is situated on the Adelaide Plains north of the Fleurieu Peninsula, between the Gulf St Vincent in the west and the Mount Lofty Ranges in the east. Its metropolitan area extends 20 km (12 mi) from the coast to the foothills of the Mount Lofty Ranges, and stretches 96 km (60 mi) from Gawler in the north to Sellicks Beach in the south.
""",
    """
Amazon Elastic Block Store (Amazon EBS) provides block level storage volumes for use with EC2 instances. EBS volumes behave like raw, unformatted block devices. You can mount these volumes as devices on your instances. EBS volumes that are attached to an instance are exposed as storage volumes that persist independently from the life of the instance. You can create a file system on top of these volumes, or use them in any way you would use a block device (such as a hard drive). You can dynamically change the configuration of a volume attached to an instance.
We recommend Amazon EBS for data that must be quickly accessible and requires long-term persistence. EBS volumes are particularly well-suited for use as the primary storage for file systems, databases, or for any applications that require fine granular updates and access to raw, unformatted, block-level storage. Amazon EBS is well suited to both database-style applications that rely on random reads and writes, and to throughput-intensive applications that perform long, continuous reads and writes.
""",
    """
Amazon Comprehend uses natural language processing (NLP) to extract insights about the content of documents. It develops insights by recognizing the entities, key phrases, language, sentiments, and other common elements in a document. Use Amazon Comprehend to create new products based on understanding the structure of documents. For example, using Amazon Comprehend you can search social networking feeds for mentions of products or scan an entire document repository for key phrases.
You can access Amazon Comprehend document analysis capabilities using the Amazon Comprehend console or using the Amazon Comprehend APIs. You can run real-time analysis for small workloads or you can start asynchronous analysis jobs for large document sets. You can use the pre-trained models that Amazon Comprehend provides, or you can train your own custom models for classification and entity recognition.
All of the Amazon Comprehend features accept UTF-8 text documents as the input. In addition, custom classification and custom entity recognition accept image files, PDF files, and Word files as input.
Amazon Comprehend can examine and analyze documents in a variety of languages, depending on the specific feature. For more information, see Languages supported in Amazon Comprehend. Amazon Comprehend's Dominant language capability can examine documents and determine the dominant language for a far wider selection of languages.
""",
]

[ ]:

parameters = {
    "max_new_tokens": 50,
    "do_sample": True,
    "top_k": 50,
    "top_p": 0.8,
    "do_sample": True,
    "temperature": 0.01,
}


def query_endpoint_with_json_payload(encoded_json, endpoint_name):
    client = boto3.client("runtime.sagemaker")
    response = client.invoke_endpoint(
        EndpointName=endpoint_name, ContentType="application/json", Body=encoded_json
    )
    return response


def parse_response(query_response):
    model_predictions = json.loads(query_response["Body"].read())
    return model_predictions[0]["generated_text"]


def generate_question(endpoint_name, text):
    expanded_prompt = prompt.replace("{context}", text)
    payload = {"inputs": expanded_prompt, "parameters": parameters}
    query_response = query_endpoint_with_json_payload(
        json.dumps(payload).encode("utf-8"), endpoint_name=endpoint_name
    )
    generated_texts = parse_response(query_response)
    print(f"Response: {generated_texts}{newline}")

[ ]:

print(f"{bold}Prompt:{unbold} {repr(prompt)}")
for paragraph in test_paragraphs:
    print("-" * 80)
    print(paragraph)
    print("-" * 80)
    print(f"{bold}pre-trained{unbold}")
    generate_question(predictor.endpoint_name, paragraph)
    print(f"{bold}fine-tuned{unbold}")
    generate_question(instruction_tuned_predictor.endpoint_name, paragraph)

The pre-trained model was not specifically trained to generate unanswerable questions. Despite the input prompt, it tends to generate questions that can be answered from the text. The fine-tuned model is generally better at this task, and the improvement is more prominent for larger models

1.6. Clean up the endpoint

[ ]:

# Delete the SageMaker endpoint
predictor.delete_model()
predictor.delete_endpoint()
instruction_tuned_predictor.delete_model()
instruction_tuned_predictor.delete_endpoint()

2. Domain adaptation fine-tuning

We also have domain adaptation fine-tuning enabled for Falcon models. Different from instruction fine-tuning, you do not need prepare instruction-formatted dataset and can directly use unstructured text document which is demonstrated as below. However, the model that is domain-adaptation fine-tuned may not give concise responses as the instruction-tuned model because of less restrictive requirements on training data formats.

In this demonstration, we use falcon text generation model huggingface-llm-falcon-7b-bf16. This is not an instruction-tuned version of Falcon 7B model. You can also conduct domain adaptation finetuning on top of instruction-tuned model like huggingface-llm-falcon-7b-instruct-bf16. However, we generally do not recommend that.

[ ]:

model_id = "huggingface-llm-falcon-7b-bf16"

We will use financial text from SEC filings to fine tune huggingface-llm-falcon-7b-bf16 for financial applications.

Here are the requirements for train and validation data.

Input: A train and an optional validation directory. Each directory contains a CSV/JSON/TXT file.
- For CSV/JSON files, the train or validation data is used from the column called ‘text’ or the first column if no column called ‘text’ is found.
- The number of files under train and validation (if provided) should equal to one.
Output: A trained model that can be deployed for inference.

Below is an example of a TXT file for fine-tuning the Text Generation model. The TXT file is SEC filings of Amazon from year 2021 to 2022.

```

SEC filings data of Amazon is downloaded from publicly available EDGAR. Instruction of accessing the data is shown here.

2.1. Preparing training data

The training data of SEC filing of Amazon has been pre-saved in the S3 bucket.

[ ]:

from sagemaker.jumpstart.utils import get_jumpstart_content_bucket

# Sample training data is available in this bucket
data_bucket = get_jumpstart_content_bucket(aws_region)
data_prefix = "training-datasets/sec_data"

training_dataset_s3_path = f"s3://{data_bucket}/{data_prefix}/train/"
validation_dataset_s3_path = f"s3://{data_bucket}/{data_prefix}/validation/"

2.2. Prepare training parameters

[ ]:

from sagemaker import hyperparameters

my_hyperparameters = hyperparameters.retrieve_default(
    model_id=model_id, model_version=model_version
)

my_hyperparameters["epoch"] = "3"
my_hyperparameters["per_device_train_batch_size"] = "2"
my_hyperparameters["instruction_tuned"] = "False"
print(my_hyperparameters)

Validate hyperparameters

[ ]:

hyperparameters.validate(
    model_id=model_id, model_version=model_version, hyperparameters=my_hyperparameters
)

2.3. Starting training

[ ]:

from sagemaker.jumpstart.estimator import JumpStartEstimator

domain_adaptation_estimator = JumpStartEstimator(
    model_id=model_id,
    hyperparameters=my_hyperparameters,
    instance_type="ml.p3dn.24xlarge",
)
domain_adaptation_estimator.fit(
    {"train": training_dataset_s3_path, "validation": validation_dataset_s3_path}, logs=True
)

Extract Training performance metrics. Performance metrics such as training loss and validation accuracy/loss can be accessed through cloudwatch while the training. We can also fetch these metrics and analyze them within the notebook

[ ]:

from sagemaker import TrainingJobAnalytics

training_job_name = domain_adaptation_estimator.latest_training_job.job_name

df = TrainingJobAnalytics(training_job_name=training_job_name).dataframe()
df.head(10)

2.4. Deploying inference endpoints

We deploy the domain-adaptation fine-tuned and pretrained models separately, and compare their performances.

We firstly deploy the domain-adaptation fine-tuned model.

[ ]:

domain_adaptation_predictor = domain_adaptation_estimator.deploy()

Next, we deploy the pre-trained huggingface-llm-falcon-7b-bf16.

[ ]:

my_model = JumpStartModel(model_id=model_id)
pretrained_predictor = my_model.deploy()

2.5. Running inference queries and compare model performances

[ ]:

parameters = {
    "max_new_tokens": 300,
    "top_k": 50,
    "top_p": 0.8,
    "do_sample": True,
    "temperature": 1,
}


def generate_response(endpoint_name, text):
    payload = {"inputs": f"{text}:", "parameters": parameters}
    query_response = query_endpoint_with_json_payload(
        json.dumps(payload).encode("utf-8"), endpoint_name=endpoint_name
    )
    generated_texts = parse_response(query_response)
    print(f"Response: {generated_texts}{newline}")

[ ]:

test_paragraph_domain_adaption = [
    "This Form 10-K report shows that",
    "We serve consumers through",
    "Our vision is",
]


for paragraph in test_paragraph_domain_adaption:
    print("-" * 80)
    print(paragraph)
    print("-" * 80)
    print(f"{bold}pre-trained{unbold}")
    generate_response(pretrained_predictor.endpoint_name, paragraph)
    print(f"{bold}fine-tuned{unbold}")
    generate_response(domain_adaptation_predictor.endpoint_name, paragraph)

As you can, the fine-tuned model starts to generate responses that are more specific to the domain of fine-tuning data which is relating to SEC report of Amazon.

2.6. Clean up the endpoint

[ ]:

# Delete the SageMaker endpoint
pretrained_predictor.delete_model()
pretrained_predictor.delete_endpoint()
domain_adaptation_predictor.delete_model()
domain_adaptation_predictor.delete_endpoint()