
Retrieval-Augmented Generation: Question Answering using Llama-2, Pinecone & a Custom Dataset

In this notebook we demonstrate how to use Llama-2-7b to answer questions using a library of documents as a reference, by means of document embeddings and retrieval. The embeddings are generated with the MiniLM embedding model and stored in, and retrieved from, a Pinecone vector database. Access to a Pinecone environment is a prerequisite to run this notebook fully.

You can start by using the Free Tier on Pinecone. This notebook serves as a template, so you can easily replace the example dataset with your own to build a custom question answering application.

To perform inference on the Llama models, you need to pass custom_attributes='accept_eula=true' as part of the request header. This means you have read and accepted the end-user license agreement (EULA) of the model. The EULA can be found in the model card description or on this webpage. By default, this notebook sets custom_attributes='accept_eula=false', so all inference requests will fail until you explicitly change this custom attribute.

Note: The custom_attributes used to pass the EULA are key/value pairs. The key and value are separated by '=' and pairs are separated by ';'. If the user passes the same key more than once, the last value is kept and passed to the script handler (i.e., in this case, used for conditional logic). For example, if 'accept_eula=false; accept_eula=true' is passed to the server, then 'accept_eula=true' is kept and passed to the script handler.
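
To make this last-value-wins behavior concrete, here is a small sketch of how such a header string could be parsed. This is only an illustration, not the actual SageMaker script handler:

# Illustrative parser for a custom_attributes header string (not SageMaker's actual handler)
def parse_custom_attributes(header: str) -> dict:
    attrs = {}
    for pair in header.split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = value.strip()  # a repeated key overwrites the earlier value
    return attrs

parse_custom_attributes("accept_eula=false; accept_eula=true")
# {'accept_eula': 'true'}  (only the last value for the repeated key is kept)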

Step 1. Deploy the Llama-2 7B Chat Model in SageMaker JumpStart

[18]:
!pip install -qU \
    sagemaker \
    pinecone-client==2.2.1 \
    ipywidgets==7.0.0

To begin, we will initialize all of the SageMaker session variables we’ll need to use throughout the walkthrough.

[24]:
import sagemaker
from sagemaker.jumpstart.model import JumpStartModel
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()

my_model = JumpStartModel(model_id="meta-textgeneration-llama-2-7b-f")

We will use an ml.g5.4xlarge instance to deploy our Llama-2 7B model. You can find pricing for all instance types on the Amazon SageMaker pricing page.

[21]:
predictor = my_model.deploy(
    initial_instance_count=1, instance_type="ml.g5.4xlarge", endpoint_name="llama-2-generator")
---------------!

Step 2. Ask the LLM a question without providing any context

To better illustrate why we need a retrieval-augmented generation (RAG) based approach to solve the question answering problem, let's first ask the model a question directly and see how it responds.

[22]:
question = "Which instances can I use with Managed Spot Training in SageMaker?"
[104]:
# https://aws.amazon.com/blogs/machine-learning/llama-2-foundation-models-from-meta-are-now-available-in-amazon-sagemaker-jumpstart/

prompt = """Answer the following QUESTION based on the CONTEXT
given. If you do not know the answer and the CONTEXT doesn't
contain the answer truthfully say "I don't know".

ANSWER:

"""


payload = {
    "inputs":
      [
        [
         {"role": "system", "content": prompt},
         {"role": "user", "content": question},
        ]
      ],
   "parameters":{"max_new_tokens": 64, "top_p": 0.9, "temperature": 0.6, "return_full_text": False}
}

out = predictor.predict(payload, custom_attributes='accept_eula=true')
out[0]['generation']['content']
[104]:
' Based on the context provided, Managed Spot Training in SageMaker allows you to use the following instances:\n\n* m5.xlarge\n* m5.2xlarge\n* m5.4xlarge\n* m5.8xlarge\n* m5.16x'

You can see that the generated answer is wrong: without any context, the model simply hallucinates a list of instance types.

Step 3. Improve the answer to the same question using prompt engineering with insightful context

To answer the question well, we provide extra contextual information, combine it with a prompt, and send it to the model together with the question. Below is an example.

[76]:
context = """Managed Spot Training can be used with all instances
supported in Amazon SageMaker. Managed Spot Training is supported
in all AWS Regions where Amazon SageMaker is currently available."""
[105]:
prompt_template = """Answer the following QUESTION based on the CONTEXT
given. If you do not know the answer and the CONTEXT doesn't
contain the answer truthfully say "I don't know".

CONTEXT:
{context}


ANSWER:
"""

text_input = prompt_template.replace("{context}", context).replace("{question}", question)

payload = {
    "inputs":
      [
        [
         {"role": "system", "content": text_input},
         {"role": "user", "content": question},
        ]
      ],
   "parameters":{"max_new_tokens": 64, "top_p": 0.9, "temperature": 0.6, "return_full_text": False}
}

out = predictor.predict(payload, custom_attributes='accept_eula=true')
generated_text = out[0]['generation']['content']
print(f"[Input]: {question}\n[Output]: {generated_text}")
[Input]: Which instances can I use with Managed Spot Training in SageMaker?
[Output]:  Based on the given context, you can use Managed Spot Training with all instances supported in Amazon SageMaker. Therefore, the answer is:

All instances supported in Amazon SageMaker.

Let’s see if our LLM is capable of following our instructions…

[82]:
unanswerable_question = "What color is my desk?"

text_input = prompt_template.replace("{context}", context).replace("{question}", question)

payload = {
    "inputs":
      [
        [
         {"role": "system", "content": text_input},
         {"role": "user", "content": unanswerable_question},
        ]
      ],
   "parameters":{"max_new_tokens":256, "top_p":0.9, "temperature":0.6}
}


out = predictor.predict(payload, custom_attributes='accept_eula=true')
generated_text = out[0]['generation']['content']
print(f"[Input]: {unanswerable_question}\n[Output]: {generated_text}")
[Input]: What color is my desk?
[Output]:  I don't know the answer to your question about the color of your desk as it is not related to the context provided, which is about Amazon SageMaker and its supported instances and regions.

Looks great! The LLM is following instructions and we’ve also demonstrated how contexts can help our LLM answer questions accurately. However, we’re unlikely to be inserting a context directly into a prompt like this unless we already know the answer — and if we already know the answer why would we be asking the question at all?

We need a way of extracting relevant contexts from huge bases of information. For that we need Retrieval Augmented Generation (RAG).

Step 4. Use a RAG-based approach to identify the relevant documents, and use them along with the prompt and question to query the LLM

We plan to use document embeddings to fetch the most relevant documents from our knowledge library and combine them with the prompt that we provide to the LLM.

To achieve that, we will do the following (a short end-to-end sketch of the pipeline appears after the note below):

  • Generate embeddings for each document in the knowledge library with the MiniLM embedding model.

  • Identify top K most relevant documents based on user query.

    • For a query of your interest, generate the embedding of the query using the same embedding model.

    • Search for the indexes of the top K most relevant documents in the embedding space by querying the Pinecone index.

    • Use the indexes to retrieve the corresponding documents.

  • Combine the retrieved documents with the prompt and question and send them to the LLM.

Note: The retrieved document/text should be large enough to contain sufficient information to answer a question, but small enough to fit into the LLM prompt (maximum sequence length of 1024 tokens); in this notebook the combined context is additionally capped at 1,000 characters in the construct_context helper below.
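
Putting the steps above together, the full query flow can be sketched as follows. This is only an outline for orientation: the helpers embed_docs, construct_context, and create_payload, the Pinecone index, and the predictor endpoint are all built step by step in the remainder of this notebook.

# Sketch of the end-to-end RAG query flow assembled in the sections below
def answer_with_rag(question: str, top_k: int = 5) -> str:
    # 1. Embed the question with the same MiniLM endpoint used for the documents
    query_vec = embed_docs([question])[0]
    # 2. Retrieve the top-K most similar documents from the Pinecone index
    res = index.query(query_vec, top_k=top_k, include_metadata=True)
    contexts = [match.metadata["text"] for match in res.matches]
    # 3. Pack as many retrieved documents as fit into the context budget
    context_str = construct_context(contexts=contexts)
    # 4. Combine the context, prompt template, and question, then query Llama-2
    payload = create_payload(question, context_str)
    out = predictor.predict(payload, custom_attributes="accept_eula=true")
    return out[0]["generation"]["content"]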

4.1 Deploy the model endpoint for the Sentence Transformer embedding model

[25]:
hub_config = {
    "HF_MODEL_ID": "sentence-transformers/all-MiniLM-L6-v2",  # model_id from hf.co/models
    "HF_TASK": "feature-extraction",
}

huggingface_model = HuggingFaceModel(
    env=hub_config,
    role=role,
    transformers_version="4.6",  # transformers version used
    pytorch_version="1.7",  # pytorch version used
    py_version="py36",  # python version of the DLC
)

Then we deploy the model as we did earlier for our generative LLM:

[26]:
encoder = huggingface_model.deploy(
    initial_instance_count=1, instance_type="ml.t2.large", endpoint_name="minilm-embedding"
)
----!

We can then create the embeddings like so:

[27]:
out = encoder.predict({"inputs": ["some text here", "some more text goes here too"]})

We will see that we have two outputs (one for each of our input sentences):

[12]:
len(out)
[12]:
2

But if we look at each of these outputs we see something strange…

[13]:
len(out[0]), len(out[1])
[13]:
(8, 8)

We would expect the embeddings to have a dimensionality of 384, yet we're seeing two lists containing eight items each. What is happening here?

When we request feature embeddings from the MiniLM model we actually get back a single 384-dimensional vector for every token in each input. Our second text "some more text goes here too" is tokenized into eight tokens (including the special [CLS] and [SEP] tokens), and the shorter first input is padded to the same length within the batch, which is why both outputs contain eight token vectors.
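
If you want to verify the token counts for yourself, you can run the same tokenizer locally. This assumes the transformers library is installed in your notebook environment; it is not required for the rest of the walkthrough:

from transformers import AutoTokenizer

# Tokenize the two example inputs with the MiniLM tokenizer
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
for text in ["some text here", "some more text goes here too"]:
    n_tokens = len(tokenizer(text)["input_ids"])
    print(f"{text!r}: {n_tokens} tokens (including the special [CLS] and [SEP] tokens)")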

So, if we were to take a look at one of these vectors we should find the dimensionality of 384:

[14]:
len(out[0][0])
[14]:
384

Perfect! There's just one problem: how do we transform these eight token-level vectors into a single sentence embedding? For this, we simply take the mean across the token axis, like so:

[28]:
import numpy as np

embeddings = np.mean(np.array(out), axis=1)
embeddings.shape
[28]:
(2, 384)

Now we have two 384-dimensional vector embeddings, one for each of our input texts. To make our lives easier later, we will wrap this encoding process into a single function:

[29]:
from typing import List


def embed_docs(docs: List[str]) -> List[List[float]]:
    out = encoder.predict({"inputs": docs})
    embeddings = np.mean(np.array(out), axis=1)
    return embeddings.tolist()

4.2 Generate embeddings for each document in the knowledge library with the Sentence Transformer model

For the purpose of the demo we will use the Amazon SageMaker FAQs as the knowledge library. The data are formatted in a CSV file with two columns, Question and Answer. We use only the Answer column as the documents of the knowledge library, from which relevant documents are retrieved based on a query.

Each row in the CSV dataset corresponds to a textual document. We will iterate over each document to get its embedding vector via the MiniLM embedding model. For your own use case, you can replace the example dataset with your own to build a custom question answering application.

First, we download the dataset from our S3 bucket to the local directory.

[10]:
s3_path = "s3://jumpstart-cache-prod-us-east-2/training-datasets/Amazon_SageMaker_FAQs/Amazon_SageMaker_FAQs.csv"
[11]:
# Download the dataset from S3
!aws s3 cp $s3_path Amazon_SageMaker_FAQs.csv
download: s3://jumpstart-cache-prod-us-east-2/training-datasets/Amazon_SageMaker_FAQs/Amazon_SageMaker_FAQs.csv to ./Amazon_SageMaker_FAQs.csv

Open the dataset with Pandas:

[12]:
import pandas as pd

df_knowledge = pd.read_csv("Amazon_SageMaker_FAQs.csv", header=None, names=["Question", "Answer"])
df_knowledge.head()
[12]:
Question Answer
0 What is Amazon SageMaker? Amazon SageMaker is a fully managed service to...
1 In which Regions is Amazon SageMaker available... For a list of the supported Amazon SageMaker A...
2 What is the service availability of Amazon Sag... Amazon SageMaker is designed for high availabi...
3 How does Amazon SageMaker secure my code? Amazon SageMaker stores code in ML storage vol...
4 What security measures does Amazon SageMaker h... Amazon SageMaker ensures that ML model artifac...

Drop the Question column since it is not used in this notebook.

[13]:
df_knowledge.drop(["Question"], axis=1, inplace=True)
df_knowledge.head()
[13]:
Answer
0 Amazon SageMaker is a fully managed service to...
1 For a list of the supported Amazon SageMaker A...
2 Amazon SageMaker is designed for high availabi...
3 Amazon SageMaker stores code in ML storage vol...
4 Amazon SageMaker ensures that ML model artifac...

Next we can initialize our connection to Pinecone. To do this we need a free API key.

[21]:
import pinecone
import os

# add Pinecone API key from app.pinecone.io
api_key = os.environ.get("PINECONE_API_KEY") or "YOUR_API_KEY"
# set Pinecone environment - find next to API key in console
env = os.environ.get("PINECONE_ENVIRONMENT") or "YOUR_ENV"

pinecone.init(api_key=api_key, environment=env)
/opt/conda/lib/python3.7/site-packages/pinecone/index.py:4: TqdmExperimentalWarning: Using `tqdm.autonotebook.tqdm` in notebook mode. Use `tqdm.tqdm` instead to force console mode (e.g. in jupyter console)
  from tqdm.autonotebook import tqdm

List all indexes currently associated with your API key; this should be empty on the first run.

[15]:
pinecone.list_indexes()
[15]:
['jumpstart-minilm-l6',
 'retrieval-augmentation-aws-6j',
 'retrieval-augmentation-aws']

Now we create a new index called llama-2-7b-example. It's important that we align the index dimension and metric parameters with those required by the MiniLM model.

[30]:
import time

index_name = "llama-2-7b-example"

if index_name in pinecone.list_indexes():
    pinecone.delete_index(index_name)

pinecone.create_index(name=index_name, dimension=embeddings.shape[1], metric="cosine")
# wait for index to finish initialization
while not pinecone.describe_index(index_name).status["ready"]:
    time.sleep(1)
[31]:
pinecone.list_indexes()
[31]:
['jumpstart-minilm-l6',
 'llama-2-7b-example',
 'retrieval-augmentation-aws-6j',
 'retrieval-augmentation-aws']

Now we upsert the data. We will do this in small batches; the batch size is set to 2 below and can be increased if you deploy the embedding endpoint on a larger instance.

[32]:
from tqdm.auto import tqdm

batch_size = 2  # can increase but needs larger instance size otherwise instance runs out of memory
vector_limit = 1000

answers = df_knowledge[:vector_limit]
index = pinecone.Index(index_name)

for i in tqdm(range(0, len(answers), batch_size)):
    # find end of batch
    i_end = min(i + batch_size, len(answers))
    # create IDs batch
    ids = [str(x) for x in range(i, i_end)]
    # create metadata batch
    metadatas = [{"text": text} for text in answers["Answer"][i:i_end]]
    # create embeddings
    texts = answers["Answer"][i:i_end].tolist()
    embeddings = embed_docs(texts)
    # create records list for upsert
    records = zip(ids, embeddings, metadatas)
    # upsert to Pinecone
    index.upsert(vectors=records)
[33]:
# check number of records in the index
index.describe_index_stats()
[33]:
{'dimension': 384,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 154}},
 'total_vector_count': 154}

4.3 Combine the retrieved documents, prompt, and question to query the LLM

Now we're ready to begin querying our LLM with a Retrieval Augmented Generation (RAG) pipeline. Let's see how this will work step by step first.

First we create our query embedding and use it to query Pinecone:

[65]:
# create the query embedding for the question
query_vec = embed_docs([question])[0]

# query pinecone
res = index.query(query_vec, top_k=1, include_metadata=True)

# show the results
res
[65]:
{'matches': [{'id': '90',
              'metadata': {'text': 'Managed Spot Training can be used with all '
                                   'instances supported in Amazon '
                                   'SageMaker.\r\n'},
              'score': 0.881181657,
              'values': []}],
 'namespace': ''}

We get the most relevant context back. We can use retrieved contexts like this to construct a single context string to feed into our LLM prompt.

[66]:
contexts = [match.metadata["text"] for match in res.matches]
[67]:
max_section_len = 1000
separator = "\n"


def construct_context(contexts: List[str]) -> str:
    chosen_sections = []
    chosen_sections_len = 0

    for text in contexts:
        text = text.strip()
        # Add contexts until we run out of space.
        chosen_sections_len += len(text) + 2
        if chosen_sections_len > max_section_len:
            break
        chosen_sections.append(text)
    concatenated_doc = separator.join(chosen_sections)
    print(
        f"With maximum sequence length {max_section_len}, selected top {len(chosen_sections)} document sections: \n{concatenated_doc}"
    )
    return concatenated_doc
[68]:
context_str = construct_context(contexts=contexts)
With maximum sequence length 1000, selected top 1 document sections:
Managed Spot Training can be used with all instances supported in Amazon SageMaker.

We would then feed this context_str into our Llama-2 prompt:

[78]:
def create_payload(question, context_str) -> dict:
    prompt_template = """Answer the following QUESTION based on the CONTEXT
    given. If you do not know the answer and the CONTEXT doesn't
    contain the answer truthfully say "I don't know".

    CONTEXT:
    {context}


    ANSWER:
    """

    text_input = prompt_template.replace("{context}", context_str).replace("{question}", question)

    payload = {
        "inputs":
          [
            [
             {"role": "system", "content": text_input},
             {"role": "user", "content": question},
            ]
          ],
       "parameters":{"max_new_tokens": 256, "top_p": 0.9, "temperature": 0.6, "return_full_text": False}
    }
    return payload
[79]:
payload = create_payload(question, context_str)
out = predictor.predict(payload, custom_attributes='accept_eula=true')
generated_text = out[0]['generation']['content']
print(f"[Input]: {question}\n[Output]: {generated_text}")
[Input]: Which instances can I use with Managed Spot Training in SageMaker?
[Output]:  Based on the context provided, you can use Managed Spot Training with all instances supported in Amazon SageMaker. Therefore, the answer is:

All instances supported in Amazon SageMaker.

Let’s place all of this logic into a single RAG query function:

[80]:
def rag_query(question: str) -> str:
    # create the query vector for the question
    query_vec = embed_docs([question])[0]
    # query pinecone
    res = index.query(query_vec, top_k=5, include_metadata=True)
    # get contexts
    contexts = [match.metadata["text"] for match in res.matches]
    # build the multiple contexts string
    context_str = construct_context(contexts=contexts)
    # create our retrieval augmented prompt
    payload = create_payload(question, context_str)
    # make prediction
    out = predictor.predict(payload, custom_attributes='accept_eula=true')
    return out[0]["generation"]["content"]

We can now ask the question:

[85]:
rag_query("Does SageMaker support spot instances?")
With maximum sequence length 1000, selected top 5 document sections:
Managed Spot Training can be used with all instances supported in Amazon SageMaker.
Managed Spot Training is supported in all AWS Regions where Amazon SageMaker is currently available.
Managed Spot Training with Amazon SageMaker lets you train your ML models using Amazon EC2 Spot instances, while reducing the cost of training your models by up to 90%.
For a list of the supported Amazon SageMaker AWS Regions, please visit the AWS Regional Services page. Also, for more information, see Regional endpoints in the AWS general reference guide.
At launch, we will support all Regions supported by Amazon SageMaker, except the AWS China Regions.
[85]:
' Yes, Amazon SageMaker supports spot instances for managed spot training. According to the provided context, Managed Spot Training can be used with all instances supported in Amazon SageMaker, and Managed Spot Training is supported in all AWS Regions where Amazon SageMaker is currently available.\n\nTherefore, the answer to your question is:\n\nYes, SageMaker supports spot instances in all regions where Amazon SageMaker is available.'

We can also ask questions about things that are out of context (not contained within our dataset). From this we expect the model to not hallucinate and honestly tell us that it does not know the answer:

[87]:
rag_query("Can I deploy a model trained outside of SageMaker?")
With maximum sequence length 1000, selected top 2 document sections:
No. Amazon SageMaker operates the compute infrastructure on your behalf, allowing it to perform health checks, apply security patches, and do other routine maintenance. You can also deploy the model artifacts from training with custom inference code in your own hosting environment.
Amazon SageMaker Data Wrangler provides a unified experience enabling you to prepare data and seamlessly train a machine learning model in Amazon SageMaker Autopilot. SageMaker Autopilot automatically builds, trains, and tunes the best ML models based on your data. With SageMaker Autopilot, you still maintain full control and visibility of your data and model. You can also use features prepared in SageMaker Data Wrangler with your existing models. You can configure Amazon SageMaker Data Wrangler processing jobs to run as part of your SageMaker training pipeline either by configuring the job in the user interface (UI) or exporting a notebook with the orchestration code.
[87]:
' Based on the context provided, the answer is "Yes, you can deploy a model trained outside of Amazon SageMaker."\n\nAccording to the text, Amazon SageMaker Data Wrangler provides a unified experience for preparing data and training machine learning models, including models trained outside of SageMaker. This suggests that SageMaker Autopilot can be used to deploy models trained outside of the platform, as long as they are in a format that can be processed by SageMaker.\n\nAdditionally, the text states that you can configure Amazon SageMaker Data Wrangler processing jobs to run as part of your SageMaker training pipeline, either through the user interface or by exporting a notebook with the orchestration code. This suggests that you can integrate models trained outside of SageMaker into your SageMaker workflows and deploy them using the platform\'s infrastructure.\n\nTherefore, based on the context provided, the answer is "Yes, you can deploy a model trained outside of Amazon SageMaker."'