BYOC LLM Monitoring: Bring Your Own Container Llama2 Monitoring with SageMaker Model Monitor
In this demo notebook, we demonstrate how to use the SageMaker Python SDK to deploy and monitor a JumpStart Llama 2 fine-tuned model for Toxicity levels. The container associated with this notebook employs the FMEval open-source library for LLM evaluation.
To perform inference on these models, you need to pass custom_attributes=‘accept_eula=true’ as part of the request header. Doing so indicates that you have read and accepted the end-user license agreement (EULA) of the model. The EULA can be found in the model card description or at https://ai.meta.com/resources/models-and-libraries/llama-downloads/. By default, this notebook sets custom_attributes=‘accept_eula=false’, so all inference requests will fail until you explicitly change this custom attribute.
Note: Custom_attributes used to pass EULA are key/value pairs. The key and value are separated by ‘=’ and pairs are separated by ‘;’. If the user passes the same key more than once, the last value is kept and passed to the script handler (i.e., in this case, used for conditional logic). For example, if ‘accept_eula=false; accept_eula=true’ is passed to the server, then ‘accept_eula=true’ is kept and passed to the script handler.
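The "last value wins" rule described in the note can be illustrated with a small parser. This is only a sketch of the stated behavior, not the server's actual implementation; `parse_custom_attributes` is a hypothetical helper name introduced here.

```python
def parse_custom_attributes(raw: str) -> dict[str, str]:
    """Parse 'k1=v1; k2=v2' into a dict; a repeated key keeps its last value."""
    attributes: dict[str, str] = {}
    for pair in raw.split(";"):
        pair = pair.strip()
        if not pair:
            continue  # tolerate empty segments such as a trailing ';'
        key, separator, value = pair.partition("=")
        if not separator:
            continue  # skip segments that contain no '='
        # Later assignments overwrite earlier ones, so the last value is kept.
        attributes[key.strip()] = value.strip()
    return attributes


print(parse_custom_attributes("accept_eula=false; accept_eula=true"))
# {'accept_eula': 'true'}
```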
Background
SageMaker Model Monitor allows users to provide their own custom-built container images, which are run for each monitoring job. This notebook leverages the BYOC feature to monitor the Llama2-7b model across 7 different Toxicity categories.
Prerequisites
IF RUNNING LOCALLY (not SageMaker Studio/Classic): An IAM role that grants AmazonSageMakerFullAccess. This role must also include the AmazonEC2ContainerRegistryFullAccess permission in order to push the container image to ECR and the CloudWatchFullAccess permission to create CloudWatch Dashboards. By default, the SageMaker Execution Role associated with SageMaker Studio instances does not have these permissions; you must manually attach them. For information on how to complete this, see this documentation
IF RUNNING ON SAGEMAKER STUDIO/STUDIO CLASSIC (not locally): An IAM role that grants AmazonSageMakerFullAccess. This role must also include the AmazonEC2ContainerRegistryFullAccess permission in order to push the container image to ECR and the CloudWatchFullAccess permission to create CloudWatch Dashboards. By default, the SageMaker Execution Role associated with SageMaker Studio instances does not have these permissions; you must manually attach them. For information on how to complete this, see this documentation. Please also ensure that Docker access is enabled in your domain and that you have installed Docker for this notebook instance. Please follow the guide at the end of this notebook to complete Docker setup.
Setup
This notebook is best suited for a kernel with Python version >= 3.11.
[ ]:
%pip install -r requirements.txt
Retrieve your SageMaker Session and Configure Execution Role
[ ]:
import sagemaker
import boto3

sess = sagemaker.Session()
# SageMaker session bucket -> used for uploading data, models and logs
# SageMaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    sagemaker_session_bucket = sess.default_bucket()

# Retrieve the SageMaker execution role. Its ARN is passed to the model and the monitor later in this notebook.
# If this fails, you can manually specify the role name in the except block.
try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client("iam")
    # Manually specify the role name. Ensure that this role has the 'AmazonSageMakerFullAccess' policy attached. See the linked documentation for help.
    role = iam.get_role(RoleName="<CustomRoleName>")["Role"]["Arn"]

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sess.boto_region_name}")
You can continue with the default model or choose a different model. This notebook will run with the following model IDs:
- meta-textgeneration-llama-2-7b-f
- meta-textgeneration-llama-2-13b-f
- meta-textgeneration-llama-2-70b-f
[ ]:
model_id, model_version = "meta-textgeneration-llama-2-7b-f", "2.*"
Deploy model
You can now deploy the model using SageMaker JumpStart.
Set up DataCapture
[ ]:
bucket = sess.default_bucket()
print("Demo Bucket:", bucket)
[ ]:
from sagemaker.model_monitor import DataCaptureConfig
s3_root_dir = "byoc-monitor-llm"
s3_capture_upload_path = f"s3://{bucket}/{s3_root_dir}/datacapture"
data_capture_config = DataCaptureConfig(
    enable_capture=True, sampling_percentage=100, destination_s3_uri=s3_capture_upload_path
)
[ ]:
print(s3_capture_upload_path)
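Data capture writes JSON Lines files beneath the destination URI, grouped by endpoint, variant, and hour. The sketch below shows how such a prefix is composed, assuming the documented layout `{destination}/{endpoint-name}/{variant-name}/yyyy/mm/dd/hh/` and the default variant name `AllTraffic`. `capture_prefix`, the bucket, and the endpoint name are hypothetical and used only for illustration.

```python
from datetime import datetime, timezone


def capture_prefix(
    base_uri: str,
    endpoint_name: str,
    when: datetime,
    variant_name: str = "AllTraffic",  # assumed default production variant name
) -> str:
    """Compose the S3 prefix where captured records for a given UTC hour are expected."""
    hour_path = when.astimezone(timezone.utc).strftime("%Y/%m/%d/%H")
    return f"{base_uri.rstrip('/')}/{endpoint_name}/{variant_name}/{hour_path}/"


print(
    capture_prefix(
        "s3://my-bucket/byoc-monitor-llm/datacapture",  # placeholder bucket
        "my-endpoint",  # placeholder endpoint name
        datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc),
    )
)
# s3://my-bucket/byoc-monitor-llm/datacapture/my-endpoint/AllTraffic/2024/01/15/09/
```

Listing this prefix (for example with the S3 console) is a quick way to confirm that requests are being captured before scheduling a monitoring job.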
Deploy JumpStart Model
Note: This will take roughly 10 mins
[ ]:
from sagemaker.jumpstart.model import JumpStartModel
model = JumpStartModel(model_id=model_id, model_version=model_version, role=role)
predictor = model.deploy(data_capture_config=data_capture_config)
print(model.endpoint_name)
Invoke the endpoint
Supported Parameters
This model supports the following inference payload parameters:
- max_new_tokens: Model generates text until the output length (excluding the input context length) reaches max_new_tokens. If specified, it must be a positive integer.
- temperature: Controls the randomness in the output. Higher temperature results in output sequence with low-probability words and lower temperature results in output sequence with high-probability words. If temperature -> 0, it results in greedy decoding. If specified, it must be a positive float.
- top_p: In each step of text generation, sample from the smallest possible set of words with cumulative probability top_p. If specified, it must be a float between 0 and 1.
You may specify any subset of the parameters mentioned above while invoking an endpoint.
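The constraints above can be checked client-side before a request is sent. The following is a minimal sketch, not part of the SageMaker SDK: `validate_parameters` is a hypothetical helper, rejecting unknown keys is an assumption, and so is treating the `top_p` range as `0 < top_p <= 1`.

```python
ALLOWED_PARAMETERS = {"max_new_tokens", "temperature", "top_p"}


def _is_number(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_parameters(parameters: dict) -> dict:
    """Raise ValueError if a parameter violates the documented constraints."""
    unknown = set(parameters) - ALLOWED_PARAMETERS
    if unknown:
        raise ValueError(f"Unsupported parameters: {sorted(unknown)}")

    if "max_new_tokens" in parameters:
        value = parameters["max_new_tokens"]
        if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
            raise ValueError("max_new_tokens must be a positive integer")

    if "temperature" in parameters:
        value = parameters["temperature"]
        if not _is_number(value) or value <= 0:
            raise ValueError("temperature must be a positive float")

    if "top_p" in parameters:
        value = parameters["top_p"]
        if not _is_number(value) or not 0 < value <= 1:
            raise ValueError("top_p must be a float between 0 and 1")

    return parameters


validate_parameters({"max_new_tokens": 512, "top_p": 0.9, "temperature": 0.6})
```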
Notes
- If max_new_tokens is not defined, the model may generate up to the maximum total tokens allowed, which is 4K for these models. This may result in endpoint query timeout errors, so it is recommended to set max_new_tokens when possible. For 7B, 13B, and 70B models, we recommend setting max_new_tokens no greater than 1500, 1000, and 500 respectively, while keeping the total number of tokens less than 4K.
- In order to support a 4k context length, this model has restricted query payloads to only utilize a batch size of 1. Payloads with larger batch sizes will receive an endpoint error prior to inference.
- This model only supports ‘system’, ‘user’ and ‘assistant’ roles, starting with ‘system’, then ‘user’ and alternating (u/a/u/a/u…).
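The role ordering rule can be verified before invoking the endpoint. This is a sketch under two assumptions: the leading ‘system’ message is optional (the single-invocation example in this notebook omits it), and a dialog must end on a ‘user’ turn so that the model has something to answer. `validate_dialog` is a hypothetical helper.

```python
def validate_dialog(dialog: list[dict]) -> None:
    """Raise ValueError unless roles follow [system,] user, assistant, user, ..."""
    if not dialog:
        raise ValueError("Dialog must contain at least one message")

    # An optional system prompt may appear only as the first message.
    turns = dialog[1:] if dialog[0]["role"] == "system" else dialog
    if not turns:
        raise ValueError("Dialog must contain at least one user message")

    for index, message in enumerate(turns):
        expected = "user" if index % 2 == 0 else "assistant"
        if message["role"] != expected:
            raise ValueError(
                f"Expected role '{expected}' at turn {index}, got '{message['role']}'"
            )

    if turns[-1]["role"] != "user":
        raise ValueError("Dialog must end with a user message")


validate_dialog(
    [
        {"role": "system", "content": "Always answer briefly."},
        {"role": "user", "content": "what is the recipe of mayonnaise?"},
    ]
)
```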
[ ]:
def print_dialog(payload, response):
    dialog = payload["inputs"][0]
    for msg in dialog:
        print(f"{msg['role'].capitalize()}: {msg['content']}\n")
    print(
        f">>>> {response[0]['generation']['role'].capitalize()}: {response[0]['generation']['content']}"
    )
    print("\n==================================\n")
Single invocation
NOTE: Read the end-user-license-agreement here https://ai.meta.com/resources/models-and-libraries/llama-downloads/ and accept by setting accept_eula to true, otherwise an error will be raised.
[ ]:
payload = {
    "inputs": [
        [
            {"role": "user", "content": "what is the recipe of mayonnaise?"},
        ]
    ],
    "parameters": {"max_new_tokens": 512, "top_p": 0.9, "temperature": 0.6},
}
try:
    response = predictor.predict(payload, custom_attributes="accept_eula=false")
    print_dialog(payload, response)
except Exception as e:
    print(e)
Send artificial traffic to the endpoint.
The following cell will send 10 queries to the endpoint. Adjust the number of queries to capture as much data as you need.
NOTE: Read the end-user-license-agreement here https://ai.meta.com/resources/models-and-libraries/llama-downloads/ and accept by setting accept_eula to true
[ ]:
import json

line_count = 0
with open("./data/questions.jsonl", "r") as datafile:
    for line in datafile:
        if line_count == 10:
            break
        line_count += 1
        data = json.loads(line)
        payload = {
            "inputs": [
                [
                    data,
                ]
            ],
            "parameters": {"max_new_tokens": 512, "top_p": 0.9, "temperature": 0.6},
        }
        try:
            response = predictor.predict(payload, custom_attributes="accept_eula=false")
            print_dialog(payload, response)
        except Exception as e:
            print(e)
Build and Push the Image to ECR
[ ]:
ecr_repo_name = "byoc-llm"
aws_region = sess.boto_region_name
aws_account_id = sess.account_id()
Build the image. This will take some time.
[ ]:
!set -Eeuxo pipefail
!docker build -t "{ecr_repo_name}" . --network sagemaker
Create the repository. Ensure the role you have assumed has the AmazonEC2ContainerRegistryFullAccess permission attached.
[ ]:
ecr = boto3.client("ecr")
try:
    response = ecr.create_repository(
        repositoryName=ecr_repo_name,
        imageTagMutability="MUTABLE",
        imageScanningConfiguration={"scanOnPush": False},
    )
except ecr.exceptions.RepositoryAlreadyExistsException:
    print(f"Repository {ecr_repo_name} already exists. Skipping creation.")
Push the image to ECR. This will take some time, as we are pushing a ~9GB image. Ensure that your AWS credentials are fresh.
[ ]:
!LATEST_IMAGE_ID=$(docker images --filter=reference='{ecr_repo_name}:latest' --format "{{.ID}}" | head -n 1)
!echo $LATEST_IMAGE_ID
!aws ecr get-login-password --region '{aws_region}' | docker login --username AWS --password-stdin '{aws_account_id}'.dkr.ecr.'{aws_region}'.amazonaws.com
!docker tag '{ecr_repo_name}':latest '{aws_account_id}'.dkr.ecr.'{aws_region}'.amazonaws.com/'{ecr_repo_name}':latest
!echo 'Pushing to ECR Repo: ''{aws_account_id}'.dkr.ecr.'{aws_region}'.amazonaws.com/'{ecr_repo_name}':latest
!docker push '{aws_account_id}'.dkr.ecr.'{aws_region}'.amazonaws.com/'{ecr_repo_name}':latest
Set a Monitoring Schedule
[ ]:
from sagemaker.model_monitor import ModelMonitor
image_uri = f"{aws_account_id}.dkr.ecr.{aws_region}.amazonaws.com/{ecr_repo_name}:latest"
bucket = sess.default_bucket()
monitor = ModelMonitor(
    base_job_name="byoc-llm-monitor",
    role=role,
    image_uri=image_uri,
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    env={"bucket": bucket},
)
Note: The following cell sets a one-time monitoring schedule for demonstration purposes. A one-time monitoring schedule will execute immediately. If you would like to set an hourly schedule, swap out the commented line. It is important to know that hourly schedules will only begin at the start of the next full hour, so you will not see immediate results.
[ ]:
from sagemaker.model_monitor import CronExpressionGenerator, MonitoringOutput, EndpointInput

# Do not change
container_data_destination = "/opt/ml/processing/input_data"
container_evaluation_source = "/opt/ml/processing/output"
s3_report_upload_path = f"s3://{bucket}/{s3_root_dir}/results"

endpoint_input = EndpointInput(
    endpoint_name=predictor.endpoint_name,
    destination=container_data_destination,
)

monitor.create_monitoring_schedule(
    endpoint_input=endpoint_input,
    output=MonitoringOutput(source=container_evaluation_source, destination=s3_report_upload_path),
    schedule_cron_expression=CronExpressionGenerator.now(),  # CronExpressionGenerator.hourly()
    # data sampling is from 3hrs prior to execution to time of execution
    data_analysis_start_time="-PT3H",
    data_analysis_end_time="-PT0H",
)
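To make the timing concrete, the sketch below computes when an hourly schedule would first become eligible to run and which data window the `-PT3H` / `-PT0H` offsets select relative to that execution time. It is plain date arithmetic, not a call into SageMaker: `next_full_hour`, `parse_hour_offset`, and `analysis_window` are hypothetical helpers, the parser handles only whole-hour offsets of the form `-PT<n>H`, and any scheduling delay added by the service is ignored.

```python
import re
from datetime import datetime, timedelta, timezone


def next_full_hour(now: datetime) -> datetime:
    """Return the start of the hour following `now`."""
    return now.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)


def parse_hour_offset(offset: str) -> timedelta:
    """Convert an offset such as '-PT3H' into a negative timedelta."""
    match = re.fullmatch(r"-PT(\d+)H", offset)
    if match is None:
        raise ValueError(f"Unsupported offset: {offset!r}")
    return -timedelta(hours=int(match.group(1)))


def analysis_window(
    execution_time: datetime, start_offset: str = "-PT3H", end_offset: str = "-PT0H"
) -> tuple[datetime, datetime]:
    """Return the (start, end) of the captured data analyzed by one execution."""
    return (
        execution_time + parse_hour_offset(start_offset),
        execution_time + parse_hour_offset(end_offset),
    )


now = datetime(2024, 1, 15, 14, 20, tzinfo=timezone.utc)
first_run = next_full_hour(now)
print(first_run.isoformat())  # 2024-01-15T15:00:00+00:00
start, end = analysis_window(first_run)
print(start.isoformat(), end.isoformat())
# 2024-01-15T12:00:00+00:00 2024-01-15T15:00:00+00:00
```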
View Results
The following cell prints the output report stored in Amazon S3. It includes evaluations for at most 100 samples of the captured data.
NOTE: The report will show up once the job is finished. Please try again in a few minutes.
[ ]:
from sagemaker import s3
try:
    execution_output = monitor.list_executions()[-1].output
    s3_path_to_report = f"{execution_output.destination}/toxicity_custom_dataset.jsonl"
    print(s3.S3Downloader.read_file(s3_path_to_report))
except Exception:
    # The report does not exist until the first monitoring execution has finished.
    print("Report not found. Please wait and try again.")
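Because the report is a JSON Lines file, it can be summarized with a few lines of Python once downloaded. The sketch below averages each toxicity score across records. The record layout is an assumption: it presumes every line carries a `scores` list of `{"name": ..., "value": ...}` objects, so inspect your own report and adjust the field names if they differ. `mean_scores` and the sample data are hypothetical.

```python
import json
from collections import defaultdict


def mean_scores(report_text: str) -> dict[str, float]:
    """Average each named score over all records in a JSON Lines report."""
    totals: dict[str, float] = defaultdict(float)
    counts: dict[str, int] = defaultdict(int)
    for line in report_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        # Assumed layout: {"scores": [{"name": "toxicity", "value": 0.01}, ...]}
        for score in record.get("scores", []):
            totals[score["name"]] += score["value"]
            counts[score["name"]] += 1
    return {name: totals[name] / counts[name] for name in totals}


# Made-up sample records, for illustration only.
sample_report = "\n".join(
    [
        json.dumps({"scores": [{"name": "toxicity", "value": 0.25}, {"name": "insult", "value": 0.5}]}),
        json.dumps({"scores": [{"name": "toxicity", "value": 0.75}, {"name": "insult", "value": 0.0}]}),
    ]
)
print(mean_scores(sample_report))
```

In the notebook, the string returned by `s3.S3Downloader.read_file(s3_path_to_report)` could be passed in place of `sample_report`.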
The following cell will generate a CloudWatch Dashboard for viewing the evaluation results from the monitoring schedule you ran. For more information on dashboard formatting, see here
[ ]:
cwClient = boto3.client("cloudwatch")
monitoring_schedule_name = monitor.describe_schedule()["MonitoringScheduleName"]
endpoint_name = monitor.describe_schedule()["EndpointName"]

# Get the metrics for this monitoring schedule
metric_list = cwClient.list_metrics(
    Dimensions=[
        {"Name": "Endpoint", "Value": endpoint_name},
        {"Name": "MonitoringSchedule", "Value": monitoring_schedule_name},
    ],
)
metric_names = [metric["MetricName"] for metric in metric_list["Metrics"]]
[ ]:
linear_interpolate_metric = [
    {
        "expression": "FILL(METRICS(), LINEAR)",
        "label": "Linear Interpolated",
        "id": "e1",
        "region": sess.boto_region_name,
    }
]
metrics = [linear_interpolate_metric]
for i, metric_name in enumerate(metric_names):
    metrics.append(
        [
            "aws/sagemaker/Endpoints/data-metrics",
            metric_name,
            "Endpoint",
            endpoint_name,
            "MonitoringSchedule",
            monitoring_schedule_name,
            {"id": f"m{i+1}", "region": sess.boto_region_name, "visible": False},
        ]
    )

widget_title = "LLM Evaluation Graph"

dash_data = json.dumps(
    {
        "start": "-PT6H",
        "periodOverride": "inherit",
        "widgets": [
            {
                "type": "metric",
                "x": 0,
                "y": 0,
                "width": 13,
                "height": 10,
                "properties": {
                    "metrics": metrics,
                    "view": "timeSeries",
                    "stacked": False,
                    "region": sess.boto_region_name,
                    "stat": "Average",
                    "period": 300,
                    "title": widget_title,
                },
            },
            {
                "type": "text",
                "x": 13,
                "y": 0,
                "width": 11,
                "height": 11,
                "properties": {
                    "markdown": "# LLM Evaluation Descriptions\n## Toxicity\nToxicity is measured in 7 different categories:\n- `toxicity`\n- `severe_toxicity`\n- `obscene`\n- `threat`\n- `insult`\n- `identity_attack`\n- `sexual_explicit`\n\nEach score is a number between 0 and 1, with 1 denoting extreme toxicity. To obtain the toxicity scores, the FMEval library uses the open-source [Detoxify](https://github.com/unitaryai/detoxify) model to grade each LLM output."
                },
            },
        ],
    }
)

dashboard_name = "byoc-llm-monitoring"
cwClient.put_dashboard(DashboardName=dashboard_name, DashboardBody=dash_data)
Click the link from the following cell output to view the created CloudWatch Dashboard
[ ]:
from IPython.display import display, Markdown
display(
    Markdown(
        f"[CloudWatch Dashboard](https://{aws_region}.console.aws.amazon.com/cloudwatch/home?region={aws_region}#dashboards/dashboard/{dashboard_name})"
    )
)
[ ]:
import time

# Delete the monitoring schedule
name = monitor.monitoring_schedule_name
monitor.delete_monitoring_schedule()

# Wait until the monitoring schedule has been deleted before deleting the endpoint
while True:
    monitoring_schedules = sess.list_monitoring_schedules()
    if any(
        schedule["MonitoringScheduleName"] == name
        for schedule in monitoring_schedules["MonitoringScheduleSummaries"]
    ):
        time.sleep(5)
    else:
        print("Monitoring schedule deleted")
        break

sess.delete_endpoint(endpoint_name=predictor.endpoint_name)  # delete model endpoint
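The polling loop above runs until the schedule disappears, with no upper bound. If you prefer a bounded wait, a generic helper along the following lines could wrap the same check. This is a sketch: `wait_until` is a hypothetical helper, and the timeout and polling interval are arbitrary defaults.

```python
import time
from typing import Callable


def wait_until(
    condition: Callable[[], bool],
    timeout_seconds: float = 300.0,
    poll_seconds: float = 5.0,
) -> bool:
    """Poll `condition` until it returns True; return False if the timeout elapses first."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)


# Stand-in condition for illustration; it succeeds on the third poll.
attempts = {"count": 0}


def schedule_is_gone() -> bool:
    attempts["count"] += 1
    return attempts["count"] >= 3


print(wait_until(schedule_is_gone, timeout_seconds=5.0, poll_seconds=0.01))  # True
```

In the cleanup cell, the condition would be a function that returns True once `name` no longer appears in `sess.list_monitoring_schedules()`.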
SageMaker Studio Docker Guide
To set up Docker in your SageMaker Studio environment, follow these steps:
1. Run the following command in the AWS CLI, inputting your region and SageMaker domain ID:
aws --region <region> \
    sagemaker update-domain --domain-id <domain-id> \
    --domain-settings-for-update '{"DockerSettings": {"EnableDockerAccess": "ENABLED"}}'
2. Open a new notebook instance. Only instances created after running this command will have Docker access.
3. Open the terminal in this new instance and follow the installation directions.