Amazon SageMaker Clarify Model Bias Monitor for Batch Transform - JSON Lines Format




Runtime

This notebook takes approximately 60 minutes to run.

Introduction

Amazon SageMaker Model Monitor continuously monitors the quality of Amazon SageMaker machine learning models in production. It enables developers to set alerts for when there are deviations in the model quality. Early and proactive detection of these deviations enables corrective actions, such as retraining models, auditing upstream systems, or fixing data quality issues, without having to monitor models manually or build additional tooling.

Amazon SageMaker Clarify Model Bias Monitor is a model monitor that helps data scientists and ML engineers monitor predictions for bias on a regular basis. Bias can be introduced or exacerbated in deployed ML models when the training data differs from the data that the model sees during deployment (that is, the live data). These kinds of changes in the live data distribution might be temporary (for example, due to some short-lived, real-world events) or permanent. In either case, it might be important to detect these changes. For example, the outputs of a model for predicting home prices can become biased if the mortgage rates used to train the model differ from current, real-world mortgage rates. With bias drift detection capabilities in model monitor, when SageMaker detects bias beyond a certain threshold, it automatically generates metrics that you can view in SageMaker Studio and through Amazon CloudWatch alerts.

This notebook demonstrates the process for setting up a model monitor for continuous monitoring of bias drift of the data and model used by a regularly running SageMaker Batch Transform job. The model input and output are in SageMaker JSON Lines dense format.

In general, you can use the model bias monitor for batch transform in this way:

  1. Schedule a model bias monitor to monitor a data capture S3 location and a ground truth S3 location

  2. Regularly run transform jobs with data capture enabled; the jobs save the captured data to the data capture S3 URI

  3. Regularly label the captured data, and then upload the ground truth labels to the ground truth S3 URI

The monitor regularly executes processing jobs that merge the captured data with the ground truth data, run bias analysis on the merged data, and then generate analysis reports and publish metrics to CloudWatch.

General Setup

The notebook uses the SageMaker Python SDK. The following cell upgrades the SDK and its dependencies. If the notebook is executed in SageMaker Studio, you may then need to restart the kernel and rerun the notebook to pick up the up-to-date APIs.

[ ]:
!pip install -U sagemaker
!pip install -U boto3
!pip install -U botocore

Imports

The following cell imports the APIs to be used by the notebook.

[2]:
import sagemaker
import pandas as pd
import datetime
import json
import os
import pprint
import random
import time

Handful of configuration

To begin, ensure that these prerequisites have been completed.

  • Specify an AWS Region to host the model.

  • Specify an IAM role to execute jobs.

  • Define the S3 URIs that store the model file, the input data, and the output data. For demonstration purposes, this notebook uses the same bucket for all of them. In practice, they could be separate locations with different security policies.

[3]:
sagemaker_session = sagemaker.Session()

region = sagemaker_session.boto_region_name
print(f"AWS region: {region}")

role = sagemaker.get_execution_role()
print(f"RoleArn: {role}")

# A different bucket can be used, but make sure the role for this notebook has
# the s3:PutObject permissions. This is the bucket into which the data is captured
bucket = sagemaker_session.default_bucket()
print(f"Demo Bucket: {bucket}")
prefix = sagemaker.utils.unique_name_from_base("sagemaker/DEMO-ClarifyModelMonitor")
print(f"Demo Prefix: {prefix}")
s3_key = f"s3://{bucket}/{prefix}"
print(f"Demo S3 key: {s3_key}")

data_capture_s3_uri = f"{s3_key}/data-capture"
ground_truth_s3_uri = f"{s3_key}/ground-truth"
transform_output_s3_uri = f"{s3_key}/transform-output"
baselining_output_s3_uri = f"{s3_key}/baselining-output"
monitor_output_s3_uri = f"{s3_key}/monitor-output"

print(f"The transform job will save the results to: {transform_output_s3_uri}")
print(f"The transform job will save the captured data to: {data_capture_s3_uri}")
print(f"You should upload the ground truth data to: {ground_truth_s3_uri}")
print(f"The baselining job will save the analysis results to: {baselining_output_s3_uri}")
print(f"The monitor will save the analysis results to: {monitor_output_s3_uri}")
AWS region: us-west-2
RoleArn: arn:aws:iam::000000000000:role/service-role/AmazonSageMaker-ExecutionRole-20200714T163791
Demo Bucket: sagemaker-us-west-2-000000000000
Demo Prefix: sagemaker/DEMO-ClarifyModelMonitor-1674264462-cc75
Demo S3 key: s3://sagemaker-us-west-2-000000000000/sagemaker/DEMO-ClarifyModelMonitor-1674264462-cc75
The transform job will save the results to: s3://sagemaker-us-west-2-000000000000/sagemaker/DEMO-ClarifyModelMonitor-1674264462-cc75/transform-output
The transform job will save the captured data to: s3://sagemaker-us-west-2-000000000000/sagemaker/DEMO-ClarifyModelMonitor-1674264462-cc75/data-capture
You should upload the ground truth data to: s3://sagemaker-us-west-2-000000000000/sagemaker/DEMO-ClarifyModelMonitor-1674264462-cc75/ground-truth
The baselining job will save the analysis results to: s3://sagemaker-us-west-2-000000000000/sagemaker/DEMO-ClarifyModelMonitor-1674264462-cc75/baselining-output
The monitor will save the analysis results to: s3://sagemaker-us-west-2-000000000000/sagemaker/DEMO-ClarifyModelMonitor-1674264462-cc75/monitor-output

Data files

This example includes two dataset files, both in the JSON Lines format.

[4]:
train_dataset_path = "test_data/validation-dataset.jsonl"
test_dataset_path = "test_data/test-dataset.jsonl"
dataset_type = "application/jsonlines"

The train dataset has the features and the ground truth label (pointed to by the key “label”):

[5]:
!head -n 5 $train_dataset_path
{"features":[41,2,220531,14,15,2,9,0,4,1,0,0,60,38],"label":1}
{"features":[33,2,35378,9,13,2,11,5,4,0,0,0,45,38],"label":1}
{"features":[36,2,223433,12,14,2,11,0,4,1,7688,0,50,38],"label":1}
{"features":[40,2,220589,7,12,4,0,1,4,0,0,0,40,38],"label":0}
{"features":[30,2,231413,15,10,2,2,0,4,1,0,0,40,38],"label":1}

The test dataset only has features.

[6]:
!head -n 5 $test_dataset_path
{"features":[28,2,133937,9,13,2,0,0,4,1,15024,0,55,37]}
{"features":[43,2,72338,12,14,2,12,0,1,1,0,0,40,37]}
{"features":[34,2,162604,11,9,4,2,2,2,1,0,0,40,37]}
{"features":[20,2,258509,11,9,4,6,3,2,1,0,0,40,37]}
{"features":[27,2,446947,9,13,4,0,4,2,0,0,0,55,37]}

Here are the headers of the train dataset. “Target” is the header of the ground truth label, and the others are the feature headers. They will be used to beautify the analysis report.

[7]:
all_headers = [
    "Age",
    "Workclass",
    "fnlwgt",
    "Education",
    "Education-Num",
    "Marital Status",
    "Occupation",
    "Relationship",
    "Ethnic group",
    "Sex",
    "Capital Gain",
    "Capital Loss",
    "Hours per week",
    "Country",
    "Target",
]

To verify that the execution role for this notebook has the necessary permissions to proceed, put a simple test object into the S3 bucket specified above. If this command fails, update the role to have s3:PutObject permission on the bucket and try again.

[8]:
sagemaker.s3.S3Uploader.upload_string_as_file_body(
    body="hello",
    desired_s3_uri=f"{s3_key}/upload-test-file.txt",
    sagemaker_session=sagemaker_session,
)
print("Success! We are all set to proceed with uploading to S3.")
Success! We are all set to proceed with uploading to S3.

Then upload the data files to S3 so that they can be used by SageMaker.

[9]:
train_data_s3_uri = sagemaker.s3.S3Uploader.upload(
    local_path=train_dataset_path,
    desired_s3_uri=s3_key,
    sagemaker_session=sagemaker_session,
)
print(f"Train data is uploaded to: {train_data_s3_uri}")
test_data_s3_uri = sagemaker.s3.S3Uploader.upload(
    local_path=test_dataset_path,
    desired_s3_uri=s3_key,
    sagemaker_session=sagemaker_session,
)
print(f"Test data is uploaded to: {test_data_s3_uri}")
Train data is uploaded to: s3://sagemaker-us-west-2-000000000000/sagemaker/DEMO-ClarifyModelMonitor-1674264462-cc75/validation-dataset.jsonl
Test data is uploaded to: s3://sagemaker-us-west-2-000000000000/sagemaker/DEMO-ClarifyModelMonitor-1674264462-cc75/test-dataset.jsonl

SageMaker model

This example includes a prebuilt SageMaker Linear Learner model trained by a SageMaker Clarify offline processing example notebook. The model supports SageMaker JSON Lines dense format (MIME type "application/jsonlines").

  • The model input can be one or more lines; each line is a JSON object that has a “features” key pointing to a list of feature values concerning demographic characteristics of individuals. For example:

{"features":[28,2,133937,9,13,2,0,0,4,1,15024,0,55,37]}
{"features":[43,2,72338,12,14,2,12,0,1,1,0,0,40,37]}
  • The model output contains predictions of whether a person has a yearly income of more than $50,000. Each prediction is a JSON object that has a “predicted_label” key pointing to the predicted label, and a “score” key pointing to the confidence score. For example:

{"predicted_label":1,"score":0.989977359771728}
{"predicted_label":1,"score":0.504138827323913}
[10]:
model_file = "model/ll-adult-prediction-model.tar.gz"
model_url = sagemaker.s3.S3Uploader.upload(
    local_path=model_file,
    desired_s3_uri=s3_key,
    sagemaker_session=sagemaker_session,
)
print(f"Model file has been uploaded to {model_url}")

model_name = sagemaker.utils.unique_name_from_base("DEMO-xgb-churn-pred-model-monitor")
print(f"SageMaker model name: {model_name}")

image_uri = sagemaker.image_uris.retrieve("linear-learner", region, "1")
print(f"SageMaker Linear Learner image: {image_uri}")

model = sagemaker.model.Model(image_uri=image_uri, model_data=model_url, role=role)
container_def = model.prepare_container_def()
sagemaker_session.create_model(model_name, role, container_def)
print("SageMaker model created")
Model file has been uploaded to s3://sagemaker-us-west-2-000000000000/sagemaker/DEMO-ClarifyModelMonitor-1674264462-cc75/ll-adult-prediction-model.tar.gz
SageMaker model name: DEMO-xgb-churn-pred-model-monitor-1674264462-1d33
SageMaker Linear Learner image: 174872318107.dkr.ecr.us-west-2.amazonaws.com/linear-learner:1
SageMaker model created

Batch Transform Job

For continuous monitoring, batch transform jobs should be executed regularly with the latest data. But for demonstration purposes, the following cell executes the job only once before the monitor is scheduled, so that the first monitoring execution has captured data to process.

See Transformer for the API reference. Highlights:

  • destination_s3_uri is used to specify the data capture S3 URI, which is the key connection between the job and the monitor.

  • join_source must be set to “Input” for the transform output to include the predictions (model output) as well as the features (model input), because the model bias monitor requires both.

  • generate_inference_id must be set to True for the transform output to include a unique ID for each record. The model bias monitor requires both predicted labels and ground truth labels, so it needs the ID to join the captured data with the ground truth data.

NOTE: The following cell takes about 5 minutes to run.

[11]:
transformer = model.transformer(
    instance_count=1,
    instance_type="ml.m5.xlarge",
    accept=dataset_type,  # The transform output data format
    assemble_with="Line",  # JSON Lines records are terminated by new lines
    output_path=transform_output_s3_uri,
)

transformer.transform(
    data=test_data_s3_uri,
    content_type=dataset_type,  # The transform input format
    split_type="Line",  # JSON Lines records are terminated by new lines
    join_source="Input",  # Include model input (features) in transform output
    batch_data_capture_config=sagemaker.inputs.BatchDataCaptureConfig(
        destination_s3_uri=data_capture_s3_uri,
        generate_inference_id=True,  # Inference ID is mandatory to join the captured data and the ground truth data
    ),
    wait=True,  # In the real world you don't have to wait, but for demo purposes we wait for the output
    logs=False,  # You can change it to True to view job logs inline
)
.............................................................!

Captured data

Once the transform job completes, an “output” folder is created under data_capture_s3_uri to hold the captured data files of the transform output. Note that batch transform data capture is unlike endpoint data capture: it does not capture a copy of the data itself, because that would create a tremendous amount of duplication. Instead, it generates manifest files that refer to the transform output S3 location.

Now list the captured data files stored in Amazon S3. There should be different files from different time periods organized based on the hour in which the batch transformation occurred. The format of the Amazon S3 path is:

s3://{data_capture_s3_uri}/output/yyyy/mm/dd/hh/filename.jsonl

[12]:
data_capture_output = f"{data_capture_s3_uri}/output"
captured_data_files = sorted(
    sagemaker.s3.S3Downloader.list(
        s3_uri=data_capture_output,
        sagemaker_session=sagemaker_session,
    )
)
print("Found capture data files:")
print("\n ".join(captured_data_files[-5:]))
Found capture data files:
s3://sagemaker-us-west-2-000000000000/sagemaker/DEMO-ClarifyModelMonitor-1674264462-cc75/data-capture/output/2023/01/21/01/e824d723-e1f2-4986-90fc-0b0ac58ea0fc.json
[13]:
data_capture_output_dict = json.loads(
    sagemaker.s3.S3Downloader.read_file(
        s3_uri=captured_data_files[-1],
        sagemaker_session=sagemaker_session,
    )
)
print(json.dumps(data_capture_output_dict, indent=4))
[
    {
        "prefix": "s3://sagemaker-us-west-2-000000000000/sagemaker/DEMO-ClarifyModelMonitor-1674264462-cc75/transform-output/"
    },
    "test-dataset.jsonl.out"
]

Transform output

The captured data file refers to the transform output .out file.

[14]:
transform_output = os.path.join(data_capture_output_dict[0]["prefix"], data_capture_output_dict[1])
transform_output
[14]:
's3://sagemaker-us-west-2-000000000000/sagemaker/DEMO-ClarifyModelMonitor-1674264462-cc75/transform-output/test-dataset.jsonl.out'

View the content of the transform output file that the captured manifest refers to.

[15]:
transform_output_content = sagemaker.s3.S3Downloader.read_file(
    s3_uri=transform_output,
    sagemaker_session=sagemaker_session,
).splitlines()
print(*transform_output_content[-5:], sep="\n")
{"SageMakerInferenceId":"89180035-821a-41d9-87b2-67a22490de94","SageMakerInferenceTime":"2023-01-21T01:29:04Z","SageMakerOutput":{"predicted_label":0,"score":0.047645717859268},"features":[29,2,178610,15,10,0,7,4,4,0,0,0,21,37]}
{"SageMakerInferenceId":"909383e8-fbe7-4667-8656-176b664409fb","SageMakerInferenceTime":"2023-01-21T01:29:04Z","SageMakerOutput":{"predicted_label":0,"score":0.237941488623619},"features":[49,2,96854,11,9,0,7,4,4,1,0,0,40,37]}
{"SageMakerInferenceId":"e8c6bb5e-e4c7-4ee2-8f09-9136fcf1aaed","SageMakerInferenceTime":"2023-01-21T01:29:04Z","SageMakerOutput":{"predicted_label":0,"score":0.333132773637771},"features":[45,2,293628,15,10,2,9,0,4,1,0,0,50,28]}
{"SageMakerInferenceId":"fba6af0c-478a-45fb-9b7b-15fb96c2f6bb","SageMakerInferenceTime":"2023-01-21T01:29:04Z","SageMakerOutput":{"predicted_label":0,"score":0.321518242359161},"features":[67,2,192995,11,9,6,0,4,4,0,6723,0,40,37]}
{"SageMakerInferenceId":"c8ba099f-06cc-4baf-af60-db8fffa4b58e","SageMakerInferenceTime":"2023-01-21T01:29:04Z","SageMakerOutput":{"predicted_label":0,"score":0.050630431622266},"features":[30,2,235847,9,13,4,7,3,4,0,0,0,24,37]}

The content of a single line is printed below as formatted JSON for easier inspection.

  • The features are captured because the join_source parameter is set to “Input”.

  • The predictions are captured into the "SageMakerOutput" field.

  • The inference ID and inference time (the start time of the transform job) are also captured because the generate_inference_id parameter is set to True.

[16]:
print(json.dumps(json.loads(transform_output_content[-1]), indent=4))
{
    "SageMakerInferenceId": "c8ba099f-06cc-4baf-af60-db8fffa4b58e",
    "SageMakerInferenceTime": "2023-01-21T01:29:04Z",
    "SageMakerOutput": {
        "predicted_label": 0,
        "score": 0.050630431622266
    },
    "features": [
        30,
        2,
        235847,
        9,
        13,
        4,
        7,
        3,
        4,
        0,
        0,
        0,
        24,
        37
    ]
}

Ground Truth Data

Besides captured data, a bias drift monitoring execution also requires ground truth data. In real use cases, you should regularly label the captured data and then upload the ground truth data (labels) to the designated S3 location. For demonstration purposes, this example notebook generates fake ground truth data following this schema, and then uploads it to ground_truth_s3_uri, which is another key input to the monitor. The bias drift monitoring execution will first merge the captured data and the ground truth data, and then run bias analysis on the merged data.

Note that the value of the data field in groundTruthData must be in the same format as the ground truth labels stored in the input dataset.

[17]:
def ground_truth_with_id(inference_id):
    random.seed(inference_id)  # to get consistent results
    label = 1 if random.random() < 0.7 else 0  # randomly generate positive labels 70% of the time
    # format required by the merge job and bias monitoring job
    return {
        "groundTruthData": {
            "data": json.dumps(
                {"label": label}  # Also use the "label" key, the same as in the input dataset.
            ),
            "encoding": "JSON",
        },
        "eventMetadata": {
            "eventId": str(inference_id),
        },
        "eventVersion": "0",
    }


def upload_ground_truth(upload_time, upload_path, inference_ids):
    records = [ground_truth_with_id(inference_id) for inference_id in inference_ids]
    fake_records = [json.dumps(r) for r in records]
    data_to_upload = "\n".join(fake_records)
    target_s3_uri = f"{upload_path}/{upload_time:%Y/%m/%d/%H/%M%S}.jsonl"
    print(f"Uploading {len(fake_records)} records to", target_s3_uri)
    sagemaker.s3.S3Uploader.upload_string_as_file_body(
        body=data_to_upload,
        desired_s3_uri=target_s3_uri,
        sagemaker_session=sagemaker_session,
    )
[18]:
now = datetime.datetime.utcnow()
# Get inference ID from the captured data
inference_ids = [json.loads(record)["SageMakerInferenceId"] for record in transform_output_content]
# Generate data for the last hour, in case the first monitoring execution is in this hour
upload_ground_truth(
    upload_time=now - datetime.timedelta(hours=1),
    upload_path=ground_truth_s3_uri,
    inference_ids=inference_ids,
)
# Generate data for this hour, in case the first monitoring execution will be in the next hour
upload_ground_truth(
    upload_time=now,
    upload_path=ground_truth_s3_uri,
    inference_ids=inference_ids,
)
Uploading 334 records to s3://sagemaker-us-west-2-000000000000/sagemaker/DEMO-ClarifyModelMonitor-1674264462-cc75/ground-truth/2023/01/21/00/3253.jsonl
Uploading 334 records to s3://sagemaker-us-west-2-000000000000/sagemaker/DEMO-ClarifyModelMonitor-1674264462-cc75/ground-truth/2023/01/21/01/3253.jsonl

Model Bias Monitor

Similar to the other monitoring types, the standard procedure for creating a bias drift monitor is to first run a baselining job and then schedule the monitor.

A bias drift monitoring execution starts a merge job that joins the captured data and the ground truth data using the inference ID. Then a SageMaker Clarify bias analysis job is started to compute all the pre-training and post-training bias metrics on the merged data. The max execution time is divided equally between the two jobs; because this notebook schedules an hourly model bias monitor, the max_runtime_in_seconds parameter should not exceed 1800 seconds.

[19]:
model_bias_monitor = sagemaker.model_monitor.ModelBiasMonitor(
    role=role,
    sagemaker_session=sagemaker_session,
    max_runtime_in_seconds=1800,
)

Baselining job

A baselining job runs predictions on the training dataset and suggests constraints. The suggest_baseline() method of ModelBiasMonitor starts a SageMaker Clarify processing job to generate the constraints.

This step is not mandatory, but providing a constraints file to the monitor enables it to generate a violations file.

Configurations

Information about the input data needs to be provided to the processor.

DataConfig stores information about the dataset to be analyzed: for example, the dataset file and its format (like JSON Lines), and where to store the analysis results. Some special things to note about this configuration for a JSON Lines dataset:

  • The parameter value "features" or "label" is NOT a header string. Instead, it is a JMESPath expression (refer to its specification) that is used to locate the features list or the ground truth label in the dataset. In this example notebook they happen to be the same as the keys in the dataset. But if, for example, the dataset has records like the one below, then the features parameter should use the value "data.features.values", and the label parameter should use the value "data.label" (a short runnable sketch follows this list).

    {"data": {"features": {"values": [25, 2, 226802, 1, 7, 4, 6, 3, 2, 1, 0, 0, 40, 37]}, "label": 0}}
    
  • The SageMaker Clarify processing job will load the JSON Lines dataset into a tabular representation for further analysis, and the headers parameter is the list of column names. The label header must be the last one in the headers list, and the order of the feature headers must match the order of the features in a record.
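To make the JMESPath evaluation concrete, here is a minimal sketch, assuming the jmespath package (installed alongside boto3) is available, showing how such expressions would locate the values in the nested record above:

    import jmespath

    # The hypothetical nested record from the bullet above
    record = {"data": {"features": {"values": [25, 2, 226802, 1, 7, 4, 6, 3, 2, 1, 0, 0, 40, 37]}, "label": 0}}
    print(jmespath.search("data.features.values", record))  # the features list
    print(jmespath.search("data.label", record))  # 0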

[20]:
features_jmespath = "features"
ground_truth_label_jmespath = "label"
data_config = sagemaker.clarify.DataConfig(
    s3_data_input_path=train_data_s3_uri,
    s3_output_path=baselining_output_s3_uri,
    features=features_jmespath,
    label=ground_truth_label_jmespath,
    headers=all_headers,
    dataset_type=dataset_type,
)

ModelConfig is the configuration of the model to be used for inference. In order to compute post-training bias metrics, the computation needs to get inferences from the SageMaker model. To accomplish this, the processing job will use the model to create an ephemeral endpoint (also known as a “shadow endpoint”), and will delete it after the computations are completed. One special thing to note about this configuration for the JSON Lines model input and output:

  • content_template is used by the SageMaker Clarify processing job to convert the tabular data into a request payload acceptable to the shadow endpoint. More specifically, the placeholder $features will be replaced by the features list from each record. The request payload of a record from the testing dataset happens to be similar to the record itself, like {"features":[28,2,133937,9,13,2,0,0,4,1,15024,0,55,37]}, because both the dataset and the model input conform to the same format.
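As an illustration only (the actual substitution happens inside the Clarify processing job), string.Template can mimic how the placeholder gets replaced:

    from string import Template

    # One row of the tabular data, as a plain Python list
    row_features = [28, 2, 133937, 9, 13, 2, 0, 0, 4, 1, 15024, 0, 55, 37]
    payload = Template('{"features":$features}').substitute(features=row_features)
    print(payload)  # {"features":[28, 2, 133937, 9, 13, 2, 0, 0, 4, 1, 15024, 0, 55, 37]}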

[21]:
content_template = '{"features":$features}'
model_config = sagemaker.clarify.ModelConfig(
    model_name=model_name,  # The name of the SageMaker model
    instance_type="ml.m5.xlarge",  # The instance type of the shadow endpoint
    instance_count=1,  # The instance count of the shadow endpoint
    content_type=dataset_type,  # The data format of the model input
    accept_type=dataset_type,  # The data format of the model output
    content_template=content_template,
)

ModelPredictedLabelConfig specifies how to extract the predicted label from the model output. The example model returns the predicted label as well as the confidence score, so there are two ways to define this configuration:

  • Set the label parameter to “predicted_label”, which is the JMESPath expression to locate the predicted label in the model output. This is the way used in this example.

  • Alternatively, you can set the probability parameter to “score”, which is the JMESPath expression to locate the confidence score in the model output, and set the probability_threshold parameter to a floating-point number between 0 and 1. The post-training analysis uses it to convert a score to a binary predicted label (0 or 1). The default value is 0.5, which means a probability value > 0.5 indicates predicted label 1. A sketch of this alternative follows the list.
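For reference, that alternative, score-based configuration (not used by the cell below) might look like this:

    # A sketch of the score-based alternative described above
    model_predicted_label_config = sagemaker.clarify.ModelPredictedLabelConfig(
        probability="score",  # JMESPath expression to locate the confidence score
        probability_threshold=0.5,  # a score > 0.5 is converted to predicted label 1
    )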

[22]:
predicted_label_jmespath = "predicted_label"
model_predicted_label_config = sagemaker.clarify.ModelPredictedLabelConfig(
    label=predicted_label_jmespath,
)

BiasConfig is the configuration of the sensitive groups in the dataset. Typically, bias is measured by computing a metric and comparing it across groups.

  • The group of interest is specified using the facet parameters. With the following configuration, the baselining job will check for bias in the model’s predictions with respect to gender and income. Specifically, it is checking if the model is more likely to predict that males have an annual income of over $50,000 compared to females. Although not demonstrated in this example, a bias monitor can measure bias against multiple sensitive attributes, if you provide a list of facets.

  • The group_name parameter is used to form subgroups for the measurement of Conditional Demographic Disparity in Labels (CDDL) and Conditional Demographic Disparity in Predicted Labels (CDDPL) with regard to Simpson’s paradox.

[23]:
bias_config = sagemaker.clarify.BiasConfig(
    label_values_or_threshold=[1],  # the positive outcome is earning >$50,000
    facet_name="Sex",  # the sensitive attribute is the gender
    facet_values_or_threshold=[0],  # the disadvantaged group is female
    group_name="Age",
)

Kick off baselining job

Call the suggest_baseline() method to start the baselining job. The job computes all the pre-training bias metrics and post-training bias metrics.

[24]:
model_bias_monitor.suggest_baseline(
    bias_config=bias_config,
    data_config=data_config,
    model_config=model_config,
    model_predicted_label_config=model_predicted_label_config,
)

Job Name:  baseline-suggestion-job-2023-01-21-01-32-54-274
Inputs:  [{'InputName': 'dataset', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-us-west-2-000000000000/sagemaker/DEMO-ClarifyModelMonitor-1674264462-cc75/validation-dataset.jsonl', 'LocalPath': '/opt/ml/processing/input/data', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'analysis_config', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-us-west-2-000000000000/sagemaker/DEMO-ClarifyModelMonitor-1674264462-cc75/baselining-output/analysis_config.json', 'LocalPath': '/opt/ml/processing/input/config', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'analysis_result', 'AppManaged': False, 'S3Output': {'S3Uri': 's3://sagemaker-us-west-2-000000000000/sagemaker/DEMO-ClarifyModelMonitor-1674264462-cc75/baselining-output', 'LocalPath': '/opt/ml/processing/output', 'S3UploadMode': 'EndOfJob'}}]
[24]:
<sagemaker.processing.ProcessingJob at 0x7fd6c2d66810>

NOTE: The following cell waits until the baselining job is completed (in about 10 minutes). It then inspects the suggested constraints. This step can be skipped, because the monitor to be scheduled will automatically pick up the baselining job name and wait for it before the monitoring execution.

[25]:
model_bias_monitor.latest_baselining_job.wait(logs=False)
print()
model_bias_constraints = model_bias_monitor.suggested_constraints()
print(f"Suggested constraints: {model_bias_constraints.file_s3_uri}")
print(
    sagemaker.s3.S3Downloader.read_file(
        s3_uri=model_bias_constraints.file_s3_uri,
        sagemaker_session=sagemaker_session,
    )
)
...................................................................................................!
Suggested constraints: s3://sagemaker-us-west-2-000000000000/sagemaker/DEMO-ClarifyModelMonitor-1674264462-cc75/baselining-output/analysis.json
{
    "version": "1.0",
    "post_training_bias_metrics": {
        "label": "Target",
        "facets": {
            "Sex": [
                {
                    "value_or_threshold": "0",
                    "metrics": [
                        {
                            "name": "AD",
                            "description": "Accuracy Difference (AD)",
                            "value": -0.15156641604010024
                        },
                        {
                            "name": "CDDPL",
                            "description": "Conditional Demographic Disparity in Predicted Labels (CDDPL)",
                            "value": 0.28176563733194276
                        },
                        {
                            "name": "DAR",
                            "description": "Difference in Acceptance Rates (DAR)",
                            "value": -0.09508196721311479
                        },
                        {
                            "name": "DCA",
                            "description": "Difference in Conditional Acceptance (DCA)",
                            "value": -0.5278688524590163
                        },
                        {
                            "name": "DCR",
                            "description": "Difference in Conditional Rejection (DCR)",
                            "value": 0.027874251497005953
                        },
                        {
                            "name": "DI",
                            "description": "Disparate Impact (DI)",
                            "value": 0.17798594847775176
                        },
                        {
                            "name": "DPPL",
                            "description": "Difference in Positive Proportions in Predicted Labels (DPPL)",
                            "value": 0.2199248120300752
                        },
                        {
                            "name": "DRR",
                            "description": "Difference in Rejection Rates (DRR)",
                            "value": 0.12565868263473046
                        },
                        {
                            "name": "FT",
                            "description": "Flip Test (FT)",
                            "value": -0.03333333333333333
                        },
                        {
                            "name": "GE",
                            "description": "Generalized Entropy (GE)",
                            "value": 0.0841186702174704
                        },
                        {
                            "name": "RD",
                            "description": "Recall Difference (RD)",
                            "value": 0.1308103661044837
                        },
                        {
                            "name": "SD",
                            "description": "Specificity Difference (SD)",
                            "value": 0.10465328014037645
                        },
                        {
                            "name": "TE",
                            "description": "Treatment Equality (TE)",
                            "value": 2.916666666666667
                        }
                    ]
                }
            ]
        },
        "label_value_or_threshold": "1"
    },
    "pre_training_bias_metrics": {
        "label": "Target",
        "facets": {
            "Sex": [
                {
                    "value_or_threshold": "0",
                    "metrics": [
                        {
                            "name": "CDDL",
                            "description": "Conditional Demographic Disparity in Labels (CDDL)",
                            "value": 0.27459074287718793
                        },
                        {
                            "name": "CI",
                            "description": "Class Imbalance (CI)",
                            "value": 0.36936936936936937
                        },
                        {
                            "name": "DPL",
                            "description": "Difference in Positive Proportions in Labels (DPL)",
                            "value": 0.2326441102756892
                        },
                        {
                            "name": "JS",
                            "description": "Jensen-Shannon Divergence (JS)",
                            "value": 0.04508199943437752
                        },
                        {
                            "name": "KL",
                            "description": "Kullback-Liebler Divergence (KL)",
                            "value": 0.22434464102537785
                        },
                        {
                            "name": "KS",
                            "description": "Kolmogorov-Smirnov Distance (KS)",
                            "value": 0.2326441102756892
                        },
                        {
                            "name": "LP",
                            "description": "L-p Norm (LP)",
                            "value": 0.32900845595810163
                        },
                        {
                            "name": "TVD",
                            "description": "Total Variation Distance (TVD)",
                            "value": 0.2326441102756892
                        }
                    ]
                }
            ]
        },
        "label_value_or_threshold": "1"
    }
}

Monitoring Schedule

With the above constraints collected, now call the create_monitoring_schedule() method to schedule an hourly model bias monitor.

If a baselining job has been submitted, the monitor object will automatically pick up the analysis configuration from the baselining job. But if the baselining step is skipped, or if the captured dataset is of a different nature than the training dataset, then the analysis configuration has to be provided.

BiasAnalysisConfig is a subset of the configuration of the baselining job; many options are not needed because:

  • Model bias monitor will merge the captured data and the ground truth data, and then use the merged data as the input dataset.

  • Capture data already includes predictions, so there is no need to create a shadow endpoint.

  • Attributes like probability threshold are provided as part of BatchTransformInput.

Highlights:

  • data_capture_s3_uri is the location of data captured by the batch transform job

  • ground_truth_s3_uri is the location of ground truth data

  • features_attribute is the JMESPath expression to locate the features in model input, similar to the features parameter of DataConfig.

  • inference_attribute is the JMESPath expression to locate the predicted label in model output, similar to the label parameter of ModelPredictedLabelConfig.

[26]:
schedule_expression = sagemaker.model_monitor.CronExpressionGenerator.hourly()
[27]:
model_bias_analysis_config = None
if not model_bias_monitor.latest_baselining_job:
    model_bias_analysis_config = sagemaker.clarify.BiasAnalysisConfig(
        bias_config,
        headers=all_headers,
        label=ground_truth_label_jmespath,
    )
model_bias_monitor.create_monitoring_schedule(
    analysis_config=model_bias_analysis_config,
    batch_transform_input=sagemaker.model_monitor.BatchTransformInput(
        data_captured_destination_s3_uri=data_capture_s3_uri,
        destination="/opt/ml/processing/transform",
        dataset_format=sagemaker.model_monitor.MonitoringDatasetFormat.json(lines=True),
        features_attribute=features_jmespath,  # mandatory if no baselining job
        inference_attribute=predicted_label_jmespath,  # mandatory if no baselining job
        # look back 6 hours for transform job output.
        start_time_offset="-PT6H",
        end_time_offset="-PT0H",
    ),
    ground_truth_input=ground_truth_s3_uri,
    output_s3_uri=monitor_output_s3_uri,
    schedule_cron_expression=schedule_expression,
)
print(f"Model bias monitoring schedule: {model_bias_monitor.monitoring_schedule_name}")
Model bias monitoring schedule: monitoring-schedule-2023-01-21-01-41-12-517

Wait for the first execution

The schedule starts jobs at the previously specified intervals. The code below waits until time crosses the hour boundary (in UTC) to see executions kick off.

Note: Even for an hourly schedule, Amazon SageMaker has a buffer period of 20 minutes to schedule executions. The execution might start anywhere from zero to ~20 minutes after the hour boundary. This is expected and done for load balancing in the backend.

[28]:
def wait_for_execution_to_start(model_monitor):
    print(
        "An hourly schedule was created above and it will kick off executions ON the hour (plus 0 - 20 min buffer)."
    )

    print("Waiting for the first execution to happen", end="")
    schedule_desc = model_monitor.describe_schedule()
    while "LastMonitoringExecutionSummary" not in schedule_desc:
        schedule_desc = model_monitor.describe_schedule()
        print(".", end="", flush=True)
        time.sleep(60)
    print()
    print("Done! Execution has been created")

    print("Now waiting for execution to start", end="")
    while schedule_desc["LastMonitoringExecutionSummary"]["MonitoringExecutionStatus"] == "Pending":
        schedule_desc = model_monitor.describe_schedule()
        print(".", end="", flush=True)
        time.sleep(10)

    print()
    print("Done! Execution has started")

NOTE: The following cell waits until the first monitoring execution is started. As explained above, the wait could take more than 60 minutes.

[29]:
wait_for_execution_to_start(model_bias_monitor)
An hourly schedule was created above and it will kick off executions ON the hour (plus 0 - 20 min buffer).
Waiting for the first execution to happen........................
Done! Execution has been created
Now waiting for execution to start.
Done! Execution has started

In the real world, a monitoring schedule is supposed to be active all the time. But in this example, it can be stopped to avoid incurring extra charges. A stopped schedule will not trigger further executions, but any ongoing execution will continue. If needed, the schedule can be restarted by start_monitoring_schedule().

[30]:
model_bias_monitor.stop_monitoring_schedule()

Stopping Monitoring Schedule with name: monitoring-schedule-2023-01-21-01-41-12-517

Wait for the execution to finish

In the previous cell, the first execution has started. This section waits for the execution to finish so that its analysis results are available. Here are the possible terminal states and what each of them means:

  • Completed - This means the monitoring execution completed, and no issues were found in the violations report.

  • CompletedWithViolations - This means the execution completed, but constraint violations were detected.

  • Failed - The monitoring execution failed, maybe due to a client error (perhaps incorrect role permissions) or infrastructure issues. Further examination of FailureReason and ExitMessage is necessary to identify what exactly happened (see the sketch after this list).

  • Stopped - The job exceeded the max runtime or was manually stopped.
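A minimal sketch of how those diagnostic fields could be inspected after a failure (FailureReason comes from the schedule description, and ExitMessage from the describe output of the underlying processing job; either field may be absent depending on the failure mode):

    schedule_desc = model_bias_monitor.describe_schedule()
    summary = schedule_desc.get("LastMonitoringExecutionSummary", {})
    if summary.get("MonitoringExecutionStatus") == "Failed":
        print("FailureReason:", summary.get("FailureReason"))
        # Each monitoring execution is backed by a processing job
        last_execution = model_bias_monitor.list_executions()[-1]
        print("ExitMessage:", last_execution.describe().get("ExitMessage"))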

[31]:
# Waits for the schedule to have last execution in a terminal status.
def wait_for_execution_to_finish(model_monitor):
    schedule_desc = model_monitor.describe_schedule()
    execution_summary = schedule_desc.get("LastMonitoringExecutionSummary")
    if execution_summary is not None:
        print("Waiting for execution to finish", end="")
        while execution_summary["MonitoringExecutionStatus"] not in [
            "Completed",
            "CompletedWithViolations",
            "Failed",
            "Stopped",
        ]:
            print(".", end="", flush=True)
            time.sleep(60)
            schedule_desc = model_monitor.describe_schedule()
            execution_summary = schedule_desc["LastMonitoringExecutionSummary"]
        print()
        print(f"Done! Execution Status: {execution_summary['MonitoringExecutionStatus']}")
    else:
        print("Last execution not found")

NOTE: The following cell takes about 10 minutes.

[32]:
wait_for_execution_to_finish(model_bias_monitor)
Waiting for execution to finish..........
Done! Execution Status: CompletedWithViolations

Merged data

Merged data is the intermediate result of a bias drift monitoring execution. It is saved to JSON Lines files under the “merge” folder of monitor_output_s3_uri. Each line is a valid JSON object that combines the captured data and the ground truth data.

[33]:
merged_data_s3_uri = f"{monitor_output_s3_uri}/merge"
merged_data_files = sorted(
    sagemaker.s3.S3Downloader.list(
        s3_uri=merged_data_s3_uri,
        sagemaker_session=sagemaker_session,
    )
)
print("Found merged data files:")
print("\n ".join(merged_data_files[-5:]))
Found merged data files:
s3://sagemaker-us-west-2-000000000000/sagemaker/DEMO-ClarifyModelMonitor-1674264462-cc75/monitor-output/merge/monitoring-schedule-2023-01-21-01-41-12-517/2023/01/21/01/part-00000-a50a31f3-8558-450e-a262-81fbb0a28df1.c000.jsonl

The following cell prints a single line of a merged data file.

  • eventId is the inference ID from the captured data and the ground truth data

  • groundTruthData is from the ground truth data

  • captureData is from the captured data. In this case, the data of batchTransformOutput is from the transform output.

[34]:
merged_record = sagemaker.s3.S3Downloader.read_file(
    s3_uri=merged_data_files[-1],
    sagemaker_session=sagemaker_session,
).splitlines()[0]
print(json.dumps(json.loads(merged_record), indent=4))
{
    "eventMetadata": {
        "eventId": "ae4ec948-edc0-4195-be9c-0b15b6ca114f"
    },
    "eventVersion": "0",
    "groundTruthData": {
        "data": "{\"label\": 1}",
        "encoding": "JSON"
    },
    "captureData": {
        "batchTransformOutput": {
            "data": "{\"SageMakerOutput\":{\"predicted_label\":1,\"score\":0.989977359771728},\"features\":[28,2,133937,9,13,2,0,0,4,1,15024,0,55,37]}",
            "encoding": "JSON"
        }
    }
}

Inspect execution results

List the generated reports:

  • analysis.json includes all the bias metrics.

  • report.* files are static report files that visualize the bias metrics.

[35]:
schedule_desc = model_bias_monitor.describe_schedule()
execution_summary = schedule_desc.get("LastMonitoringExecutionSummary")
if execution_summary and execution_summary["MonitoringExecutionStatus"] in [
    "Completed",
    "CompletedWithViolations",
]:
    last_model_bias_monitor_execution = model_bias_monitor.list_executions()[-1]
    last_model_bias_monitor_execution_report_uri = (
        last_model_bias_monitor_execution.output.destination
    )
    print(f"Report URI: {last_model_bias_monitor_execution_report_uri}")
    last_model_bias_monitor_execution_report_files = sorted(
        sagemaker.s3.S3Downloader.list(
            s3_uri=last_model_bias_monitor_execution_report_uri,
            sagemaker_session=sagemaker_session,
        )
    )
    print("Found Report Files:")
    print("\n ".join(last_model_bias_monitor_execution_report_files))
else:
    last_model_bias_monitor_execution = None
    print(
        "====STOP==== \n No completed executions to inspect further. Please wait till an execution completes or investigate previously reported failures."
    )
Report URI: s3://sagemaker-us-west-2-000000000000/sagemaker/DEMO-ClarifyModelMonitor-1674264462-cc75/monitor-output/monitoring-schedule-2023-01-21-01-41-12-517/2023/01/21/02
Found Report Files:
s3://sagemaker-us-west-2-000000000000/sagemaker/DEMO-ClarifyModelMonitor-1674264462-cc75/monitor-output/monitoring-schedule-2023-01-21-01-41-12-517/2023/01/21/02/analysis.json
 s3://sagemaker-us-west-2-000000000000/sagemaker/DEMO-ClarifyModelMonitor-1674264462-cc75/monitor-output/monitoring-schedule-2023-01-21-01-41-12-517/2023/01/21/02/constraint_violations.json
 s3://sagemaker-us-west-2-000000000000/sagemaker/DEMO-ClarifyModelMonitor-1674264462-cc75/monitor-output/monitoring-schedule-2023-01-21-01-41-12-517/2023/01/21/02/report.html
 s3://sagemaker-us-west-2-000000000000/sagemaker/DEMO-ClarifyModelMonitor-1674264462-cc75/monitor-output/monitoring-schedule-2023-01-21-01-41-12-517/2023/01/21/02/report.ipynb
 s3://sagemaker-us-west-2-000000000000/sagemaker/DEMO-ClarifyModelMonitor-1674264462-cc75/monitor-output/monitoring-schedule-2023-01-21-01-41-12-517/2023/01/21/02/report.pdf

If there are any violations compared to the baseline, they are listed here. See Bias Drift Violations for the schema of the file, and how violations are detected.

[36]:
violations = model_bias_monitor.latest_monitoring_constraint_violations()
if violations is not None:
    pprint.PrettyPrinter(indent=4).pprint(violations.body_dict)
{   'version': '1.0',
    'violations': [   {   'constraint_check_type': 'bias_drift_check',
                          'description': 'Metric value 0.374894782529513 '
                                         "doesn't meet the baseline constraint "
                                         'requirement 0.28176563733194276',
                          'facet': 'Sex',
                          'facet_value': '0',
                          'metric_name': 'CDDPL'},
                      {   'constraint_check_type': 'bias_drift_check',
                          'description': 'Metric value -0.26530612244897955 '
                                         "doesn't meet the baseline constraint "
                                         'requirement -0.09508196721311479',
                          'facet': 'Sex',
                          'facet_value': '0',
                          'metric_name': 'DAR'},
                      {   'constraint_check_type': 'bias_drift_check',
                          'description': 'Metric value -36.6530612244898 '
                                         "doesn't meet the baseline constraint "
                                         'requirement -0.5278688524590163',
                          'facet': 'Sex',
                          'facet_value': '0',
                          'metric_name': 'DCA'},
                      {   'constraint_check_type': 'bias_drift_check',
                          'description': 'Metric value -0.06507936507936507 '
                                         "doesn't meet the baseline constraint "
                                         'requirement 0.027874251497005953',
                          'facet': 'Sex',
                          'facet_value': '0',
                          'metric_name': 'DCR'},
                      {   'constraint_check_type': 'bias_drift_check',
                          'description': 'Metric value 0.9027966400080482 '
                                         "doesn't meet the baseline constraint "
                                         'requirement 0.0841186702174704',
                          'facet': 'Sex',
                          'facet_value': '0',
                          'metric_name': 'GE'},
                      {   'constraint_check_type': 'bias_drift_check',
                          'description': 'Metric value 0.19451219512195123 '
                                         "doesn't meet the baseline constraint "
                                         'requirement 0.1308103661044837',
                          'facet': 'Sex',
                          'facet_value': '0',
                          'metric_name': 'RD'},
                      {   'constraint_check_type': 'bias_drift_check',
                          'description': 'Metric value 0.21666666666666667 '
                                         "doesn't meet the baseline constraint "
                                         'requirement 0.10465328014037645',
                          'facet': 'Sex',
                          'facet_value': '0',
                          'metric_name': 'SD'},
                      {   'constraint_check_type': 'bias_drift_check',
                          'description': "Metric value Infinity doesn't meet "
                                         'the baseline constraint requirement '
                                         '2.916666666666667',
                          'facet': 'Sex',
                          'facet_value': '0',
                          'metric_name': 'TE'}]}

By default, the analysis results are also published to CloudWatch; see CloudWatch Metrics for Bias Drift Analysis.
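For a quick look at what was published, a sketch like the following could list the metrics via boto3. The namespace below is an assumption for illustration; consult the documentation linked above for the exact namespace used for bias drift analysis:

    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name=region)
    # NOTE: assumed namespace; replace with the one from the bias drift documentation
    for page in cloudwatch.get_paginator("list_metrics").paginate(
        Namespace="aws/sagemaker/Endpoints/bias-metrics"
    ):
        for metric in page["Metrics"]:
            print(metric["MetricName"], metric["Dimensions"])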

Cleanup

If there is no plan to collect more data for bias drift monitoring, then the monitor should be stopped (and deleted) to avoid incurring additional charges. Note that deleting the monitor does not delete the data in S3.

[37]:
model_bias_monitor.stop_monitoring_schedule()
wait_for_execution_to_finish(model_bias_monitor)
model_bias_monitor.delete_monitoring_schedule()
sagemaker_session.delete_model(model_name)

Stopping Monitoring Schedule with name: monitoring-schedule-2023-01-21-01-41-12-517
Waiting for execution to finish
Done! Execution Status: CompletedWithViolations

Deleting Monitoring Schedule with name: monitoring-schedule-2023-01-21-01-41-12-517
