Amazon SageMaker Clarify Model Bias Monitor for Batch Transform



This notebook takes approximately 60 minutes to run.



Amazon SageMaker Model Monitor continuously monitors the quality of Amazon SageMaker machine learning models in production. It enables developers to set alerts for deviations in model quality. Early, proactive detection of these deviations enables corrective actions, such as retraining models, auditing upstream systems, or fixing data quality issues, without having to monitor models manually or build additional tooling.

Amazon SageMaker Clarify Model Bias Monitor is a model monitor that helps data scientists and ML engineers monitor predictions for bias on a regular basis. Bias can be introduced or exacerbated in deployed ML models when the training data differs from the data that the model sees during deployment (that is, the live data). These kinds of changes in the live data distribution might be temporary (for example, due to some short-lived, real-world events) or permanent. In either case, it might be important to detect these changes. For example, the outputs of a model for predicting home prices can become biased if the mortgage rates used to train the model differ from current, real-world mortgage rates. With bias drift detection capabilities in model monitor, when SageMaker detects bias beyond a certain threshold, it automatically generates metrics that you can view in SageMaker Studio and through Amazon CloudWatch alerts.

This notebook demonstrates the process for setting up a model monitor for continuous monitoring of bias drift of the data and model used by a regularly running SageMaker Batch Transform job. The model input and output are in CSV format.

In general, you can use the model bias monitor for batch transform as follows:

  1. Schedule a model bias monitor to monitor a data capture S3 location and a ground truth S3 location.

  2. Regularly run transform jobs with data capture enabled. The jobs save captured data to the data capture S3 URI.

  3. Regularly label the captured data, and then upload the ground truth labels to the ground truth S3 URI.

The monitor regularly executes processing jobs that merge the captured data and ground truth data, run bias analysis on the merged data, generate analysis reports, and publish metrics to CloudWatch.

General Setup

The notebook uses the SageMaker Python SDK. The following cell upgrades the SDK and its dependencies. If the notebook runs in SageMaker Studio, you may then need to restart the kernel and rerun the notebook to pick up the latest APIs.

[ ]:
!pip install -U sagemaker
!pip install -U boto3
!pip install -U botocore


The following cell imports the APIs to be used by the notebook.

[ ]:
import sagemaker
import pandas as pd
import datetime
import io
import json
import os
import random
import time
import pprint

Handful of configuration

To begin, ensure that these prerequisites have been completed.

  • Specify an AWS Region to host the model.

  • Specify an IAM role to execute jobs.

  • Define the S3 URIs that store the model file, input data, and output data. For demonstration purposes, this notebook uses the same bucket for all of them. In practice, they could be kept in separate buckets with different security policies.

[ ]:
sagemaker_session = sagemaker.Session()

region = sagemaker_session.boto_region_name
print(f"AWS region: {region}")

role = sagemaker.get_execution_role()
print(f"RoleArn: {role}")

# A different bucket can be used, but make sure the role for this notebook has
# the s3:PutObject permissions. This is the bucket into which the data is captured
bucket = sagemaker_session.default_bucket()
print(f"Demo Bucket: {bucket}")
prefix = sagemaker.utils.unique_name_from_base("sagemaker/DEMO-ClarifyModelMonitor")
print(f"Demo Prefix: {prefix}")
s3_key = f"s3://{bucket}/{prefix}"
print(f"Demo S3 key: {s3_key}")

data_capture_s3_uri = f"{s3_key}/data-capture"
ground_truth_s3_uri = f"{s3_key}/ground-truth"
transform_output_s3_uri = f"{s3_key}/transform-output"
baselining_output_s3_uri = f"{s3_key}/baselining-output"
monitor_output_s3_uri = f"{s3_key}/monitor-output"

print(f"The transform job will save the results to: {transform_output_s3_uri}")
print(f"The transform job will save the captured data to: {data_capture_s3_uri}")
print(f"You should upload the ground truth data to: {ground_truth_s3_uri}")
print(f"The baselining job will save the analysis results to: {baselining_output_s3_uri}")
print(f"The monitor will save the analysis results to: {monitor_output_s3_uri}")

Data files

This example includes two dataset files, both in CSV format.

  • The train dataset has a header row, and it has a target column followed by the feature columns.

  • The test dataset has no header row, and it only has feature columns.

[ ]:
train_dataset_path = "test_data/validation-dataset-with-header.csv"
test_dataset_path = "test_data/test-dataset-input-cols.csv"
dataset_type = "text/csv"
[ ]:
df = pd.read_csv(train_dataset_path)
[ ]:
# Read headers
all_headers = list(df.columns)
label_header = all_headers[0]

To verify that the execution role for this notebook has the necessary permissions to proceed, put a simple test object into the S3 bucket specified above. If this command fails, update the role to have s3:PutObject permission on the bucket and try again.

[ ]:
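# Put a small test object into the bucket; the object key is arbitrary.
sagemaker.s3.S3Uploader.upload_string_as_file_body(
    body="hello", desired_s3_uri=f"{s3_key}/upload-test-file.txt", sagemaker_session=sagemaker_session
)
print("Success! We are all set to proceed with uploading to S3.")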

Then upload the data files to S3 so that they can be used by SageMaker jobs.

[ ]:
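train_data_s3_uri = sagemaker.s3.S3Uploader.upload(
    local_path=train_dataset_path, desired_s3_uri=s3_key, sagemaker_session=sagemaker_session
)
print(f"Train data is uploaded to: {train_data_s3_uri}")
test_data_s3_uri = sagemaker.s3.S3Uploader.upload(
    local_path=test_dataset_path, desired_s3_uri=s3_key, sagemaker_session=sagemaker_session
)
print(f"Test data is uploaded to: {test_data_s3_uri}")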

SageMaker model

This example includes a pre-built SageMaker XGBoost model file trained by the XGBoost Churn Prediction Notebook. The following cell uploads the file to S3 and then creates a SageMaker model from it. The model supports the CSV data format: the input is a set of customer attributes, and the output is the probability of customer churn (a float between zero and one).

[ ]:
model_file = "model/xgb-churn-prediction-model.tar.gz"
model_url = sagemaker.s3.S3Uploader.upload(
    local_path=model_file, desired_s3_uri=s3_key, sagemaker_session=sagemaker_session
)
print(f"Model file has been uploaded to {model_url}")

model_name = sagemaker.utils.unique_name_from_base("DEMO-xgb-churn-pred-model-monitor")
print(f"SageMaker model name: {model_name}")

image_uri = sagemaker.image_uris.retrieve("xgboost", region, "0.90-1")
print(f"SageMaker XGBoost image: {image_uri}")

model = sagemaker.model.Model(image_uri=image_uri, model_data=model_url, role=role)
container_def = model.prepare_container_def()
sagemaker_session.create_model(model_name, role, container_def)
print("SageMaker model created")

Batch Transform Job

For continuous monitoring, batch transform jobs should be executed regularly with the latest data. For demonstration purposes, however, the following cell executes the job only once, before the monitor is scheduled, so that the first monitoring execution has captured data to process.

See Transformer for the API reference. Highlights:

  • destination_s3_uri is used to specify the data capture S3 URI which is a key connection between the job and the monitor.

  • join_source must be set to “Input” for the transform output to include predictions (model output) as well as features (model input), because model bias monitor requires both.

  • generate_inference_id must be set to True for the transform output to include a unique ID for each record. Model bias monitor requires both predicted labels and ground truth labels, so it needs the ID to join the captured data and the ground truth data.

NOTE: The following cell takes about 5 minutes to run.

[ ]:
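transformer = model.transformer(
    instance_count=1,  # Instance settings are assumptions; adjust them to your workload
    instance_type="ml.m5.xlarge",
    accept=dataset_type,  # The transform output data format
    assemble_with="Line",  # CSV records are terminated by new lines
    output_path=transform_output_s3_uri,
)

transformer.transform(
    data=test_data_s3_uri,
    content_type=dataset_type,  # The transform input format
    split_type="Line",  # CSV records are terminated by new lines
    join_source="Input",  # Include model input (features) in transform output
    batch_data_capture_config=sagemaker.inputs.BatchDataCaptureConfig(
        destination_s3_uri=data_capture_s3_uri,
        generate_inference_id=True,  # Inference ID is mandatory to join the captured data and the ground truth data
    ),
    wait=True,  # In real world you don't have to wait, but for demo purpose we wait for the output
    logs=False,  # You can change it to True to view job logs inline
)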

Captured data

Once the transform job has completed, an “output” folder is created under data_capture_s3_uri to hold the captured data files of the transform output. Note that, unlike endpoint data capture, batch transform data capture does not copy the data itself, because that would create a tremendous amount of duplication. Instead, it generates manifest files that refer to the transform output S3 location.
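
To illustrate how a manifest can point back at the transform output, here is a minimal sketch. It assumes the manifest is a JSON list whose first element holds a `prefix` and whose remaining elements are paths relative to that prefix. The helper name `resolve_manifest`, the bucket, and the file name are hypothetical and not taken from a real job.

```python
import json


def resolve_manifest(manifest_text):
    """Expand a manifest into full S3 URIs.

    Assumed layout: [{"prefix": "s3://..."}, "relative/path", ...]
    """
    entries = json.loads(manifest_text)
    prefix = entries[0]["prefix"].rstrip("/")
    return [f"{prefix}/{relative.lstrip('/')}" for relative in entries[1:]]


# Hypothetical manifest content, for illustration only.
sample_manifest = json.dumps(
    [{"prefix": "s3://example-bucket/transform-output"}, "test-dataset-input-cols.csv.out"]
)
resolved = resolve_manifest(sample_manifest)
print(resolved)
```

Because only references are stored, the captured data stays small no matter how large the transform output is.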

Now list the captured data files stored in Amazon S3. There should be different files from different time periods organized based on the hour in which the batch transformation occurred. The format of the Amazon S3 path is:


[ ]:
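data_capture_output = f"{data_capture_s3_uri}/output"
captured_data_files = sorted(
    sagemaker.s3.S3Downloader.list(data_capture_output, sagemaker_session=sagemaker_session)
)
print("Found capture data files:")
print("\n ".join(captured_data_files[-5:]))
[ ]:
# Read the most recent captured data (manifest) file
data_capture_output_dict = json.loads(
    sagemaker.s3.S3Downloader.read_file(captured_data_files[-1], sagemaker_session=sagemaker_session)
)
print(json.dumps(data_capture_output_dict, indent=4))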

Transform output

The captured data file refers to the transform output .out file. The cell below shows the first few records of the file.

  • The first columns are the feature columns (the model input), because the join_source parameter is set to “Input”.

  • Then there is the prediction column (the model output).

  • The second-to-last column is the inference ID, and the last column is the inference time (the start time of the transform job). They are available because the generate_inference_id parameter is set to True.
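
The column layout described above can be sketched as follows. The sample row is hypothetical (two feature columns only, with a made-up ID and timestamp), and `split_transform_row` is an illustrative helper rather than part of the SDK.

```python
import csv
import io


def split_transform_row(row):
    """Split one joined output row into features, prediction, inference ID, and inference time."""
    *features, prediction, inference_id, inference_time = row
    return features, float(prediction), inference_id, inference_time


# Hypothetical record; real rows carry every model input column before the prediction.
sample = "128,25,0.0142,b6a2f2f0-0000-4000-8000-000000000001,2023-01-01T00:00:00Z\n"
row = next(csv.reader(io.StringIO(sample)))
features, prediction, inference_id, inference_time = split_transform_row(row)
print(features, prediction, inference_id, inference_time)
```

Indexing from the end of the row keeps the helper independent of the number of feature columns.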

[ ]:
transform_output = os.path.join(data_capture_output_dict[0]["prefix"], data_capture_output_dict[1])
transform_output_content = sagemaker.s3.S3Downloader.read_file(
    transform_output, sagemaker_session=sagemaker_session
)
transform_output_df = pd.read_csv(io.StringIO(transform_output_content), header=None)
transform_output_df.head()

Ground Truth Data

Besides captured data, a bias drift monitoring execution also requires ground truth data. In real use cases, you should regularly label the captured data and then upload the ground truth data (labels) to the designated S3 location. For demonstration purposes, this example notebook generates fake ground truth data following this schema, and then uploads it to ground_truth_s3_uri, which is another key input to the monitor. The bias drift monitoring execution first merges the captured data and the ground truth data, and then runs bias analysis on the merged data.

[ ]:
def ground_truth_with_id(inference_id):
    random.seed(inference_id)  # to get consistent results
    # format required by the merge job and bias monitoring job
    return {
        "groundTruthData": {
            # randomly generate positive labels 70% of the time
            "data": "1" if random.random() < 0.7 else "0",
            "encoding": "CSV",
        },
        "eventMetadata": {
            # the id is used to join the captured data and the ground truth data
            "eventId": str(inference_id),
        },
        "eventVersion": "0",
    }


def upload_ground_truth(upload_time, upload_path, inference_ids):
    records = [ground_truth_with_id(inference_id) for inference_id in inference_ids]
    fake_records = [json.dumps(r) for r in records]
    data_to_upload = "\n".join(fake_records)
    target_s3_uri = f"{upload_path}/{upload_time:%Y/%m/%d/%H/%M%S}.jsonl"
    print(f"Uploading {len(fake_records)} records to", target_s3_uri)
    sagemaker.s3.S3Uploader.upload_string_as_file_body(
        body=data_to_upload, desired_s3_uri=target_s3_uri, sagemaker_session=sagemaker_session
    )
[ ]:
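now = datetime.datetime.utcnow()
inference_ids = list(transform_output_df.iloc[:, -2])  # get inference ID from the captured data
# Generate data for the last hour, in case the first monitoring execution is in this hour
upload_ground_truth(
    upload_time=now - datetime.timedelta(hours=1),
    upload_path=ground_truth_s3_uri,
    inference_ids=inference_ids,
)
# Generate data for this hour, in case the first monitoring execution will be in the next hour
upload_ground_truth(upload_time=now, upload_path=ground_truth_s3_uri, inference_ids=inference_ids)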

Model Bias Monitor

Similar to the other monitoring types, the standard procedure for creating a bias drift monitor is to first run a baselining job and then schedule the monitor.

A bias drift monitoring execution starts a merge job that joins the captured data and ground truth data using the inference ID. Then a SageMaker Clarify bias analysis job is started to compute all the pre-training bias metrics and post-training bias metrics on the merged data. The max execution time is divided equally between the two jobs. Because the notebook schedules an hourly model bias monitor, the max_runtime_in_seconds parameter should not exceed 1800 seconds.
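
Conceptually, the merge step is a join on the inference ID. The sketch below is only an illustration of that idea, not the service's implementation. The function name, the `(inference_id, prediction)` tuple shape for captured rows, and the sample IDs are assumptions; the ground truth records follow the structure generated earlier in this notebook.

```python
def merge_by_inference_id(captured, ground_truth):
    """Join captured predictions with ground truth labels on the inference ID."""
    labels = {
        record["eventMetadata"]["eventId"]: record["groundTruthData"]["data"]
        for record in ground_truth
    }
    merged = []
    for inference_id, prediction in captured:
        if inference_id in labels:  # rows without a label are left out of this sketch
            merged.append(
                {"eventId": inference_id, "prediction": prediction, "label": labels[inference_id]}
            )
    return merged


# Hypothetical inputs, for illustration only.
captured_rows = [("id-1", 0.9), ("id-2", 0.1), ("id-3", 0.5)]
truth_records = [
    {"groundTruthData": {"data": "1", "encoding": "CSV"}, "eventMetadata": {"eventId": "id-1"}, "eventVersion": "0"},
    {"groundTruthData": {"data": "0", "encoding": "CSV"}, "eventMetadata": {"eventId": "id-2"}, "eventVersion": "0"},
]
merged_rows = merge_by_inference_id(captured_rows, truth_records)
print(merged_rows)
```

This is why the inference ID must appear in both the transform output and the ground truth records: it is the only key the two datasets share.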

[ ]:
model_bias_monitor = sagemaker.model_monitor.ModelBiasMonitor(
    role=role, sagemaker_session=sagemaker_session, max_runtime_in_seconds=1800
)

Baselining job

A baselining job runs predictions on the training dataset and suggests constraints. The suggest_baseline() method of ModelBiasMonitor starts a SageMaker Clarify processing job to generate the constraints.

This step is not mandatory, but providing a constraints file to the monitor enables it to generate a violations file.


Information about the input data needs to be provided to the processor.

DataConfig stores information about the dataset to be analyzed, for example, the dataset file, its format (like CSV), headers, and label.

[ ]:
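data_config = sagemaker.clarify.DataConfig(
    s3_data_input_path=train_data_s3_uri,
    s3_output_path=baselining_output_s3_uri,
    label=label_header,
    headers=all_headers,
    dataset_type=dataset_type,
)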

ModelConfig stores the configuration of the model used for inference. To compute post-training bias metrics, the computation needs to get inferences from the SageMaker model. To accomplish this, the processing job uses the model to create an ephemeral endpoint (also known as a “shadow endpoint”). The processing job deletes the shadow endpoint after the computations are completed.

[ ]:
model_config = sagemaker.clarify.ModelConfig(
    model_name=model_name,  # The name of the SageMaker model
    instance_type="ml.m5.xlarge",  # The instance type of the shadow endpoint
    instance_count=1,  # The instance count of the shadow endpoint
    content_type=dataset_type,  # The data format of the model input
    accept_type=dataset_type,  # The data format of the model output
)

ModelPredictedLabelConfig specifies how to extract the predicted label from the model output. The example model returns a single probability value between 0 and 1, so:

  • The probability parameter is set to zero, which is the index of the probability value in the CSV model output.

  • The probability_threshold parameter is used by post-training analysis to convert the probability/score to a binary predicted label (0 or 1). The default value is 0.5. This notebook uses an arbitrary cutoff of 0.8, i.e. a probability value > 0.8 means the customer will churn (predicted label 1).
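
The thresholding rule described above amounts to a single comparison. Here is a minimal sketch; the function name `to_predicted_label` is hypothetical, and the 0.8 default mirrors the cutoff chosen in this notebook.

```python
def to_predicted_label(probability, threshold=0.8):
    """Convert a churn probability to a binary label: 1 if strictly above the threshold."""
    return 1 if probability > threshold else 0


# Hypothetical model outputs, for illustration only.
sample_probabilities = [0.95, 0.8, 0.3]
sample_labels = [to_predicted_label(p) for p in sample_probabilities]
print(sample_labels)
```

Note that a probability exactly equal to the threshold maps to label 0 under the strict `>` comparison stated above.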

[ ]:
model_predicted_label_config = sagemaker.clarify.ModelPredictedLabelConfig(
    probability=0,  # The zero-based index of the probability (score) in model output
    probability_threshold=0.8,  # Probabilities above this cutoff map to predicted label 1
)

BiasConfig is the configuration of the sensitive groups in the dataset. Typically, bias is measured by computing a metric and comparing it across groups. The group of interest is specified using the facet configuration. With the following facet, the baselining job checks whether the model favors new customers (recently created accounts).
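
As one illustration of comparing a metric across groups, the sketch below computes the difference in positive proportions in predicted labels (DPPL) between the records outside the facet and the records inside it. The sample labels, the boolean facet flags, and the helper names are made up for illustration; the baselining job computes this and other metrics for you.

```python
def positive_proportion(labels):
    """Fraction of records with a positive (1) label."""
    return sum(labels) / len(labels)


def dppl(predicted_labels, in_facet):
    """DPPL: positive proportion outside the facet minus positive proportion inside it."""
    inside = [y for y, flag in zip(predicted_labels, in_facet) if flag]
    outside = [y for y, flag in zip(predicted_labels, in_facet) if not flag]
    return positive_proportion(outside) - positive_proportion(inside)


# Hypothetical predicted labels; True marks records that belong to the facet of interest.
predicted = [1, 0, 1, 1, 0, 0]
facet_flags = [True, True, True, False, False, False]
dppl_value = dppl(predicted, facet_flags)
print(round(dppl_value, 4))
```

A value near zero means both groups receive positive predictions at a similar rate. In this sample the value is negative, because the facet group receives positive predictions more often than the rest.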

[ ]:
bias_config = sagemaker.clarify.BiasConfig(
    facet_name="Account Length",

Kick off baselining job

Call the suggest_baseline() method to start the baselining job. The job computes all pre-training bias metrics and post-training bias metrics.

[ ]:
model_bias_monitor.suggest_baseline(
    bias_config=bias_config,
    data_config=data_config,
    model_config=model_config,
    model_predicted_label_config=model_predicted_label_config,
)

NOTE: The following cell waits until the baselining job is completed (in about 10 minutes). It then inspects the suggested constraints. This step can be skipped, because the monitor to be scheduled will automatically pick up baselining job name and wait for it before monitoring execution.

[ ]:
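model_bias_monitor.latest_baselining_job.wait(logs=False)  # Block until the baselining job completes
model_bias_constraints = model_bias_monitor.suggested_constraints()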
print(f"Suggested constraints: {model_bias_constraints.file_s3_uri}")

Monitoring Schedule

With the above constraints collected, call the create_monitoring_schedule() method to schedule an hourly model bias monitor.

If a baselining job has been submitted, then the monitor object automatically picks up the analysis configuration from the baselining job. But if the baselining step is skipped, or if the captured dataset differs in nature from the training dataset, then the analysis configuration has to be provided.

BiasAnalysisConfig is a subset of the configuration of the baselining job. Many options are not needed because:

  • Model bias monitor will merge the captured data and the ground truth data, and then use the merged data as the input dataset.

  • Capture data already includes predictions, so there is no need to create a shadow endpoint.

  • Attributes like probability threshold are provided as part of BatchTransformInput.


Highlights of the monitoring schedule configuration:

  • data_capture_s3_uri is the location of data captured by the batch transform job

  • ground_truth_s3_uri is the location of ground truth data

  • probability_attribute stores the index of the probability value in model output. (Similar to the probability parameter of ModelPredictedLabelConfig.)

  • probability_threshold_attribute is the same as the probability_threshold parameter of ModelPredictedLabelConfig.

[ ]:
schedule_expression = sagemaker.model_monitor.CronExpressionGenerator.hourly()
[ ]:
model_bias_analysis_config = None
if not model_bias_monitor.latest_baselining_job:
    model_bias_analysis_config = sagemaker.clarify.BiasAnalysisConfig(
        # look back 6 hour for transform job output.
print(f"Model bias monitoring schedule: {model_bias_monitor.monitoring_schedule_name}")

Wait for the first execution

The schedule starts jobs at the previously specified intervals. The code below waits until the time crosses the hour boundary (in UTC) to see executions kick off.

Note: Even for an hourly schedule, Amazon SageMaker has a buffer period of 20 minutes to schedule executions. The execution might start anywhere from zero to ~20 minutes after the hour boundary. This is expected and done for load balancing in the backend.
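
To get a feel for how long the wait can be, the following sketch computes the time remaining until the next hour boundary and adds the 20 minute buffer mentioned above. The function name and the sample timestamp are hypothetical; only the hourly boundary and the buffer come from the text.

```python
import datetime


def seconds_until_next_hour(now):
    """Seconds from `now` until the next hour boundary."""
    next_hour = (now + datetime.timedelta(hours=1)).replace(minute=0, second=0, microsecond=0)
    return (next_hour - now).total_seconds()


# Hypothetical current time (UTC), for illustration only.
sample_now = datetime.datetime(2023, 1, 1, 10, 45, 30)
earliest_start = seconds_until_next_hour(sample_now)  # execution may start on the hour
latest_start = earliest_start + 20 * 60  # or up to ~20 minutes later
print(earliest_start, latest_start)
```

If the schedule is created just after an hour boundary, the earliest start is almost a full hour away, so with the buffer the total wait can exceed 60 minutes.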

[ ]:
def wait_for_execution_to_start(model_monitor):
    print(
        "An hourly schedule was created above and it will kick off executions ON the hour (plus 0 - 20 min buffer)."
    )

    print("Waiting for the first execution to happen", end="")
    schedule_desc = model_monitor.describe_schedule()
    while "LastMonitoringExecutionSummary" not in schedule_desc:
        schedule_desc = model_monitor.describe_schedule()
        print(".", end="", flush=True)
        time.sleep(60)  # Polling interval is a judgment call
    print()
    print("Done! Execution has been created")

    print("Now waiting for execution to start", end="")
    # Compare with ==; `in "Pending"` would be a substring test
    while schedule_desc["LastMonitoringExecutionSummary"]["MonitoringExecutionStatus"] == "Pending":
        schedule_desc = model_monitor.describe_schedule()
        print(".", end="", flush=True)
        time.sleep(10)

    print()
    print("Done! Execution has started")

NOTE: The following cell waits until the first monitoring execution is started. As explained above, the wait could take more than 60 minutes.

[ ]:
wait_for_execution_to_start(model_bias_monitor)

In the real world, a monitoring schedule is supposed to be active all the time. In this example, however, it can be stopped to avoid incurring extra charges. A stopped schedule will not trigger further executions, but the ongoing execution will continue. If needed, the schedule can be restarted with start_monitoring_schedule().

[ ]:
model_bias_monitor.stop_monitoring_schedule()

Wait for the execution to finish

In the previous cell, the first execution has started. This section waits for the execution to finish so that its analysis results are available. Here are the possible terminal states and what each of them means:

  • Completed - The monitoring execution completed, and no issues were found in the violations report.

  • CompletedWithViolations - The execution completed, but constraint violations were detected.

  • Failed - The monitoring execution failed, maybe due to a client error (perhaps incorrect role permissions) or infrastructure issues. Further examination of FailureReason and ExitMessage is necessary to identify what exactly happened.

  • Stopped - The job exceeded the max runtime or was manually stopped.

[ ]:
# Waits for the schedule to have last execution in a terminal status.
def wait_for_execution_to_finish(model_monitor):
    schedule_desc = model_monitor.describe_schedule()
    execution_summary = schedule_desc.get("LastMonitoringExecutionSummary")
    if execution_summary is not None:
        print("Waiting for execution to finish", end="")
        while execution_summary["MonitoringExecutionStatus"] not in [
            "Completed",
            "CompletedWithViolations",
            "Failed",
            "Stopped",
        ]:
            print(".", end="", flush=True)
            time.sleep(60)  # Polling interval is a judgment call
            schedule_desc = model_monitor.describe_schedule()
            execution_summary = schedule_desc["LastMonitoringExecutionSummary"]
        print()
        print(f"Done! Execution Status: {execution_summary['MonitoringExecutionStatus']}")
    else:
        print("Last execution not found")

NOTE: The following cell takes about 10 minutes.

[ ]:
wait_for_execution_to_finish(model_bias_monitor)

Merged data

Merged data is the intermediate result of a bias drift monitoring execution. It is saved to JSON Lines files under the “merge” folder of monitor_output_s3_uri. Each line is a valid JSON object that combines the captured data and the ground truth data.

[ ]:
merged_data_s3_uri = f"{monitor_output_s3_uri}/merge"
merged_data_files = sorted(
    sagemaker.s3.S3Downloader.list(merged_data_s3_uri, sagemaker_session=sagemaker_session)
)
print("Found merged data files:")
print("\n ".join(merged_data_files[-5:]))

The following cell prints a single line of a merged data file.

  • eventId is the inference ID from the captured data and the ground truth data

  • groundTruthData is from the ground truth data

  • captureData is from the captured data. In this case, the data of batchTransformOutput is from the transform output.

[ ]:
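merged_record = sagemaker.s3.S3Downloader.read_file(
    merged_data_files[-1], sagemaker_session=sagemaker_session
).splitlines()[-1]  # Take one JSON line from the most recent merged data file
print(json.dumps(json.loads(merged_record), indent=4))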

Inspect execution results

List the generated reports:

  • analysis.json includes all the bias metrics.

  • report.* files are static report files to visualize the bias metrics.

[ ]:
schedule_desc = model_bias_monitor.describe_schedule()
execution_summary = schedule_desc.get("LastMonitoringExecutionSummary")
if execution_summary and execution_summary["MonitoringExecutionStatus"] in [
    "Completed",
    "CompletedWithViolations",
]:
    last_model_bias_monitor_execution = model_bias_monitor.list_executions()[-1]
    last_model_bias_monitor_execution_report_uri = (
        last_model_bias_monitor_execution.output.destination
    )
    print(f"Report URI: {last_model_bias_monitor_execution_report_uri}")
    last_model_bias_monitor_execution_report_files = sorted(
        sagemaker.s3.S3Downloader.list(last_model_bias_monitor_execution_report_uri)
    )
    print("Found Report Files:")
    print("\n ".join(last_model_bias_monitor_execution_report_files))
else:
    last_model_bias_monitor_execution = None
    print(
        "====STOP==== \n No completed executions to inspect further. Please wait till an execution completes or investigate previously reported failures."
    )

If there are any violations compared to the baseline, they are listed here. See Bias Drift Violations for the schema of the file, and how violations are detected.

[ ]:
violations = model_bias_monitor.latest_monitoring_constraint_violations()
if violations is not None:
    pprint.PrettyPrinter(indent=4).pprint(violations.body_dict)

By default, the analysis results are also published to CloudWatch. See CloudWatch Metrics for Bias Drift Analysis.


If there is no plan to collect more data for bias drift monitoring, then the monitor should be stopped (and deleted) to avoid incurring additional charges. Note that deleting the monitor does not delete the data in S3.

[ ]:
model_bias_monitor.stop_monitoring_schedule()
wait_for_execution_to_finish(model_bias_monitor)  # Let any ongoing execution finish first
model_bias_monitor.delete_monitoring_schedule()
