Get started with SageMaker Processing

This notebook corresponds to the section “Preprocessing Data With The Built-In Scikit-Learn Container” in the blog post Amazon SageMaker Processing – Fully Managed Data Processing and Model Evaluation. It shows a lightweight example of using SageMaker Processing to split a dataset into train, validation, and test sets, which are then written back to S3.

Runtime

This notebook takes approximately 5 minutes to run.

Contents

  1. Prepare resources

  2. Download data

  3. Prepare Processing script

  4. Run Processing job

  5. Conclusion

Prepare resources

First, let’s create an SKLearnProcessor object, passing the scikit-learn version we want to use, as well as our managed infrastructure requirements.

[2]:
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.sklearn.processing import SKLearnProcessor

region = sagemaker.Session().boto_region_name
role = get_execution_role()
sklearn_processor = SKLearnProcessor(
    framework_version="1.0-1", role=role, instance_type="ml.m5.xlarge", instance_count=1
)

Download data

Read in the raw data from a public S3 bucket. This example uses the Census-Income (KDD) Dataset from the UCI Machine Learning Repository.

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

[3]:
import pandas as pd

s3 = boto3.client("s3")
s3.download_file(
    "sagemaker-sample-data-{}".format(region),
    "processing/census/census-income.csv",
    "census-income.csv",
)
df = pd.read_csv("census-income.csv")
df.to_csv("dataset.csv")
df.head()
[3]:
age class of worker detailed industry recode detailed occupation recode education wage per hour enroll in edu inst last wk marital stat major industry code major occupation code ... country of birth father country of birth mother country of birth self citizenship own business or self employed fill inc questionnaire for veteran's admin veterans benefits weeks worked in year year income
0 73 Not in universe 0 0 High school graduate 0 Not in universe Widowed Not in universe or children Not in universe ... United-States United-States United-States Native- Born in the United States 0 Not in universe 2 0 95 - 50000.
1 58 Self-employed-not incorporated 4 34 Some college but no degree 0 Not in universe Divorced Construction Precision production craft & repair ... United-States United-States United-States Native- Born in the United States 0 Not in universe 2 52 94 - 50000.
2 18 Not in universe 0 0 10th grade 0 High school Never married Not in universe or children Not in universe ... Vietnam Vietnam Vietnam Foreign born- Not a citizen of U S 0 Not in universe 2 0 95 - 50000.
3 9 Not in universe 0 0 Children 0 Not in universe Never married Not in universe or children Not in universe ... United-States United-States United-States Native- Born in the United States 0 Not in universe 0 0 94 - 50000.
4 10 Not in universe 0 0 Children 0 Not in universe Never married Not in universe or children Not in universe ... United-States United-States United-States Native- Born in the United States 0 Not in universe 0 0 94 - 50000.

5 rows × 42 columns

Prepare Processing script

Write the Python script that the SageMaker Processing job will run. The job stages the input file from S3 into the container at /opt/ml/processing/input; the script reads it from there, splits the rows into train, validation, and test sets, and writes the three output files to local output directories, which the job then uploads to S3.

[4]:
%%writefile preprocessing.py
import pandas as pd
import os
from sklearn.model_selection import train_test_split

input_data_path = os.path.join("/opt/ml/processing/input", "dataset.csv")
df = pd.read_csv(input_data_path)
print("Shape of data is:", df.shape)
train, test = train_test_split(df, test_size=0.2)
train, validation = train_test_split(train, test_size=0.2)

try:
    os.makedirs("/opt/ml/processing/output/train")
    os.makedirs("/opt/ml/processing/output/validation")
    os.makedirs("/opt/ml/processing/output/test")
    print("Successfully created directories")
except Exception as e:
    # The directories may already exist (the Processing job can create them),
    # or they may not be creatable at all; either way, log and continue.
    print(e)
    print("Could not make directories")

try:
    train.to_csv("/opt/ml/processing/output/train/train.csv")
    validation.to_csv("/opt/ml/processing/output/validation/validation.csv")
    test.to_csv("/opt/ml/processing/output/test/test.csv")
    print("Wrote files successfully")
except Exception as e:
    print("Failed to write the files")
    print(e)

print("Completed running the processing job")
Writing preprocessing.py
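The two chained 80/20 calls to train_test_split above yield an overall split of roughly 64% train, 16% validation, and 20% test. A quick arithmetic sketch of the resulting sizes (the row count is taken from the shape the job logs for this dataset; exact counts may differ by a row or two depending on rounding inside train_test_split):

```python
# Two chained 80/20 splits: compute the approximate train/validation/test sizes.
n = 199_523  # row count of the census dataset

train_plus_val = round(n * 0.8)           # first split keeps 80% for train+validation
test = n - train_plus_val                 # the remaining 20% becomes the test set
validation = round(train_plus_val * 0.2)  # second split takes 20% of that 80%
train = train_plus_val - validation

print(train, validation, test)
print(round(train / n, 2), round(validation / n, 2), round(test / n, 2))  # ~0.64 / 0.16 / 0.2
```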

Run Processing job

Run the Processing job, specifying the script, the input file, and the output directories.

[5]:
%%capture output

from sagemaker.processing import ProcessingInput, ProcessingOutput

sklearn_processor.run(
    code="preprocessing.py",
    # arguments = ["arg1", "arg2"], # Arguments can optionally be specified here
    inputs=[ProcessingInput(source="dataset.csv", destination="/opt/ml/processing/input")],
    outputs=[
        ProcessingOutput(source="/opt/ml/processing/output/train"),
        ProcessingOutput(source="/opt/ml/processing/output/validation"),
        ProcessingOutput(source="/opt/ml/processing/output/test"),
    ],
)

Get the Processing job logs and retrieve the job name.

[6]:
print(output)
job_name = str(output).split("\n")[1].split(" ")[-1]

Job Name:  sagemaker-scikit-learn-2022-04-18-00-09-00-899
Inputs:  [{'InputName': 'input-1', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-us-west-2-000000000000/sagemaker-scikit-learn-2022-04-18-00-09-00-899/input/input-1/dataset.csv', 'LocalPath': '/opt/ml/processing/input', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-us-west-2-000000000000/sagemaker-scikit-learn-2022-04-18-00-09-00-899/input/code/preprocessing.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'output-1', 'AppManaged': False, 'S3Output': {'S3Uri': 's3://sagemaker-us-west-2-000000000000/sagemaker-scikit-learn-2022-04-18-00-09-00-899/output/output-1', 'LocalPath': '/opt/ml/processing/output/train', 'S3UploadMode': 'EndOfJob'}}, {'OutputName': 'output-2', 'AppManaged': False, 'S3Output': {'S3Uri': 's3://sagemaker-us-west-2-000000000000/sagemaker-scikit-learn-2022-04-18-00-09-00-899/output/output-2', 'LocalPath': '/opt/ml/processing/output/validation', 'S3UploadMode': 'EndOfJob'}}, {'OutputName': 'output-3', 'AppManaged': False, 'S3Output': {'S3Uri': 's3://sagemaker-us-west-2-000000000000/sagemaker-scikit-learn-2022-04-18-00-09-00-899/output/output-3', 'LocalPath': '/opt/ml/processing/output/test', 'S3UploadMode': 'EndOfJob'}}]
...........................
Shape of data is: (199523, 43)
[Errno 17] File exists: '/opt/ml/processing/output/train'
Could not make directories
Wrote files successfully
Completed running the processing job
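Splitting the captured text on newlines and spaces, as in the cell above, breaks if the log layout shifts even slightly. A regex over the same text is a little more forgiving; a sketch against the "Job Name:" line format shown above:

```python
import re

# First lines of the captured job description, in the format printed above.
captured = (
    "\nJob Name:  sagemaker-scikit-learn-2022-04-18-00-09-00-899\n"
    "Inputs:  [...]\n"
)

# Capture the first non-whitespace token after "Job Name:".
match = re.search(r"Job Name:\s+(\S+)", captured)
job_name = match.group(1) if match else None
print(job_name)  # sagemaker-scikit-learn-2022-04-18-00-09-00-899
```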

Confirm that the output dataset files were written to S3.

[7]:
import boto3

s3_client = boto3.client("s3")
default_bucket = sagemaker.Session().default_bucket()
for i in range(1, 4):
    prefix = s3_client.list_objects(
        Bucket=default_bucket, Prefix=job_name + "/output/output-" + str(i) + "/"
    )["Contents"][0]["Key"]
    print("s3://" + default_bucket + "/" + prefix)
s3://sagemaker-us-west-2-000000000000/sagemaker-scikit-learn-2022-04-18-00-09-00-899/output/output-1/train.csv
s3://sagemaker-us-west-2-000000000000/sagemaker-scikit-learn-2022-04-18-00-09-00-899/output/output-2/validation.csv
s3://sagemaker-us-west-2-000000000000/sagemaker-scikit-learn-2022-04-18-00-09-00-899/output/output-3/test.csv
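When outputs are left unnamed, the job uploads them under output-N prefixes below <default bucket>/<job name>/output/, as the listing above shows. Assuming that default layout holds, the URIs can also be composed directly without listing S3 (a sketch, not an official SageMaker API):

```python
def processing_output_uri(bucket: str, job_name: str, index: int, filename: str) -> str:
    """Compose the default S3 URI for the Nth unnamed Processing output file."""
    return f"s3://{bucket}/{job_name}/output/output-{index}/{filename}"

# Reconstruct the train-set URI printed above from its parts.
uri = processing_output_uri(
    "sagemaker-us-west-2-000000000000",
    "sagemaker-scikit-learn-2022-04-18-00-09-00-899",
    1,
    "train.csv",
)
print(uri)
```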

Conclusion

In this notebook, we read a dataset from S3 and processed it into train, test, and validation sets using a SageMaker Processing job. You can extend this example for preprocessing your own datasets in preparation for machine learning or other applications.