Ingest Tabular Data

When ingesting structured data from an existing S3 bucket into a SageMaker Notebook, there are multiple ways to handle it. We will introduce the following methods to access your data from the notebook:

  • Copying your data to your instance. If you are dealing with a normal size of data or are simply experimenting, you can copy the files into the SageMaker instance and just use it as a file system in your local machine.

  • Using Python packages to directly access your data without copying it. One downside of copying your data to your instance is: if you are done with your notebook instance and delete it, all the data is gone with it unless you store it elsewhere. We will introduce several methods to solve this problem in this notebook, and using python packages is one of them. Also, if you have large data sets (for example, with millions of rows), you can directly read data from S3 utilizing S3 compatible python libraries with built-in functions.

  • Using AWS native methods to directly access your data. You can also use AWS native packages like s3fs and aws data wrangler to access your data directly.

We will demonstrate how to ingest the following tabular (structured) into a notebook for further analysis: ## Tabular data: California Housing Data The California Housing dataset contains information from the 1990 California census. We will use the data set to showcase how to ingest tabular data into S3, and for further pre-processing and feature engineering. The dataset contains the following columns:

  • MedInc - average income.

  • HouseAge - housing average age.

  • AveRooms - average rooms.

  • AveBedrms - average bedrooms.

  • Population - population.

  • AveOccup - average occupation.

  • Latitude - latitude.

  • Longitude - longitude.

The California Housing dataset was originally published in:

Pace, R. Kelley, and Ronald Barry. “Sparse spatial autoregressions.” Statistics & Probability Letters 33.3 (1997): 291-297.

Download data from online resources and write data to S3

[ ]:
%pip install -qU 'sagemaker>=2.15.0' 's3fs==0.4.2' 'awswrangler==1.2.0'
# you would need s3fs version > 0.4.0 for aws data wrangler to work correctly
[ ]:
import awswrangler as wr
import pandas as pd
import s3fs
import sagemaker

# to load the California housing dataset
from sklearn.datasets import fetch_california_housing
[ ]:
# Get SageMaker session & default S3 bucket
sagemaker_session = sagemaker.Session()
s3 = sagemaker_session.boto_session.resource("s3")
bucket = sagemaker_session.default_bucket()  # replace with your own bucket name if you have one
prefix = "data/tabular/california_housing"
filename = "california_housing.csv"
[ ]:
# helper functions to upload data to s3
def write_to_s3(filename, bucket, prefix):
    # put one file in a separate folder. This is helpful if you read and prepare data with Athena
    filename_key = filename.split(".")[0]
    key = "{}/{}/{}".format(prefix, filename_key, filename)
    return s3.Bucket(bucket).upload_file(filename, key)


def upload_to_s3(bucket, prefix, filename):
    url = "s3://{}/{}/{}".format(bucket, prefix, filename)
    print("Writing to {}".format(url))
    write_to_s3(filename, bucket, prefix)
[ ]:
# download files from tabular data source location
tabular_data = fetch_california_housing()
tabular_data_full = pd.DataFrame(tabular_data.data, columns=tabular_data.feature_names)
tabular_data_full["target"] = pd.DataFrame(tabular_data.target)
tabular_data_full.to_csv("california_housing.csv", index=False)
[ ]:
upload_to_s3(bucket, "data/tabular", filename)

Ingest Tabular Data from S3 bucket

Method 1: Copying data to the Instance

You can use AWS Command Line Interface (CLI) to copy your data from s3 to your SageMaker instance and copy files between your S3 buckets. This is a quick and easy approach when you are dealing with medium-sized data files, or you are experimenting and doing exploratory analysis. The documentation can be found here.

[ ]:
# copy data to your sagemaker instance using AWS CLI
!aws s3 cp s3://$bucket/$prefix/ $prefix/ --recursive
[ ]:
data_location = "{}/{}".format(prefix, filename)
tabular_data = pd.read_csv(data_location, nrows=5)
tabular_data.head()

Method 2: Use AWS compatible Python Packages

When you are dealing with large data sets, or do not want to lose any data when you delete your SageMaker Notebook Instance, you can use pre-built packages to access your files in S3 without copying files into your instance. These packages, such as Pandas, have implemented options to access data with a specified path string: while you will use file:// on your local file system, you will use s3:// instead to access the data through the AWS boto library. For pandas, any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected.You can find additional documentation here.

[ ]:
data_s3_location = "s3://{}/{}/{}".format(bucket, prefix, filename)  # S3 URL
s3_tabular_data = pd.read_csv(data_s3_location, nrows=5)
s3_tabular_data.head()

Method 3: Use AWS native methods

3.1 s3fs

S3Fs is a Pythonic file interface to S3. It builds on top of botocore. The top-level class S3FileSystem holds connection information and allows typical file-system style operations like cp, mv, ls, du, glob, etc., as well as put/get of local files to/from S3.

[ ]:
fs = s3fs.S3FileSystem()
data_s3fs_location = "s3://{}/{}/".format(bucket, prefix)
# To List all files in your accessible bucket
fs.ls(data_s3fs_location)
[ ]:
# open it directly with s3fs
data_s3fs_location = "s3://{}/{}/{}".format(bucket, prefix, filename)  # S3 URL
with fs.open(data_s3fs_location) as f:
    print(pd.read_csv(f, nrows=5))

3.2 AWS Data Wrangler

AWS Data Wrangler is an open-source Python library that extends the power of the Pandas library to AWS connecting DataFrames and AWS data related services (Amazon Redshift, AWS Glue, Amazon Athena, Amazon EMR, Amazon QuickSight, etc), which we will cover in later sections. It is built on top of other open-source projects like Pandas, Apache Arrow, Boto3, s3fs, SQLAlchemy, Psycopg2 and PyMySQL, and offers abstracted functions to execute usual ETL tasks like load/unload data from Data Lakes, Data Warehouses and Databases. Note that you would need s3fs version > 0.4.0 for the awswrangler csv reader to work.

[ ]:
data_wr_location = "s3://{}/{}/{}".format(bucket, prefix, filename)  # S3 URL
wr_data = wr.s3.read_csv(path=data_wr_location, nrows=5)
wr_data.head()