Ingest Image Data

When working on computer vision tasks, you may be using a common library such as OpenCV, matplotlib, or pandas. Once you move to the cloud and start your machine learning journey in Amazon SageMaker, you will encounter new challenges in loading, reading, and writing files from S3 to a SageMaker notebook; this section discusses several approaches. Because of the size of the data we are dealing with, copying the data onto the instance is not recommended, and you do not need to download data to the instance to train a model either. But if you want to look at a few samples from an image dataset and decide whether any transformation/pre-processing is needed, here are ways to do it.

Image data: COCO (Common Objects in Context)

COCO is a large-scale object detection, segmentation, and captioning dataset. COCO has several features:

  • Object segmentation

  • Recognition in context

  • Superpixel stuff segmentation

  • 330K images (>200K labeled)

  • 1.5 million object instances

  • 80 object categories

  • 91 stuff categories

  • 5 captions per image

  • 250,000 people with keypoints
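COCO ships its labels as a single JSON index with `images`, `annotations`, and `categories` arrays. A minimal sketch of that structure (the IDs, bounding box, and file name below are made up for illustration, not taken from the real annotation files):

```python
# Hypothetical miniature of the COCO annotation layout (real files are much larger)
coco_index = {
    "images": [{"id": 42, "file_name": "000000000042.jpg", "height": 480, "width": 640}],
    "annotations": [{"id": 1, "image_id": 42, "category_id": 18, "bbox": [10, 20, 200, 150]}],
    "categories": [{"id": 18, "name": "dog", "supercategory": "animal"}],
}

# Look up the category name for every annotation on a given image
categories = {c["id"]: c["name"] for c in coco_index["categories"]}
labels = [
    categories[a["category_id"]]
    for a in coco_index["annotations"]
    if a["image_id"] == 42
]
print(labels)
```

Bounding boxes in COCO are `[x, y, width, height]` in pixel coordinates; the `pycocotools` package wraps this index with convenience queries.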

Set Up Notebook

[ ]:
%pip install -qU 'sagemaker>=2.15.0' 's3fs==0.4.2'
[ ]:
import io
import boto3
import sagemaker
import glob
import tempfile

# Get SageMaker session & default S3 bucket
sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()

prefix = "image_coco/coco_val/val2017"
filename = "000000086956.jpg"

Download image data and write to S3

Note: the COCO dataset is large, so this step could take a minute or two. You can download a subset of the files by using the COCO API. If you are experimenting with the full dataset, we recommend choosing a larger storage volume when you start your notebook instance.

[ ]:
# helper functions to upload data to s3
def write_to_s3(bucket, prefix, filename):
    key = "{}/{}".format(prefix, filename)
    return boto3.Session().resource("s3").Bucket(bucket).upload_file(filename, key)
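S3 keys are plain `/`-separated strings, so `str.format` works fine; an equivalent way to build them is `posixpath.join`, which stays `/`-separated on every OS (a side note, not part of the original notebook):

```python
import posixpath

prefix = "image_coco/coco_val/val2017"
filename = "000000086956.jpg"

# posixpath.join always uses "/", matching S3 key conventions even on Windows
key = posixpath.join(prefix, filename)
print(key)  # image_coco/coco_val/val2017/000000086956.jpg
```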
[ ]:
# run this cell if you are in SageMaker Studio notebook
#!apt-get install unzip
[ ]:
!wget http://images.cocodataset.org/zips/val2017.zip -O coco_val.zip
# Uncompressing
!unzip -qU -o coco_val.zip -d coco_val
[ ]:
# Upload the files to the S3 bucket; we upload only 20 images to showcase how ingestion works.
# Note: the loop variable `filename` is reused when reading an image back later,
# so after the loop it holds a relative path like "coco_val/val2017/....jpg",
# which matches the key suffix used at upload time.
jpg_files = glob.glob("coco_val/val2017/*.jpg")
for filename in jpg_files[:20]:
    write_to_s3(bucket, prefix, filename)

Method 1: Streaming data from S3 into the SageMaker instance's memory

Use AWS compatible Python Packages with io Module

The easiest way to access your files in S3 without copying them into your instance storage is to use pre-built packages that can access data given a path string. Streaming means reading the object directly into memory instead of writing it to a file. For example, the matplotlib library has a pre-built function imread that usually takes a URL or path to an image; here we pass it the S3 object's bytes wrapped in a BytesIO buffer instead. You can also use the PIL package.
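The pattern is independent of S3: any source of raw bytes can be wrapped in `io.BytesIO` and handed to image readers that expect a file-like object. A minimal sketch using Pillow, with a tiny generated image standing in for the S3 object's body:

```python
import io
from PIL import Image

# Stand-in for image_object.get()["Body"]: produce some image bytes in memory
original = Image.new("RGB", (4, 4), color=(255, 0, 0))
buffer = io.BytesIO()
original.save(buffer, format="PNG")
buffer.seek(0)

# Same call shape as the S3 case: wrap the raw bytes and open entirely in memory
image = Image.open(io.BytesIO(buffer.read()))
print(image.size, image.mode)
```

No file ever touches disk; the decoded image lives only in instance memory, which is exactly what the S3 streaming cells below do.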

[ ]:
import matplotlib.image as mpimage
import matplotlib.pyplot as plt

key = "{}/{}".format(prefix, filename)
image_object = boto3.resource("s3").Bucket(bucket).Object(key)
image = mpimage.imread(io.BytesIO(image_object.get()["Body"].read()), "jpg")

plt.figure(0)
plt.imshow(image)
[ ]:
from PIL import Image

im = Image.open(image_object.get()["Body"])
plt.figure(0)
plt.imshow(im)

Method 2: Using temporary files on the SageMaker instance

Another way to keep using your usual methods is to create temporary files on your SageMaker instance and feed them to the standard methods as a file path. The tempfile module provides automatic cleanup: the temporary files it creates are deleted when they are closed.
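The cleanup behavior can be sketched without S3: the file backing a `NamedTemporaryFile` disappears as soon as the handle is closed (stdlib-only illustration, not part of the original notebook):

```python
import os
import tempfile

tmp = tempfile.NamedTemporaryFile()
path = tmp.name
with open(path, "wb") as f:
    f.write(b"some bytes")  # anything written here lives on disk...
exists_before = os.path.exists(path)
tmp.close()  # ...until the tempfile handle is closed
exists_after = os.path.exists(path)
print(exists_before, exists_after)  # True False
```

(On Windows the file cannot be reopened by path while the handle is held; on the Linux instances SageMaker uses, the pattern above works as shown.)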

[ ]:
tmp = tempfile.NamedTemporaryFile()
with open(tmp.name, "wb") as f:
    image_object.download_fileobj(f)
    f.seek(
        0, 2
    )  # the file is downloaded lazily, so move the file pointer to the end before reading
    img = plt.imread(tmp.name)
    print(img.shape)
    plt.imshow(img)

Method 3: Use AWS native methods

s3fs

S3Fs is a Pythonic file interface to S3. It builds on top of botocore. The top-level class S3FileSystem holds connection information and allows typical file-system style operations like cp, mv, ls, du, glob, etc., as well as put/get of local files to/from S3.

[ ]:
import s3fs

fs = s3fs.S3FileSystem()
data_s3fs_location = "s3://{}/{}/".format(bucket, prefix)
# List the first file under this prefix
fs.ls(data_s3fs_location)[0]
[ ]:
# open it directly with s3fs
data_s3fs_location = "s3://{}/{}/{}".format(bucket, prefix, filename)  # S3 URL
with fs.open(data_s3fs_location) as f:
    display(Image.open(f))

Citation

Lin, Tsung-Yi, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. "Microsoft COCO: Common Objects in Context." arXiv:1405.0312 (2014).