Music Recommender Data Exploration

Background

This notebook is the first in a series that walks through the ML lifecycle and shows how to build a Music Recommender System using a combination of SageMaker services and features. Here we focus on exploring the data. You can run this notebook by itself or in sequence with the other notebooks in the series. See the README.md for more information about the use case that this sequence of notebooks implements.

[ ]:

import sys
import pprint

sys.path.insert(1, "./code")

[ ]:

# update pandas to avoid data type issues in older 1.0 version
!pip install pandas --upgrade --quiet
import pandas as pd

print(pd.__version__)

[ ]:

# create data folder (-p avoids an error if the folder already exists)
!mkdir -p data

[ ]:

import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

import json
import sagemaker
import boto3
import os

# SageMaker session
sess = sagemaker.Session()
# get session bucket name
bucket = sess.default_bucket()
# bucket prefix or the subfolder for everything we produce
prefix = "music-recommendation"
# s3 client
s3_client = boto3.client("s3")

print(f"this is your default SageMaker Studio bucket name: {bucket}")

Prereqs: Get Data

Here we download the music data for this demo from a public S3 bucket and upload it to your default S3 bucket, which was created for you when you first set up your SageMaker Studio workspace.

[ ]:

from demo_helpers import get_data, get_model, update_data_sources

[ ]:

# public S3 bucket that contains our music data
s3_bucket_music_data = (
    f"s3://sagemaker-example-files-prod-{sess.boto_region_name}/datasets/tabular/synthetic-music"
)
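
The data locations in this notebook are written as full `s3://` URIs, while boto3 client calls such as `upload_file` take the bucket and key as separate arguments. The following is a minimal sketch of how such a URI might be split. `split_s3_uri` and the bucket name `example-bucket` are hypothetical names used for illustration and are not part of `demo_helpers`.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3://bucket/key URI into (bucket, key).

    Hypothetical helper for illustration; not part of demo_helpers.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # netloc holds the bucket name; the path holds the key with a leading slash
    return parsed.netloc, parsed.path.lstrip("/")


# "example-bucket" is a placeholder, not the real public bucket name
bucket_name, key = split_s3_uri(
    "s3://example-bucket/datasets/tabular/synthetic-music/tracks.csv"
)
print(bucket_name)
print(key)
```

The two returned values could then be passed as the `Bucket` and `Key` arguments of a boto3 S3 client call.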

[ ]:

new_data_paths = get_data(
    s3_client,
    [f"{s3_bucket_music_data}/tracks.csv", f"{s3_bucket_music_data}/ratings.csv"],
    bucket,
    prefix,
    sample_data=0.70,
)
print(new_data_paths)
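
The call above passes `sample_data=0.70`. Assuming this argument is the fraction of rows to keep (the notebook does not show the implementation of `get_data`, so the helper may behave differently), a down-sampling step could look roughly like this sketch. `sample_rows` and its `seed` parameter are hypothetical.

```python
import random


def sample_rows(rows, fraction, seed=0):
    """Return a random subset holding roughly `fraction` of the rows.

    Hypothetical illustration; the real get_data helper may sample differently.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in the interval (0, 1]")
    keep = round(len(rows) * fraction)
    # a seeded generator makes the sample reproducible between runs
    rng = random.Random(seed)
    return rng.sample(rows, keep)


all_rows = list(range(100))  # stand-in for the rows of a CSV file
subset = sample_rows(all_rows, 0.70)
print(len(subset))
```

Fixing the seed keeps the sampled subset stable across runs, which can make later notebooks in a series easier to compare.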

[ ]:

# these are the new file paths located on your SageMaker Studio default s3 storage bucket
tracks_data_source = f"s3://{bucket}/{prefix}/tracks.csv"
ratings_data_source = f"s3://{bucket}/{prefix}/ratings.csv"

Update the Data Source in the .flow File

The 01_music_dataprep.flow file is a JSON file that tells SageMaker Data Wrangler where to find your data sources and how to transform the data. We will update the object that tells Data Wrangler where to find the input data on S3 so that it points to your default S3 bucket. After this update, Data Wrangler uses your new S3 bucket as its data source.

Make sure the .flow file is closed before running the next step; otherwise the file will not be updated with the new S3 file locations.

[ ]:

update_data_sources("01_music_dataprep.flow", tracks_data_source, ratings_data_source)
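
Because the .flow file is JSON, an update like the one above amounts to loading the document, changing the S3 location on the matching source node, and writing it back. The sketch below shows the idea on an in-memory dictionary. The key names (`nodes`, `parameters`, `dataset_definition`, `s3ExecutionContext`, `s3Uri`) describe an assumed, simplified layout, and `set_source_uri` is a hypothetical function, not the `update_data_sources` helper itself. Inspect your own .flow file before relying on these names.

```python
import json

# Assumed, simplified flow layout for illustration only; a real .flow file
# has more fields and may use different key names.
example_flow = {
    "nodes": [
        {"parameters": {"dataset_definition": {
            "name": "tracks.csv",
            "s3ExecutionContext": {"s3Uri": "s3://old-bucket/tracks.csv"}}}},
        {"parameters": {"dataset_definition": {
            "name": "ratings.csv",
            "s3ExecutionContext": {"s3Uri": "s3://old-bucket/ratings.csv"}}}},
    ]
}


def set_source_uri(flow, dataset_name, new_uri):
    """Point every source node named `dataset_name` at `new_uri`.

    Returns the number of nodes that were updated.
    """
    updated = 0
    for node in flow.get("nodes", []):
        definition = node.get("parameters", {}).get("dataset_definition", {})
        if definition.get("name") == dataset_name:
            definition.setdefault("s3ExecutionContext", {})["s3Uri"] = new_uri
            updated += 1
    return updated


updated_count = set_source_uri(
    example_flow, "tracks.csv", "s3://my-bucket/music-recommendation/tracks.csv"
)
print(updated_count)
print(json.dumps(example_flow, indent=2))
```

For a file on disk, the dictionary would come from `json.load` and be written back with `json.dump`. The file should be closed in any editor first, as noted above.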

Explore the Data

[ ]:

tracks = pd.read_csv("./data/tracks.csv")
ratings = pd.read_csv("./data/ratings.csv")

[ ]:

tracks.head()

[ ]:

ratings.head()

[ ]:

print("{:,} different songs/tracks".format(tracks["trackId"].nunique()))
print("{:,} users".format(ratings["userId"].nunique()))
print("{:,} user rating events".format(ratings["ratingEventId"].nunique()))

[ ]:

tracks.groupby("genre")["genre"].count().plot.bar(title="Tracks by Genre");

Create some new data to ingest later

[ ]:

tracks_new = tracks[:300]
ratings_new = ratings[:1000]

# export dataframes to csv
tracks_new.to_csv("./data/tracks_new.csv", index=False)
ratings_new.to_csv("./data/ratings_new.csv", index=False)

[ ]:

s3_client.upload_file(
    Filename="./data/tracks_new.csv", Bucket=bucket, Key=f"{prefix}/data/tracks_new.csv"
)
s3_client.upload_file(
    Filename="./data/ratings_new.csv", Bucket=bucket, Key=f"{prefix}/data/ratings_new.csv"
)
