Music Recommender Data Exploration


Background

This notebook is the first in a series that walks through the ML lifecycle and shows how to build a Music Recommender System using a combination of SageMaker services and features. Here we focus on exploring the data. You can run this notebook by itself or in sequence with the other notebooks listed below. See the README.md for more information about the use case implemented by this sequence of notebooks.

  1. Music Recommender Data Exploration (current notebook)

  2. Music Recommender Data Preparation with SageMaker Feature Store and SageMaker Data Wrangler

  3. Train, Deploy, and Monitor the Music Recommender Model using SageMaker SDK


Contents

  1. Prereqs: Get Data

  2. Update the Data Source in the .flow File

  3. Explore the Data

[ ]:
import sys
import pprint

sys.path.insert(1, "./code")
[ ]:
# update pandas to avoid data type issues in older 1.0 version
!pip install pandas --upgrade --quiet
import pandas as pd

print(pd.__version__)
[ ]:
# create data folder (-p avoids an error if it already exists)
!mkdir -p data
[ ]:
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

import json
import sagemaker
import boto3
import os

# SageMaker session
sess = sagemaker.Session()
# get session bucket name
bucket = sess.default_bucket()
# bucket prefix or the subfolder for everything we produce
prefix = "music-recommendation"
# s3 client
s3_client = boto3.client("s3")

print(f"this is your default SageMaker Studio bucket name: {bucket}")

Prereqs: Get Data


Here we download the music data for this demo from a public S3 bucket and upload it to your default S3 bucket, which was created for you when you first set up your SageMaker Studio workspace.

[ ]:
from demo_helpers import get_data, get_model, update_data_sources
[ ]:
# public S3 bucket that contains our music data
s3_bucket_music_data = "s3://sagemaker-sample-files/datasets/tabular/synthetic-music"
[ ]:
new_data_paths = get_data(
    s3_client,
    [f"{s3_bucket_music_data}/tracks.csv", f"{s3_bucket_music_data}/ratings.csv"],
    bucket,
    prefix,
    sample_data=0.70,
)
print(new_data_paths)
[ ]:
# these are the new file paths located on your SageMaker Studio default s3 storage bucket
tracks_data_source = f"s3://{bucket}/{prefix}/tracks.csv"
ratings_data_source = f"s3://{bucket}/{prefix}/ratings.csv"
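The paths above use the `s3://bucket/key` form, while boto3 calls such as `upload_file` take the bucket and key as separate arguments. A minimal sketch of splitting such a URI is shown below; `split_s3_uri` is a hypothetical helper written for illustration and is not part of `demo_helpers`.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3://bucket/key URI into (bucket, key).

    Hypothetical helper for illustration only.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # netloc holds the bucket name; the path (minus its leading slash) is the key
    return parsed.netloc, parsed.path.lstrip("/")


example_bucket, example_key = split_s3_uri(
    "s3://sagemaker-sample-files/datasets/tabular/synthetic-music/tracks.csv"
)
print(example_bucket)
print(example_key)
```

The two returned values map directly onto the `Bucket` and `Key` arguments used by the S3 client later in this notebook.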

Update the Data Source in the .flow File


The 01_music_dataprep.flow file is a JSON file that tells SageMaker Data Wrangler where to find your data sources and how to transform the data. Here we update the entries that point to the input data on S3 so that they reference the copies in your default S3 bucket. After this update, Data Wrangler reads its source data from your bucket.

Make sure the .flow file is closed before running the next step; otherwise the new S3 file locations will not be written to the file.

[ ]:
update_data_sources("01_music_dataprep.flow", tracks_data_source, ratings_data_source)
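The internals of `update_data_sources` are not shown in this notebook. To illustrate the general idea of repointing data sources in a JSON flow definition, here is a minimal sketch that works on an in-memory dictionary. The layout it assumes (a top-level `nodes` list whose source nodes carry `parameters.dataset_definition` with a `name` and an `s3ExecutionContext.s3Uri`) is an assumption made for this example and may not match the actual .flow file; `set_source_uris` and the sample data are hypothetical.

```python
import json


def set_source_uris(flow, uri_by_name):
    """Point each named dataset in `flow` at a new S3 URI (in place).

    Assumes the illustrative layout described above. Returns the
    number of nodes that were updated.
    """
    updated = 0
    for node in flow.get("nodes", []):
        definition = node.get("parameters", {}).get("dataset_definition")
        if not definition:
            continue  # not a source node in this assumed layout
        new_uri = uri_by_name.get(definition.get("name"))
        if new_uri is not None:
            definition.setdefault("s3ExecutionContext", {})["s3Uri"] = new_uri
            updated += 1
    return updated


# made-up flow content, for illustration only
example_flow = {
    "nodes": [
        {"parameters": {"dataset_definition": {
            "name": "tracks.csv",
            "s3ExecutionContext": {"s3Uri": "s3://old-bucket/tracks.csv"}}}},
        {"parameters": {"dataset_definition": {
            "name": "ratings.csv",
            "s3ExecutionContext": {"s3Uri": "s3://old-bucket/ratings.csv"}}}},
    ]
}

n_updated = set_source_uris(
    example_flow,
    {
        "tracks.csv": "s3://my-bucket/music-recommendation/tracks.csv",
        "ratings.csv": "s3://my-bucket/music-recommendation/ratings.csv",
    },
)
print(n_updated)
print(json.dumps(example_flow, indent=2))
```

With a real file, the dictionary would be loaded with `json.load` and written back with `json.dump`, which is why the file should not be open in another editor at the same time.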

Explore the Data


[ ]:
tracks = pd.read_csv("./data/tracks.csv")
ratings = pd.read_csv("./data/ratings.csv")
[ ]:
tracks.head()
[ ]:
ratings.head()
[ ]:
print("{:,} different songs/tracks".format(tracks["trackId"].nunique()))
print("{:,} users".format(ratings["userId"].nunique()))
print("{:,} user rating events".format(ratings["ratingEventId"].nunique()))
[ ]:
tracks.groupby("genre")["genre"].count().plot.bar(title="Tracks by Genre");
[ ]:
# count rating events per user, then plot the distribution of those counts
ratings.groupby("userId")["ratingEventId"].count().plot.hist(
    bins=50, title="Distribution of # of Ratings by User"
);
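One way to read "number of ratings by user" is to count rating events per user first and then look at how those counts are distributed. The sketch below shows the counting step on a tiny made-up frame (`toy_ratings` is illustrative data, not the demo dataset); it reuses the `userId` and `ratingEventId` column names from the ratings table above.

```python
import pandas as pd

# made-up rating events: u1 rated three times, u2 twice, u3 once
toy_ratings = pd.DataFrame(
    {
        "ratingEventId": [1, 2, 3, 4, 5, 6],
        "userId": ["u1", "u1", "u1", "u2", "u2", "u3"],
    }
)

# one row per user, holding that user's number of rating events
ratings_per_user = toy_ratings.groupby("userId")["ratingEventId"].count()
print(ratings_per_user.to_dict())
```

A histogram of `ratings_per_user` then shows how many users fall into each ratings-count bucket, which is what the plot title describes.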

Create some new data to ingest later

[ ]:
tracks_new = tracks[:300]
ratings_new = ratings[:1000]

# export dataframes to csv
tracks_new.to_csv("./data/tracks_new.csv", index=False)
ratings_new.to_csv("./data/ratings_new.csv", index=False)
[ ]:
s3_client.upload_file(
    Filename="./data/tracks_new.csv", Bucket=bucket, Key=f"{prefix}/data/tracks_new.csv"
)
s3_client.upload_file(
    Filename="./data/ratings_new.csv", Bucket=bucket, Key=f"{prefix}/data/ratings_new.csv"
)