Ingest Text Data


Labeled text data can come in a structured format, such as reviews for sentiment analysis, news headlines for topic modeling, or documents for text classification. In these cases, you may have one column for the label, one column for the text, and sometimes other columns for attributes, and you can treat the data like tabular data. Raw text data, however, often comes as unstructured data, typically in .json or .txt format. This section discusses how to ingest these types of data files into a SageMaker notebook.

Set Up Notebook

[ ]:

%pip install -q 's3fs==0.4.2'

[ ]:

import pandas as pd
import json
import glob
import s3fs
import sagemaker

[ ]:

# Get SageMaker session & default S3 bucket
sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()  # replace with your own bucket if you have one
s3 = sagemaker_session.boto_session.resource("s3")

prefix = "text_spam/spam"
prefix_json = "json_jeo"
filename = "SMSSpamCollection.txt"
filename_json = "JEOPARDY_QUESTIONS1.json"

Downloading data from Online Sources

Text data (in structured .csv format): Twitter – sentiment140

Sentiment140: This is the sentiment140 dataset. It contains 1.6M tweets extracted using the Twitter API. The tweets have been annotated with sentiment (0 = negative, 4 = positive) and topics (hashtags used to retrieve tweets). The dataset contains the following columns:

* target: the polarity of the tweet (0 = negative, 4 = positive)
* ids: the id of the tweet (2087)
* date: the date of the tweet (Sat May 16 23:58:44 UTC 2009)
* flag: the query (lyx). If there is no query, then this value is NO_QUERY.
* user: the user that tweeted (robotickilldozr)
* text: the text of the tweet (Lyx is cool)

The second Twitter data set was collected as an extension to the Sanders Analytics Twitter sentiment corpus, which was originally designed for training and testing Twitter sentiment analysis algorithms. We will use it to showcase how to aggregate two data sets when you want to enhance your current data set with more data.

[ ]:

# helper functions to upload data to s3
def write_to_s3(filename, bucket, prefix):
    # put one file in a separate folder. This is helpful if you read and prepare data with Athena
    key = "{}/{}".format(prefix, filename)
    return s3.Bucket(bucket).upload_file(filename, key)


def upload_to_s3(bucket, prefix, filename):
    url = "s3://{}/{}/{}".format(bucket, prefix, filename)
    print("Writing to {}".format(url))
    write_to_s3(filename, bucket, prefix)
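
The helpers above build the object key by joining the prefix and the local file name, so a local path that contains a folder keeps that folder in the key. The following is a minimal sketch of that composition only, with no upload involved. The function names build_s3_key and build_s3_url, the bucket name, and the file name are hypothetical and used for illustration.

```python
def build_s3_key(prefix, filename):
    """Join prefix and filename the same way write_to_s3 does."""
    return "{}/{}".format(prefix, filename)


def build_s3_url(bucket, prefix, filename):
    """Return the s3:// URL that upload_to_s3 prints for a given upload."""
    return "s3://{}/{}".format(bucket, build_s3_key(prefix, filename))


# "example-bucket" and "train.csv" are placeholders, not real resources.
key = build_s3_key("text_sentiment140", "sentiment140/train.csv")
url = build_s3_url("example-bucket", "text_sentiment140", "sentiment140/train.csv")
print(key)
print(url)
```

Because files found with glob keep their folder name, a file uploaded from the sentiment140 folder with the prefix text_sentiment140 ends up under text_sentiment140/sentiment140/, which is the prefix used later when the data is read back.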

[ ]:

# run this cell if you are in SageMaker Studio notebook
#!apt-get install unzip

[ ]:

# download first twitter dataset
!wget http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip -O sentiment140.zip
# uncompress into the sentiment140 folder
!unzip -o sentiment140.zip -d sentiment140

[ ]:

# upload the files to the S3 bucket
csv_files = glob.glob("sentiment140/*.csv")
for filename in csv_files:
    upload_to_s3(bucket, "text_sentiment140", filename)

[ ]:

# download second twitter dataset
!wget https://raw.githubusercontent.com/zfz/twitter_corpus/master/full-corpus.csv

[ ]:

filename = "full-corpus.csv"
upload_to_s3(bucket, "text_twitter_sentiment_2", filename)

Text data (in .txt format): SMS Spam data

SMS Spam Data was manually extracted from the Grumbletext Web site. This is a UK forum in which cell phone users make public claims about SMS spam messages, most of them without reporting the very spam message received. Each line in the text file has the correct class followed by the raw message. We will use this data to showcase how to ingest text data in .txt format.

[ ]:

!wget http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/smsspamcollection.zip -O spam.zip
!unzip -o spam.zip -d spam

[ ]:

txt_files = glob.glob("spam/*.txt")
for filename in txt_files:
    upload_to_s3(bucket, "text_spam", filename)

Text Data (in .json format): Jeopardy Question data

The Jeopardy question data was obtained by crawling the Jeopardy question archive website. It is an unordered list of questions, where each question has the following key-value pairs:

category: the question category, e.g. “HISTORY”
value: the dollar value of the question as a string, e.g. “$200”
question: the text of the question
answer: the text of the answer
round: one of “Jeopardy!”, “Double Jeopardy!”, “Final Jeopardy!” or “Tiebreaker”
show_number: the show number as a string, e.g. “4680”
air_date: the show air date in the format YYYY-MM-DD
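
Since several of these fields are stored as strings, it can be useful to convert them to typed values after loading. The sketch below uses a made-up record that follows the layout described above. The helper name parse_record and the choice to keep a missing dollar value as None are assumptions for illustration, not part of the data set's documentation.

```python
import json
from datetime import date

# A made-up record that follows the key-value layout described above.
sample = """
[{"category": "HISTORY", "value": "$200", "question": "'An example clue'",
  "answer": "An example answer", "round": "Jeopardy!",
  "show_number": "4680", "air_date": "2004-12-31"}]
"""


def parse_record(record):
    """Convert the string fields of one question into typed values."""
    raw_value = record.get("value")
    # Assumption: a missing dollar value is kept as None instead of 0.
    value = int(raw_value.replace("$", "").replace(",", "")) if raw_value else None
    return {
        "category": record["category"],
        "value": value,
        "round": record["round"],
        "show_number": int(record["show_number"]),
        "air_date": date.fromisoformat(record["air_date"]),
    }


parsed = [parse_record(r) for r in json.loads(sample)]
print(parsed[0])
```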

[ ]:

# download the Jeopardy questions JSON file
! wget 'https://docs.google.com/uc?export=download&id=0BwT5wj_P7BKXb2hfM3d2RHU1ckE' -O JEOPARDY_QUESTIONS1.json
# the file is not compressed, so upload it to S3 as-is
filename = "JEOPARDY_QUESTIONS1.json"
upload_to_s3(bucket, "json_jeo", filename)

Ingest Data into SageMaker Notebook

Method 1: Copying data to the Instance

You can use the AWS Command Line Interface (CLI) to copy your data from S3 to your SageMaker instance. This is a quick and easy approach when you are dealing with medium-sized data files, or when you are experimenting and doing exploratory analysis. See the AWS CLI documentation for the s3 cp command for details.

[ ]:

# Specify file names
prefix = "text_spam/spam"
prefix_json = "json_jeo"
filename = "SMSSpamCollection.txt"
filename_json = "JEOPARDY_QUESTIONS1.json"
prefix_spam_2 = "text_spam/spam_2"

[ ]:

# copy data to your sagemaker instance using AWS CLI
!aws s3 cp s3://$bucket/$prefix_json/ text/$prefix_json/ --recursive

[ ]:

data_location = "text/{}/{}".format(prefix_json, filename_json)
with open(data_location) as f:
    data = json.load(f)
    print(data[0])
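
If you prefer to stay in Python rather than call the CLI, a similar copy can be sketched with the boto3 resource API. The function below is a hypothetical helper, not part of the notebook. It assumes its first argument behaves like a boto3 Bucket resource, such as s3.Bucket(bucket) built from the s3 object created during setup, and it relies only on objects.filter(Prefix=...) and download_file.

```python
import os


def download_prefix(bucket_resource, prefix, dest_dir):
    """Download every object under `prefix` into `dest_dir`, keeping the key layout.

    `bucket_resource` is expected to behave like a boto3 `s3.Bucket` resource.
    """
    downloaded = []
    for obj in bucket_resource.objects.filter(Prefix=prefix):
        if obj.key.endswith("/"):
            continue  # skip zero-byte "folder" placeholder objects
        local_path = os.path.join(dest_dir, obj.key)
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        bucket_resource.download_file(obj.key, local_path)
        downloaded.append(local_path)
    return downloaded


# Hypothetical usage inside the notebook:
# download_prefix(s3.Bucket(bucket), prefix_json, "text")
```

With dest_dir set to text, the files land under text/json_jeo/, the same location the CLI command above writes to.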

Method 2: Use AWS compatible Python Packages

When you are dealing with large data sets, or do not want to lose any data when you delete your SageMaker notebook instance, you can use pre-built packages to access your files in S3 without copying them to the instance. Packages such as Pandas accept a path string that identifies where the data lives: where you would use file:// for the local file system, you use s3:// to access data in S3. For s3:// paths, Pandas relies on the s3fs package installed during setup, which builds on botocore. For Pandas, any valid string path is acceptable, and the string can be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. See the Pandas I/O documentation for additional details.

For text data, most of the time you can read it as line-by-line files or use Pandas to read it as a DataFrame by specifying a delimiter.

[ ]:

data_s3_location = "s3://{}/{}/{}".format(bucket, prefix, filename)  # S3 URL
s3_tabular_data = pd.read_csv(data_s3_location, sep="\t", header=None)
s3_tabular_data.head()
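
The line-by-line alternative mentioned above can be sketched with the standard library alone. The example assumes each line has the form label, tab, message, as in the SMS Spam file. The function name parse_labeled_lines and the two sample lines are made up for illustration.

```python
import io


def parse_labeled_lines(lines):
    """Split each 'label<TAB>message' line into a (label, message) tuple."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        # Split on the first tab only, so tabs inside the message are preserved.
        label, message = line.split("\t", 1)
        rows.append((label, message))
    return rows


# Two made-up lines standing in for an opened text file.
fake_file = io.StringIO("ham\tSee you at lunch\nspam\tYou won a prize, call now\n")
rows = parse_labeled_lines(fake_file)
print(rows)
```

The function accepts any iterable of text lines, so a file object returned by open() on a local copy of the data works the same way as the in-memory example.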

For JSON files, depending on their structure, you can also use the Pandas read_json function if the file contains flat JSON.

[ ]:

data_json_location = "s3://{}/{}/{}".format(bucket, prefix_json, filename_json)
s3_tabular_data_json = pd.read_json(data_json_location, orient="records")
s3_tabular_data_json.head()
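
When the records are nested rather than flat, one option is to load them with the json module and flatten them with pandas.json_normalize. The records below are made up, and the nested user field is a hypothetical structure that does not appear in the Jeopardy data. This is a sketch of the flattening step only.

```python
import pandas as pd

# Made-up nested records; the "user" field is a hypothetical nested object.
records = [
    {"id": 1, "text": "first message", "user": {"name": "alice", "lang": "en"}},
    {"id": 2, "text": "second message", "user": {"name": "bob", "lang": "fr"}},
]

# json_normalize expands nested dictionaries into dotted column names.
flat = pd.json_normalize(records)
print(flat.columns.tolist())
```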

Method 3: Use AWS Native methods

S3Fs is a Pythonic file interface to S3. It builds on top of botocore. The top-level class S3FileSystem holds connection information and allows typical file-system style operations like cp, mv, ls, du, glob, etc., as well as put/get of local files to/from S3.

[ ]:

fs = s3fs.S3FileSystem()
data_s3fs_location = "s3://{}/{}/".format(bucket, prefix)
# list all files under the prefix in your bucket
fs.ls(data_s3fs_location)

[ ]:

# open it directly with s3fs
data_s3fs_location = "s3://{}/{}/{}".format(bucket, prefix, filename)  # S3 URL
with fs.open(data_s3fs_location) as f:
    # the file has no header row, so header=None keeps the first message as data
    print(pd.read_csv(f, sep="\t", header=None, nrows=2))

Aggregating datasets

If you would like to enhance your data with more data collected for your use case, you can aggregate the newly collected data with your current dataset. We will use two datasets, Sentiment140 and Sanders Twitter Sentiment, to show how to aggregate data.

[ ]:

prefix_tw1 = "text_sentiment140/sentiment140"
filename_tw1 = "training.1600000.processed.noemoticon.csv"
prefix_added = "text_twitter_sentiment_2"
filename_added = "full-corpus.csv"

Let’s read in our original data and take a look at its format and schema:

[ ]:

data_s3_location_base = "s3://{}/{}/{}".format(bucket, prefix_tw1, filename_tw1)  # S3 URL
# we will showcase with a smaller subset of data for demonstration purpose
text_data = pd.read_csv(
    data_s3_location_base, header=None, encoding="ISO-8859-1", low_memory=False, nrows=10000
)
text_data.columns = ["target", "tw_id", "date", "flag", "user", "text"]

We have 6 columns: date, text, flag (the topic the tweet was queried with), tw_id (the tweet’s id), user (the user account name), and target (0 = neg, 4 = pos).

[ ]:

text_data.head(1)

Let’s read in and take a look at the data we want to add to our original data.

We will start by checking the columns of both data sets. The new data set has 5 columns: TweetDate, which maps to date; TweetText, which maps to text; Topic, which maps to flag; TweetId, which maps to tw_id; and Sentiment, which maps to target. The new data set has no user account name column, so when we aggregate the two data sets we can add this column to the data set being added and fill it with empty values. You can also remove this column from the original data if it does not provide much valuable information for your use case.

[ ]:

data_s3_location_added = "s3://{}/{}/{}".format(bucket, prefix_added, filename_added)  # S3 URL
# we will showcase with a smaller subset of data for demonstration purpose
text_data_added = pd.read_csv(
    data_s3_location_added, encoding="ISO-8859-1", low_memory=False, nrows=10000
)

[ ]:

text_data_added.head(1)

[ ]:

text_data_added["user"] = ""

[ ]:

text_data_added.columns = ["flag", "target", "tw_id", "date", "text", "user"]
text_data_added.head(1)

Note that the target column in the new data set is labeled “positive”, “negative”, “neutral”, and “irrelevant”, whereas the target in the original data set is labeled “0” and “4”. So let’s map “positive” to 4, “neutral” to 2, and “negative” to 0 in the new data set so that the two are consistent. Tweets labeled “irrelevant” are either not in English or spam. You can remove them if they are not valuable for your use case, or map them to -1. For our sentiment analysis use case we will remove them, since this text does not provide any value for predicting sentiment.

[ ]:

# remove tweets labeled as irrelevant (.copy() avoids a SettingWithCopyWarning below)
text_data_added = text_data_added[text_data_added["target"] != "irrelevant"].copy()
# convert strings to number targets
target_map = {"positive": 4, "negative": 0, "neutral": 2}
text_data_added["target"] = text_data_added["target"].map(target_map)
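
Series.map returns NaN for any label that is missing from the mapping, so a misspelled or unexpected label can slip through silently. The sketch below shows one way to guard against that on a small, made-up set of labels. The filtering with isin is an illustrative alternative and not the notebook's own method.

```python
import pandas as pd

# Made-up labels in the style of the second data set.
labels = pd.Series(["positive", "negative", "neutral", "irrelevant", "positive"])

target_map = {"positive": 4, "negative": 0, "neutral": 2}

# Keep only rows whose label has a numeric equivalent, then map them.
known = labels[labels.isin(list(target_map))]
mapped = known.map(target_map)

# Any label missing from target_map would show up as NaN after map().
assert mapped.notna().all()
print(mapped.tolist())
```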

[ ]:

text_data_new = pd.concat([text_data, text_data_added])
filename = "sentiment_full.csv"
text_data_new.to_csv(filename, index=False)
upload_to_s3(bucket, "text_twitter_sentiment_full", filename)
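
pd.concat aligns columns by name rather than by position, which is why the two frames can be combined even though their column order differs. The following sketch uses two tiny, made-up frames to show that behavior, including what happens to a column that exists in only one of them. The ignore_index=True argument is an optional addition that renumbers the rows.

```python
import pandas as pd

original = pd.DataFrame({"target": [0], "user": ["someone"], "text": ["first tweet"]})
# The added frame has a different column order and no "user" column.
added = pd.DataFrame({"text": ["second tweet"], "target": [4]})

# Columns are matched by name; values missing from one frame become NaN.
combined = pd.concat([original, added], ignore_index=True)
print(combined)
```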

Citation

Sentiment140 data: Go, A., Bhayani, R. and Huang, L., 2009. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1(2009), p.12.

SMS Spam data: Almeida, T.A., Gómez Hidalgo, J.M., Yamakami, A. Contributions to the Study of SMS Spam Filtering: New Collection and Results. Proceedings of the 2011 ACM Symposium on Document Engineering (DOCENG’11), Mountain View, CA, USA, 2011.

J! Archive, J! Archive is created by fans, for fans. The Jeopardy! game show and all elements thereof, including but not limited to copyright and trademark thereto, are the property of Jeopardy Productions, Inc. and are protected under law. This website is not affiliated with, sponsored by, or operated by Jeopardy Productions, Inc.
