Preprocessing Text Data

This notebook demonstrates how to preprocess text data for downstream feature engineering and for training a machine learning model with Amazon SageMaker. The focus here is on preprocessing. We discuss many possible ways to clean and enrich your text, but you do not need to run every step below. As a rule of thumb: if you are dealing with very noisy text, such as social media posts or nurse notes, medium to heavy preprocessing may be needed, and if the corpus is domain-specific, text enrichment is helpful as well. If you are dealing with long, well-written documents such as news articles and papers, very light preprocessing is enough, and you can add some enrichment to better capture sentence-to-sentence relationships and overall meaning.


Input Format

Labeled text data sometimes comes in a structured format. You might come across this when working on reviews for sentiment analysis, news headlines for topic modeling, or documents for text classification. One column of the dataset could be dedicated to the label, one column to the text, and sometimes other columns to attributes. You can process this dataset format much as you would process tabular data (see Preprocessing Tabular Data for an example). Sometimes text data, especially raw text data, comes as unstructured data, often in .json or .txt format. To work with this type of data, you first need to extract the useful information from the original dataset.
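
As a minimal sketch of that extraction step, assume each line of a raw file is a JSON record with a nested `review` object (holding a `body`) and a `stars` rating. These field names and records are hypothetical and do not come from this notebook's dataset. The snippet keeps only the text and the label, so the result can be handled like a two-column table.

```python
import json

# Hypothetical raw records, one JSON object per line.
raw_lines = [
    '{"id": 1, "review": {"title": "Great", "body": "Loved it!"}, "stars": 5}',
    '{"id": 2, "review": {"title": "Meh", "body": "Not for me."}, "stars": 2}',
]

def extract_text_and_label(line):
    """Return (text, label) from one JSON record; the field names are assumptions."""
    record = json.loads(line)
    return record["review"]["body"], record["stars"]

rows = [extract_text_and_label(line) for line in raw_lines]
print(rows)
```

The list of tuples can then be loaded into a table with one text column and one label column.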

Use Cases

Text data contains rich information and it’s everywhere. Applicable use cases include Voice of Customer (VOC), fraud detection, warranty analysis, chatbot and customer service routing, audience analysis, and much more.

What’s the difference between preprocessing and feature engineering for text data?

In the preprocessing stage, you clean the text and transform it from human language into a standard, machine-analyzable format for further processing. In feature engineering, you extract predictive factors (features) from the text. For example, for a task that matches equivalent question pairs, the features you can extract include word overlap, cosine similarity, inter-word relationships, parse tree structure similarity, and TF-IDF (term frequency-inverse document frequency) scores. For some models, such as topic models, the word embeddings themselves can also be features.
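
To make two of these features concrete, here is a small sketch that computes word overlap (as Jaccard overlap, one possible definition) and cosine similarity on bag-of-words counts for a question pair. The function names and the example questions are illustrative assumptions, and the inputs are assumed to be already cleaned and tokenized.

```python
import math
from collections import Counter

def word_overlap(tokens_a, tokens_b):
    """Jaccard overlap: shared unique words divided by all unique words."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between bag-of-words count vectors."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

q1 = "how do i learn python".split()
q2 = "how can i learn python fast".split()
overlap = word_overlap(q1, q2)
cosine = cosine_similarity(q1, q2)
print(round(overlap, 3), round(cosine, 3))
```

Both scores depend directly on how the text was cleaned: differences in case, punctuation, or spelling would lower them even for equivalent questions.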

When is my text data ready for feature engineering?

When the data is ready to be vectorized and fits your specific use case.

Set Up Notebook

There are several Python packages designed specifically for natural language processing (NLP) tasks. In this notebook, you will use the following packages:

  • nltk (Natural Language Toolkit), a leading platform that includes multiple text processing libraries and covers almost all aspects of preprocessing discussed in this section: tokenization, stemming, lemmatization, parsing, chunking, POS tagging, stop words, etc.

  • SpaCy, which offers most of the functionality provided by nltk and also provides pre-trained word vectors and models. It is scalable and designed for production usage.

  • Gensim (Generate Similar), “designed specifically for topic modeling, document indexing, and similarity retrieval with large corpora”.

  • TextBlob, which offers POS tagging, noun phrase extraction, sentiment analysis, classification, parsing, n-grams, and word inflection through a simple API for more advanced NLP tasks. It is an easy-to-use wrapper for libraries like nltk and Pattern. We will use this package for our enrichment tasks.

[ ]:
! python -m pip install --upgrade pip
! pip install -U  'sagemaker>=2.15.0' spacy gensim==4.0.0 textblob emot==2.1 autocorrect
[ ]:
import nltk
import spacy
import gensim
from textblob import TextBlob
import re
import string
import glob
import sagemaker
[ ]:
# Get SageMaker session & default S3 bucket
sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()  # replace with your own bucket if you have one
s3 = sagemaker_session.boto_session.resource("s3")

prefix = "text_sentiment140/sentiment140"
filename = "training.1600000.processed.noemoticon.csv"

Downloading data from Online Sources

Text Data Sets: Twitter – sentiment140

The Sentiment140 dataset contains 1.6M tweets that were extracted using the Twitter API. The tweets have been annotated with sentiment (0 = negative, 4 = positive) and topics (hashtags used to retrieve tweets). The dataset contains the following columns:

  • target: the polarity of the tweet (0 = negative, 4 = positive)

  • ids: the id of the tweet (2087)

  • date: the date of the tweet (Sat May 16 23:58:44 UTC 2009)

  • flag: the query (lyx). If there is no query, then this value is NO_QUERY.

  • user: the user that tweeted (robotickilldozr)

  • text: the text of the tweet (Lyx is cool)

[ ]:
# helper functions to upload data to s3
def write_to_s3(filename, bucket, prefix):
    # store the file under <prefix>/<filename> so that the key matches the URL
    # printed by upload_to_s3 and the key used to download the file later
    key = "{}/{}".format(prefix, filename)
    return s3.Bucket(bucket).upload_file(filename, key)

def upload_to_s3(bucket, prefix, filename):
    url = "s3://{}/{}/{}".format(bucket, prefix, filename)
    print("Writing to {}".format(url))
    write_to_s3(filename, bucket, prefix)
[ ]:
# run this cell if you are in SageMaker Studio notebook
#!apt-get install unzip
[ ]:
!wget -O
# Uncompressing
!unzip -o -d sentiment140
[ ]:
# upload the files to the S3 bucket
csv_files = glob.glob("sentiment140/*.csv")
for filename in csv_files:
    upload_to_s3(bucket, "text_sentiment140", filename)

Read in Data

We will read the data in as .csv format since the text is embedded in a structured table.

Note: A frequent error when reading in text data is an encoding error. If UTF-8 encoding does not work, you can try different encoding options with pandas read_csv; see the Python encoding documentation for more encodings you may encounter.
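
The idea of trying encodings in turn can be sketched with plain Python bytes. The helper name `decode_with_fallback` and the candidate list are assumptions for illustration; the same candidate names can be passed to the `encoding` argument of pandas read_csv.

```python
def decode_with_fallback(raw_bytes, encodings=("utf-8", "ISO-8859-1")):
    """Try each encoding in order and return (text, encoding_used)."""
    for encoding in encodings:
        try:
            return raw_bytes.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")

# "café" encoded as Latin-1: the lone byte 0xE9 is not valid UTF-8.
sample = "café".encode("ISO-8859-1")
text, used = decode_with_fallback(sample)
print(text, used)
```

ISO-8859-1 can decode any byte sequence, so it should come last in the list. A successful decode with it does not prove the file was really written in that encoding, so it is still worth inspecting a few rows.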

[ ]:
import pandas as pd
import boto3

prefix = "text_sentiment140/sentiment140"
filename = "training.1600000.processed.noemoticon.csv"
s3.Bucket(bucket).download_file(prefix + "/" + filename, filename)
# we will showcase with a smaller subset of data for demonstration purpose
text_data = pd.read_csv(filename, header=None, encoding="ISO-8859-1", low_memory=False, nrows=10000)
text_data.columns = ["target", "tw_id", "date", "flag", "user", "text"]

Examine Your Text Data

Here you will explore common methods and steps for text preprocessing. Text preprocessing is highly specific to each individual corpus and different tasks, so it is important to examine your text data first and decide what steps are necessary.

First, look at your text data. It contains whitespace to trim, URLs, smiley faces, numbers, abbreviations, misspellings, names, etc. Tweets are shorter than 140 characters, so there is less need for document segmentation and for modeling dependencies between sentences.
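
Besides reading a few rows by eye, a rough count of how often each kind of noise occurs can help you decide which steps are worth the effort. The sketch below uses three made-up tweets and deliberately simple patterns; all names in it are illustrative assumptions.

```python
import re

# Hypothetical sample tweets; in the notebook you would use text_data["text"].
sample_tweets = [
    "@alice check this out http://example.com :)",
    "so tired   today... 2 more days till friday",
    "I luv this song 4ever",
]

# Rough patterns for some of the kinds of noise mentioned above.
patterns = {
    "url": re.compile(r"https?://\S+|www\.\S+"),
    "mention": re.compile(r"@\w+"),
    "digit": re.compile(r"\d"),
    "repeated_space": re.compile(r"\s{2,}"),
}

# Count how many tweets contain each kind of noise at least once.
noise_counts = {
    name: sum(1 for tweet in sample_tweets if pattern.search(tweet))
    for name, pattern in patterns.items()
}
print(noise_counts)
```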

[ ]:
pd.set_option("display.max_colwidth", None)  # show full content in a column


Step 1: Noise Removal

Start by removing noise from the text data. Noise removal is very task-specific, so you will usually pick and choose from the following steps based on your needs:

  • Remove formatting (HTML, markup, metadata) – e.g. emails, web-scraped data

  • Extract text data from the full dataset – e.g. reviews, comments, labeled data from a nested JSON file or from structured data

  • Remove special characters

  • Remove emojis or convert emojis to words – e.g. reviews, tweets, Instagram and Facebook comments, SMS text with sales

  • Remove URLs – e.g. reviews, web content, emails

  • Convert accented characters to ASCII characters – e.g. tweets, content that may contain foreign languages
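
Two of these options, removing HTML formatting and converting accented characters to ASCII, are not used on the tweets in this notebook. A minimal standard-library sketch could look like the following; the function names and the sample string are assumptions, and the tag regex is intentionally simple, so a real HTML parser is a better fit for messy markup.

```python
import html
import re
import unicodedata

def strip_html(text):
    """Drop tags with a simple regex, then decode entities such as &amp;."""
    without_tags = re.sub(r"<[^>]+>", " ", text)
    return html.unescape(without_tags)

def to_ascii(text):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

raw = "<p>Caf&eacute; cr&egrave;me &amp; naïve résumé</p>"
clean = " ".join(to_ascii(strip_html(raw)).split())
print(clean)
```

Note that `to_ascii` silently drops characters that have no ASCII decomposition (emojis, for example), so convert or extract those first if they matter for your task.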

Note that preprocessing is an iterative process, so it is common to revisit any of these steps after you have cleaned and normalized your data.

Here you will look at tweets and decide how you are going to process URLs, emojis and emoticons.

Working with text often means dealing with regular expressions. If you are new to regex or want to refresh your memory, Pythex is a helpful page with a cheat sheet where you can also test your expressions.

Noise Removal - Remove URLs

[ ]:
def remove_urls(text):
    url = re.compile(r"https?://\S+|www\.\S+")
    return url.sub(r"", text)

Let’s check if our code works with one example:

[ ]:
print("Removed URL:" + remove_urls(text_data["text"][0]))

Noise Removal - Remove emoticons, or convert emoticons to words

[ ]:
from emot.emo_unicode import UNICODE_EMO, EMOTICONS
[ ]:
def remove_emoticons(text):
    """
    This function takes strings containing emoticons and returns strings with emoticons removed.
    Input(string): one tweet, contains emoticons
    Output(string): one tweet, emoticons removed, everything else unchanged
    """
    emoticon = re.compile("(" + "|".join(k for k in EMOTICONS) + ")")
    return emoticon.sub(r"", text)
[ ]:
def convert_emoticons(text):
    """
    This function takes strings containing emoticons and converts the emoticons to words that describe the emoticon.
    Input(string): one tweet, contains emoticons
    Output(string): one tweet, emoticons replaced with words describing the emoticon
    """
    for emot in EMOTICONS:
        text = re.sub("(" + emot + ")", " ".join(EMOTICONS[emot].replace(",", "").split()), text)
    return text

Let’s check the results with one example and decide if we should keep the emoticon:

[ ]:
print("original text: " + text_data["text"][0])
print("removed emoticons: " + remove_emoticons(text_data["text"][0]))
print("converted emoticons: " + convert_emoticons(text_data["text"][0]))

Assuming our task is sentiment analysis, converting emoticons to words will be helpful. We will apply our remove_urls and convert_emoticons functions to the full dataset:

[ ]:
text_data["cleaned_text"] = text_data["text"].apply(remove_urls).apply(convert_emoticons)
[ ]:
text_data[["text", "cleaned_text"]][:1]

Step 2: Normalization

In the next step, we will further process the text so that all words are put on a level playing field: all words should be in the same case, numbers should also be treated as strings, abbreviations and chat words should be recognizable and replaced with the full words, etc. This is important because we do not want two elements of our word list (dictionary) that have the same meaning, such as “3” and “three”, “Our” and “our”, or “urs” and “yours”, to be treated by the machine as two unrelated words. When we eventually convert all words to numbers (vectors), such duplicates become noise for our model. This process often includes the following steps:

Step 2.1 General Normalization

  • Convert all text to the same case

  • Remove punctuation

  • Convert numbers to words or remove numbers, depending on your task

  • Remove white spaces

  • Convert abbreviations/slang/chat words to words

  • Remove stop words (task-specific and general English words); you can also create your own list of stop words

  • Remove rare words

  • Spelling correction

Note: some normalization processes are better to perform at sentence and document level, and some processes are word-level and should happen after tokenization and segmentation, which we will cover right after normalization.

Here you will convert the text to lower case, remove punctuation, remove numbers, remove white spaces, and complete the other word-level processing steps after tokenizing the sentences.

Converting text to a single case is usually a must for all language preprocessing. “Word” and “word” would otherwise be considered two different elements in the word representation, and we want words that have the same meaning to be represented by the same numbers (vectors).

[ ]:
text_data["text_lower"] = text_data["cleaned_text"].str.lower()
text_data[["cleaned_text", "text_lower"]][:1]

Depending on your use case, you can either remove numbers or convert numbers into strings. If numbers are not important in your task (e.g. sentiment analysis), you can remove them. In other cases numbers are useful (e.g. dates), and you can tag them differently. In most pre-trained embeddings, numbers are treated as strings.
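
For the tagging option, one possible approach is to replace every number with a placeholder token so the model still sees that a number was present. The pattern, the function name `tag_numbers`, and the placeholder `numtoken` are assumptions for illustration. The placeholder is plain lowercase letters so that later punctuation removal and lowercasing leave it intact.

```python
import re

# Digits, optionally grouped with "." or "," separators (e.g. 1,200 or 3.5).
NUMBER_PATTERN = re.compile(r"\d+(?:[.,]\d+)*")

def tag_numbers(text, placeholder="numtoken"):
    """Replace each number in the text with a placeholder token."""
    return NUMBER_PATTERN.sub(placeholder, text)

tagged = tag_numbers("paid 1,200 dollars on 5/16 for 2 tickets")
print(tagged)
```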

In this example, we are using Twitter data (tweets) and typically, numbers are not that important for understanding the meaning or content of a tweet. Therefore, we will remove the numbers.

[ ]:
def remove_numbers(text):
    """
    This function takes strings containing numbers and returns strings with numbers removed.
    Input(string): one tweet, contains numbers
    Output(string): one tweet, numbers removed
    """
    return re.sub(r"\d+", "", text)
[ ]:
# let's check the results of our function
[ ]:
text_data["normalized_text"] = text_data["text_lower"].apply(remove_numbers)

We can remove the mentions in the tweets, but if our task is to monitor VOC, it is helpful to extract the mentions data.

[ ]:
def remove_mentions(text):
    """
    This function takes strings containing mentions and returns strings with
    mentions (@ and the account name) removed.
    Input(string): one tweet, contains mentions
    Output(string): one tweet, mentions (@ and the account name mentioned) removed
    """
    mentions = re.compile(r"@\w+ ?")
    return mentions.sub(r"", text)
[ ]:
print("original text: " + text_data["text_lower"][0])
print("removed mentions: " + remove_mentions(text_data["text_lower"][0]))
[ ]:
def extract_mentions(text):
    """
    This function takes strings containing mentions and returns strings with
    mentions (@ and the account name) extracted into a different element,
    and removes the mentions in the original sentence.
    Input(string): one sentence, contains mentions
    Output:
    one tweet (string): mentions (@ and the account name mentioned) removed
    mentions (list): (only the account name mentioned) extracted
    """
    mentions = [i[1:] for i in text.split() if i.startswith("@")]
    sentence = re.compile(r"@\w+ ?").sub(r"", text)
    return sentence, mentions
[ ]:
text_data["normalized_text"], text_data["mentions"] = zip(
    *text_data["normalized_text"].apply(extract_mentions)
)
[ ]:
text_data[["text", "normalized_text", "mentions"]].head(1)

We will use string.punctuation in Python to remove punctuation. It contains the following punctuation symbols: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~. You can add or remove symbols as needed.
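
If you want to adjust the list, one option is to build it from string.punctuation minus the symbols you want to keep. The sketch below assumes, purely for illustration, that hashtags and apostrophes should survive; `keep`, `custom_punc_list` and `remove_custom_punctuation` are hypothetical names and not part of this notebook.

```python
import string

# Assumption for illustration: keep "#" (hashtags) and "'" (contractions).
keep = {"#", "'"}
custom_punc_list = "".join(ch for ch in string.punctuation if ch not in keep)

def remove_custom_punctuation(text):
    """Delete every character in custom_punc_list and leave the rest unchanged."""
    return text.translate(str.maketrans("", "", custom_punc_list))

example = remove_custom_punctuation("can't wait!!! #friday, finally...")
print(example)
```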

[ ]:
punc_list = string.punctuation  # you can self define list of punctuation to remove here

def remove_punctuation(text):
    """
    This function takes strings containing self defined punctuations and returns
    strings with punctuations removed.
    Input(string): one tweet, contains punctuations in the self-defined list
    Output(string): one tweet, self-defined punctuations removed
    """
    translator = str.maketrans("", "", punc_list)
    return text.translate(translator)
[ ]:
[ ]:
text_data["normalized_text"] = text_data["normalized_text"].apply(remove_punctuation)

You can also use trim functions to trim whitespaces from left and right or in the middle. Here we will just simply utilize the split function to extract all words from our text since we already removed all special characters, and combine them with a single whitespace.

[ ]:
def remove_whitespace(text):
    """
    This function takes strings containing extra whitespaces and returns strings with
    leading, trailing and repeated whitespaces removed.
    Input(string): one tweet, contains whitespaces
    Output(string): one tweet, words separated by single spaces
    """
    return " ".join(text.split())
[ ]:
print("original text: " + text_data["normalized_text"][2])
print("removed whitespaces: " + remove_whitespace(text_data["normalized_text"][2]))
[ ]:
text_data["normalized_text"] = text_data["normalized_text"].apply(remove_whitespace)

Step 3: Tokenization and Segmentation

After we have extracted useful text data from the full dataset, we split large chunks of text (documents) into sentences, and sentences into words. Most of the time we use sentence-ending punctuation to split documents into sentences, but this can be ambiguous, especially when we are dealing with character conversations (“Are you alright?” said Ron), abbreviations (Dr. Fay would like to see Mr. Smith now.) and other special cases. There are Python libraries designed for this task (check textsplit), but you can take your own approach depending on your context.
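
The abbreviation problem can be illustrated with a small sketch: a naive split on sentence-ending punctuation breaks after "Dr." and "Mr.", while a second pass glues those pieces back together. The abbreviation list and function names are assumptions. This is one simple do-it-yourself approach, not what textsplit or nltk do internally.

```python
import re

# Small, hypothetical abbreviation list; extend it for your own corpus.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "prof.", "e.g.", "i.e."}

def naive_split(text):
    """Split after ., ! or ? when followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

def split_sentences(text):
    """Naive split, then glue back pieces that ended on a known abbreviation."""
    sentences = []
    carry_over = ""
    for piece in naive_split(text):
        if not piece:
            continue
        candidate = f"{carry_over} {piece}".strip()
        if candidate.split()[-1].lower() in ABBREVIATIONS:
            carry_over = candidate  # not a real sentence end, keep accumulating
        else:
            sentences.append(candidate)
            carry_over = ""
    if carry_over:
        sentences.append(carry_over)
    return sentences

text = "Dr. Fay would like to see Mr. Smith now. Please wait here."
print(naive_split(text))
print(split_sentences(text))
```

The quoted-speech case ("Are you alright?" said Ron) is not handled by this sketch and would need extra rules or a trained sentence tokenizer.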

For Twitter data we are only dealing with texts shorter than 140 characters, so we will just tokenize them into words. Sentence-level normalization should happen before word tokenization, which is why the tweets were normalized first and are tokenized into words only now.

Tokenizing Sentences into Words

[ ]:
nltk.download("punkt")
[ ]:
from nltk.tokenize import word_tokenize

def tokenize_sent(text):
    """
    This function takes strings (a tweet) and returns tokenized words.
    Input(string): one tweet
    Output(list): list of words tokenized from the tweet
    """
    word_tokens = word_tokenize(text)
    return word_tokens
[ ]:
text_data["tokenized_text"] = text_data["normalized_text"].apply(tokenize_sent)
[ ]:
text_data[["normalized_text", "tokenized_text"]][:1]

Continuing Word-level Normalization

Remove Stop Words

Stop words are common words that do not contribute to the meaning of a sentence, such as ‘the’, ‘a’, ‘his’. Most of the time we can remove these words without harming further analysis, but if you want to apply Part-of-Speech (POS) tagging later, be careful with what you remove in this step, as these words can provide valuable information. You can also add stop words to the list based on your use case.

[ ]:
nltk.download("stopwords")
from nltk.corpus import stopwords
[ ]:
stopwords_list = set(stopwords.words("english"))

One way to add words to your stopwords list is to check for most frequent words, especially if you are working with a domain-specific corpus and those words sometimes are not covered by general English stop words. You can also remove rare words from your text data.

Let’s check for the most common words in our data. All the words we see in the following example are covered in general English stop words, so we will not add any additional stop words.

[ ]:
from collections import Counter

counter = Counter()
for word in [w for sent in text_data["tokenized_text"] for w in sent]:
    counter[word] += 1

Let’s check for the rarest words now. In this example, infrequently used words mostly consist of misspelled words, which we will later correct, but we can add them to our stop words list as well.

[ ]:
# least frequent words
[ ]:
top_n = 10
bottom_n = 10
stopwords_list |= set([word for (word, count) in counter.most_common(top_n)])
# take exactly bottom_n of the least frequent words
stopwords_list |= set([word for (word, count) in counter.most_common()[: -bottom_n - 1 : -1]])
stopwords_list |= {"thats"}

def remove_stopwords(tokenized_text):
    """
    This function takes a list of tokenized words from a tweet, removes self-defined stop words from the list,
    and returns the list of words with stop words removed
    Input(list): a list of tokenized words from a tweet, contains stop words
    Output(list): a list of words with stop words removed
    """
    filtered_text = [word for word in tokenized_text if word not in stopwords_list]
    return filtered_text
[ ]:
[ ]:
text_data["tokenized_text"] = text_data["tokenized_text"].apply(remove_stopwords)

Convert Abbreviations, slangs and chat words into words

Sometimes you will need to develop your own mapping for abbreviations/slangs <-> words, for chat data, or for domain-specific data where abbreviations often have different meanings from what is commonly used.

[ ]:
chat_words_map = {
    "idk": "i do not know",
    "btw": "by the way",
    "imo": "in my opinion",
    "u": "you",
    "oic": "oh i see",
}
chat_words_list = set(chat_words_map)
[ ]:
def translator(text):
    """
    This function takes a list of tokenized words, finds the chat words in the self-defined chat words list,
    and replaces the chat words with the mapped full expressions. It returns the list of tokenized words with
    chat words replaced.
    Input(list): a list of tokenized words from a tweet, contains chat words
    Output(list): a list of words with chat words replaced by full expressions
    """
    new_text = []
    for w in text:
        if w in chat_words_map:
            # expand the chat word into the words of its full expression
            new_text = new_text + chat_words_map[w].split()
        else:
            # keep every other word unchanged
            new_text.append(w)
    return new_text
[ ]:
[ ]:
text_data["tokenized_text"] = text_data["tokenized_text"].apply(translator)

Spelling Correction

Some common spelling correction packages include SpellChecker and autocorrect. It might take some time to spell check every sentence of the text, so you can decide if a spell check is absolutely necessary. If you are dealing with documents (news, papers, articles) generally it is not necessary; but if you are dealing with chat data, reviews, notes, it might be a good idea to spell check your text.
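
To show the basic idea behind dictionary-based correction without any extra package, the sketch below swaps in the standard library's difflib and maps each token to its closest word in a vocabulary. This is a stand-in for illustration, not how SpellChecker or autocorrect work internally; the tiny vocabulary and the function name are assumptions.

```python
from difflib import get_close_matches

# Hypothetical vocabulary; a real spell checker uses a large word-frequency list.
VOCABULARY = ["receive", "believe", "tomorrow", "definitely", "friend"]

def closest_word(word, cutoff=0.8):
    """Return the most similar vocabulary word, or the word itself if none is close."""
    matches = get_close_matches(word, VOCABULARY, n=1, cutoff=cutoff)
    return matches[0] if matches else word

tokens = ["i", "recieve", "it", "tomorow"]
corrected = [closest_word(token) for token in tokens]
print(corrected)
```

The cutoff controls how aggressive the correction is. A lower value fixes more typos but also rewrites more correct words that simply are not in the vocabulary, such as names and slang, which are common in tweets.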

[ ]:
from autocorrect import Speller
[ ]:
spell = Speller(lang="en", fast=True)

def spelling_correct(tokenized_text):
    """
    This function takes a list of tokenized words from a tweet, spell checks every word and returns the
    corrected words if applicable. Note that not every misspelled word will be identified, especially
    for tweets.
    Input(list): a list of tokenized words from a tweet, contains misspelled words
    Output(list): a list of corrected words
    """
    corrected = [spell(word) for word in tokenized_text]
    return corrected
[ ]:
[ ]:
text_data["tokenized_text"] = text_data["tokenized_text"].apply(spelling_correct)

Step 3.2 Stemming and Lemmatization

Stemming is the process of removing affixes from a word to get a word stem, and lemmatization can in principle select the appropriate lemma depending on the context. The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words that have different meanings depending on part of speech. However, stemmers are typically easier to implement and run faster, and the reduced accuracy may not matter for some applications.
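
The difference can be made concrete with a deliberately tiny example. The suffix stripper and the lookup table below are toy stand-ins written for illustration; they are not the Porter or Snowball algorithms, nor WordNet, and every name in the sketch is an assumption.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hand-written (word, part-of-speech) -> lemma table, for illustration only.
TOY_LEMMAS = {
    ("meeting", "n"): "meeting",
    ("meeting", "v"): "meet",
    ("was", "v"): "be",
}

def toy_lemmatize(word, pos):
    """Look up the lemma of a word for a given part of speech."""
    return TOY_LEMMAS.get((word, pos), word)

print(toy_stem("meeting"))            # the stemmer has no notion of context
print(toy_lemmatize("meeting", "n"))  # noun reading keeps the word
print(toy_lemmatize("meeting", "v"))  # verb reading maps to the base verb
print(toy_stem("was"), toy_lemmatize("was", "v"))
```

The stemmer always trims "meeting" the same way, while the lemmatizer returns different results for the noun and the verb and can map irregular forms such as "was" to "be", at the cost of needing a dictionary and part-of-speech information.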


There are several stemming algorithms available, and the most popular ones are Porter, Lancaster, and Snowball. Porter is the most common one, Snowball is an improvement over Porter, and Lancaster is more aggressive. You can check for more algorithms provided by nltk here.

[ ]:
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize

stemmer = SnowballStemmer("english")

def stem_text(tokenized_text):
    """
    This function takes a list of tokenized words from a tweet, and returns the stemmed words by your
    defined stemmer.
    Input(list): a list of tokenized words from a tweet
    Output(list): a list of stemmed words in their root form
    """
    stems = [stemmer.stem(word) for word in tokenized_text]
    return stems


[ ]:
nltk.download("wordnet")
[ ]:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()

def lemmatize_text(tokenized_text):
    """
    This function takes a list of tokenized words from a tweet, and returns the lemmatized words.
    You can also provide context for lemmatization, i.e. part-of-speech.
    Input(list): a list of tokenized words from a tweet
    Output(list): a list of lemmatized words in their base form
    """
    lemmas = [lemmatizer.lemmatize(word, pos="v") for word in tokenized_text]
    return lemmas

Let’s compare our stemming and lemmatization results:

Both processes return similar results apart from some verbs being trimmed differently, so it is okay to go with stemming in this case if you are dealing with a lot of data and want better performance. You can also keep both and experiment with further feature engineering and modeling to see which one produces better results.

[ ]:

It seems that a stemmer can do the work for our tweets data. You can keep both and decide which one you want to use for feature engineering and modeling.

[ ]:
text_data["stem_text"] = text_data["tokenized_text"].apply(stem_text)
text_data["lemma_text"] = text_data["tokenized_text"].apply(lemmatize_text)

Step 3.5: Re-examine the results

Take a pause here and examine the results from previous steps to decide if more noise removal/normalization is needed. In this case, you might want to add more words to the stop words list, spell-check more aggressively, or add more mappings to the abbreviation/slang to words list.

[ ]:
text_data.sample(5)[["text", "stem_text", "lemma_text"]]

Step 4: Enrichment and Augmentation

After you have cleaned and tokenized your text data into a standard form, you might want to enrich it with useful information that was not provided directly in the original text or its single-word form. For example:

  • Part-of-speech tagging

  • Extracting phrases

  • Named entity recognition

  • Dependency parsing

  • Word-level embeddings

Many Python packages support these tasks, including nltk, SpaCy, and CoreNLP; here we will use TextBlob to illustrate some enrichment methods.

Part-of-speech tagging assigns each word a tag according to its syntactic function (noun, verb, adjective, etc.).

[ ]:
nltk.download("averaged_perceptron_tagger")
[ ]:
text_example = text_data.sample()["lemma_text"]
[ ]:
from textblob import TextBlob

result = TextBlob(" ".join(text_example.values[0]))

Sometimes words come in as phrases (noun group phrases, verb group phrases, etc.) and often have discrete grammatical meanings. Extract those words as phrases rather than separate words in this case.
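
As a much simpler stand-in for grammatical phrase extraction, you can treat adjacent word pairs (bigrams) that occur repeatedly as phrase candidates. This is a frequency heuristic, not noun phrase chunking; the sample tweets, the threshold of two, and the names below are assumptions for illustration.

```python
from collections import Counter

# Hypothetical tokenized tweets.
tokenized_tweets = [
    ["new", "york", "is", "cold", "today"],
    ["flying", "to", "new", "york", "tomorrow"],
    ["ice", "cream", "in", "new", "york"],
]

def bigrams(tokens):
    """Return the adjacent word pairs of one token list."""
    return list(zip(tokens, tokens[1:]))

bigram_counts = Counter(pair for tokens in tokenized_tweets for pair in bigrams(tokens))

# Keep pairs seen at least twice as phrase candidates and join them with "_".
phrases = {"_".join(pair) for pair, count in bigram_counts.items() if count >= 2}
print(phrases)
```

Joined tokens such as new_york can then be treated as single vocabulary entries during vectorization.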

[ ]:
nltk.download("brown")
[ ]:
# original text:
text_example = text_data.sample()["lemma_text"]
" ".join(text_example.values[0])
[ ]:
# noun phrases that can be extracted from this sentence
result = TextBlob(" ".join(text_example.values[0]))
for nouns in result.noun_phrases:
    print(nouns)

You can use pre-trained/pre-defined named entity recognition (NER) models to find named entities in text and classify them into pre-defined categories. You can also train your own NER model, especially if you are dealing with a domain-specific context.

[ ]:
nltk.download("maxent_ne_chunker")
nltk.download("words")
[ ]:
text_example_enr = text_data.sample()["lemma_text"].values[0]
print("original text: " + " ".join(text_example_enr))
[ ]:
from nltk import pos_tag, ne_chunk


Final Dataset ready for feature engineering and modeling

For this notebook you cleaned and normalized the data, kept mentions as a separate column, and stemmed and lemmatized the tokenized words. You can experiment with these two results to see which one gives you a better model performance.

Twitter data is short and often does not have complex syntax structures, so no enrichment (POS tagging, parsing, etc.) was done at this time; but you can experiment with those when you have more complicated text data.

[ ]:

Save our final dataset to S3 for further processing

[ ]:
filename_write_to = "processed_sentiment_140.csv"
text_data.to_csv(filename_write_to, index=False)
upload_to_s3(bucket, "text_sentiment140_processed", filename_write_to)


Congratulations! You cleaned and prepared your text data and it is now ready to be vectorized or used for feature engineering.
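
One practical detail when you pick the data up again: columns that hold Python lists (such as the token, stem, and lemma columns) are written to CSV as the string form of each list. A minimal sketch of turning such a cell back into a list is shown below; the helper name `parse_token_list` and the sample tokens are assumptions for illustration.

```python
import ast

# A list cell written to CSV comes back as the string form of the list.
stored_cell = str(["good", "morn", "everyon"])

def parse_token_list(cell):
    """Turn "['a', 'b']" back into ['a', 'b'] without executing arbitrary code."""
    return ast.literal_eval(cell)

tokens = parse_token_list(stored_cell)
print(tokens)
```

After reading the file back with pandas, the same function can be applied to each list column before vectorizing.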

[ ]: