Pre-Processing Text With NLTK | Cheatsheet
NLTK (Natural Language Toolkit) is a Python library that provides tools for working with human language data. It includes modules for tasks such as tokenization, stemming, and tagging, as well as more advanced natural language processing tasks like parsing and semantic analysis. NLTK also includes a large number of data sets and resources, such as corpora, grammars, and WordNet, which can be used to train and test models or to support research in natural language processing.
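Many of NLTK's tokenizers, stopword lists, and the WordNet lemmatizer rely on data packages that are downloaded separately. A minimal one-time setup sketch, assuming the standard punkt, stopwords, and wordnet package names (exact names may vary with your NLTK version):
import nltk
# One-time download of the resources used in the examples below
nltk.download('punkt')      # tokenizer models
nltk.download('stopwords')  # stopword lists
nltk.download('wordnet')    # lexical database used by the lemmatizer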
Why Is Pre-processing Important?
Pre-processing is an important step in natural language processing because it ensures that the input data is in a format the analysis algorithms can process effectively and efficiently. Pre-processing text reduces noise and inconsistencies in the data and makes it easier for algorithms to identify patterns and relationships.
Tokenization Of Text With NLTK
Tokenization is the process of breaking down text into individual words, phrases, or other meaningful units, called tokens. The process of tokenization is the first step in many natural language processing tasks, such as text classification, sentiment analysis, and language translation.
Tokenize a sentence into words:
text = "This is an example sentence."
words = word_tokenize(text)
print(words)
Tokenize a text into sentences
text = "This is the first sentence. This is the second sentence."
sentences = sent_tokenize(text)
print(sentences)
Tokenize a text using a regular expression
from nltk.tokenize import RegexpTokenizer
text = "This is a text with numbers like 1234 and special characters like #$%^&*."
tokenizer = RegexpTokenizer(r'\w+')
words = tokenizer.tokenize(text)
print(words)
Tokenize a text using NLTK's TweetTokenizer
from nltk.tokenize import TweetTokenizer
text = "This is a tweet with #hashtags and @mentions."
tokenizer = TweetTokenizer()
words = tokenizer.tokenize(text)
print(words)
Tokenize a text using NLTK's Treebank tokenizer
from nltk.tokenize import TreebankWordTokenizer
text = "This is a text with contractions like can't and don't."
tokenizer = TreebankWordTokenizer()
words = tokenizer.tokenize(text)
print(words)
Removing Stopwords With NLTK
Stopwords are common words such as "the", "is", and "and" that are removed from text because they add a lot of noise to the data and make it harder for algorithms to identify meaningful patterns and relationships. Additionally, stopwords are often not useful for many natural language processing tasks such as text classification and sentiment analysis.
Removing stopwords also reduces the dimensionality of the data, making it more computationally efficient to process, and shrinks the vector space of the text, which in turn can improve the effectiveness of the subsequent analysis.
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# Build the set of English stopwords (requires the 'stopwords' corpus)
stop_words = set(stopwords.words("english"))
text = "This is an example sentence with stopwords."
words = word_tokenize(text)
# Keep only tokens that are not stopwords (case-insensitive)
filtered_words = [word for word in words if word.lower() not in stop_words]
print(filtered_words)
Stemming and Lemmatization With NLTK
Stemming and Lemmatization are important in natural language processing because they help to reduce the dimensionality of the data by reducing words to their base form. This can make it easier for algorithms to identify patterns and relationships in the text.
Stemming is the process of reducing words to their base form by removing suffixes, such as "-ing", "-ed", or "-ly". It can be useful for text classification, information retrieval, and other natural language processing tasks where the meaning of the word is more important than the exact form of the word.
Lemmatization, on the other hand, is the process of reducing words to their base form by taking into account their context and the meaning of the word. It is based on the idea of grouping together different inflected forms of a word so they can be analysed as a single item. Lemmatization can be more accurate than stemming because it takes into account the context and meaning of the word, but it is also more computationally expensive.
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
# Download if required
# nltk.download('wordnet')
text = "This is an example sentence for stemming and lemmatization."
words = word_tokenize(text)
# Stemming
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in words]
print(stemmed_words)
# Lemmatization
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print(lemmatized_words)
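The difference between the two is easiest to see on inflected words. A small sketch; the pos argument tells the WordNetLemmatizer the word's part of speech ("v" for verb, "a" for adjective):
from nltk.stem import PorterStemmer, WordNetLemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
# Stemming simply strips suffixes
print(stemmer.stem("running"))   # 'run'
print(stemmer.stem("better"))    # 'better'
# Lemmatization uses WordNet and the part of speech to find the base form
print(lemmatizer.lemmatize("running", pos="v"))  # 'run'
print(lemmatizer.lemmatize("better", pos="a"))   # 'good'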
Removing Special Characters With NLTK
Removing special characters is an important step in natural language processing because special characters can add noise to the data and make it harder for algorithms to identify meaningful patterns and relationships. Special characters can include punctuation marks, symbols, and other non-alphanumeric characters.
Special characters can also cause issues with text processing algorithms, such as tokenization and stemming, as they can be misinterpreted as separate tokens or affect the stemming process.
import re
# Keep only letters, digits, and whitespace
pattern = r'[^a-zA-Z0-9\s]'
text = "This is an example sentence with special characters like !@#$%."
filtered_text = re.sub(pattern, '', text)
print(filtered_text)
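Putting it together, here is a minimal sketch of a pipeline chaining the steps above (strip special characters, tokenize, remove stopwords, lemmatize); the helper name preprocess is just for illustration:
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()
def preprocess(text):
    # Strip special characters, tokenize, drop stopwords, and lemmatize
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    words = word_tokenize(text)
    words = [word.lower() for word in words if word.lower() not in stop_words]
    return [lemmatizer.lemmatize(word) for word in words]
print(preprocess("This is an example sentence with stopwords and #special characters."))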
Conclusion
The above techniques should be enough to get you started. Pre-processing is important for preparing data for tasks like clustering (check my blog on Agglomerative Clustering), classification, and other machine learning tasks. You should know these pre-processing techniques like the back of your hand.