πŸ’¬ TEXT PIPELINE

NLP: Data Cleansing & Applications

Clean raw text and use it for tokenization, POS tagging, NER, and simple applications like sentiment or classification.

Why Data Cleansing?

πŸ‘Ά In Simple Terms

Raw text has capitals, punctuation, numbers, and words like "the" and "is" that often don’t help prediction. Cleansing = lowercasing, removing punctuation/numbers, stripping spaces, and optionally removing stopwords and reducing words to stems/lemmas so the model sees a cleaner, more uniform representation.

NLTK Setup

We use NLTK for tokenizing, stopwords, and lemmatization. Download the required data once.

import nltk
nltk.download('punkt')                       # tokenizer models
nltk.download('stopwords')                   # list of common words to drop
nltk.download('wordnet')                     # dictionary for lemmatization
nltk.download('averaged_perceptron_tagger')  # for POS tagging
nltk.download('maxent_ne_chunker')           # NER model (needed for ne_chunk below)
nltk.download('words')                       # word list the NER chunker relies on

Basic Preprocessing: Tokenize, POS, NER

Tokenize a sentence into words, tag each word with its part of speech (POS), and optionally run Named Entity Recognition (NER).

from nltk import word_tokenize, pos_tag, ne_chunk

sentence = 'Elon Musk founded SpaceX in California in 2002.'
print(sentence)

tokens = word_tokenize(sentence)
print("Tokens:", tokens)

pos = pos_tag(tokens)
print("POS tags:", pos)

# NER: find people, places, organizations
tree = ne_chunk(pos)
print("NER tree:", tree)

Data Cleansing Pipeline

Lowercase the text, keep only letters and spaces, normalize whitespace, remove stopwords, then stem or lemmatize.

import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^a-z\s]', ' ', text)   # keep only letters and spaces
    text = ' '.join(text.split())           # normalize spaces
    return text

def remove_stopwords(tokens):
    stop = set(stopwords.words('english'))
    return [t for t in tokens if t not in stop]

# Stemming (chopping endings)
stemmer = PorterStemmer()
print(stemmer.stem("running"))   # run

# Lemmatization (proper dictionary form)
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("better", pos='a'))  # good

Applications

Once text is cleansed and tokenized, you can build simple applications such as sentiment analysis or text classification.

These build on the NLP Basics lesson (tokenization, TF-IDF, stopwords, stem/lemma). Combine the cleansing pipeline above with TfidfVectorizer and LogisticRegression (or Naive Bayes) from the Classification module for a full text classification workflow.
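A minimal sketch of that workflow with scikit-learn (the tiny labeled dataset here is invented for illustration; real tasks need far more text):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy sentiment data, made up for this sketch.
texts = [
    "loved the movie great acting",
    "fantastic plot wonderful film",
    "terrible boring waste of time",
    "awful acting bad plot",
]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

# TfidfVectorizer lowercases and tokenizes on its own; feed it the output
# of the cleansing pipeline above for heavier preprocessing.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)
print(model.predict(["great wonderful film"]))
```

Swapping LogisticRegression for MultinomialNB gives the Naive Bayes variant mentioned above with no other changes to the pipeline.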

🚫 Common Mistakes in NLP Cleansing

πŸ’­ Short reflection

In one sentence: why is it important to remove stopwords and normalize text before building a TF-IDF matrix for classification?

βœ… CORE (Must know)

πŸ“š NON-CORE (Good to know)