πŸ’¬ TEXT PIPELINE

NLP: Data Cleansing & Applications

Clean raw text and use it for tokenization, POS tagging, NER, and simple applications like sentiment or classification.

Why Data Cleansing?

πŸ‘Ά In Simple Terms

Raw text has capitals, punctuation, numbers, and words like "the" and "is" that often don’t help prediction. Cleansing = lowercasing, removing punctuation/numbers, stripping spaces, and optionally removing stopwords and reducing words to stems/lemmas so the model sees a cleaner, more uniform representation.

NLTK Setup

We use NLTK for tokenizing, stopwords, and lemmatization. Download the required data once.

import nltk
nltk.download('punkt')                       # tokenizer models
nltk.download('stopwords')                   # list of common words to drop
nltk.download('wordnet')                     # dictionary for lemmatization
nltk.download('averaged_perceptron_tagger')  # for POS tagging
nltk.download('maxent_ne_chunker')           # NER model (needed for ne_chunk below)
nltk.download('words')                       # word list the NER chunker relies on

Basic Preprocessing: Tokenize, POS, NER

Tokenize a sentence into words, tag each word with its part of speech (POS), and optionally run Named Entity Recognition (NER).

from nltk import word_tokenize, pos_tag, ne_chunk

sentence = 'Elon Musk founded SpaceX in California in 2002.'
print(sentence)

tokens = word_tokenize(sentence)
print("Tokens:", tokens)

pos = pos_tag(tokens)
print("POS tags:", pos)

# NER: find people, places, organizations
tree = ne_chunk(pos)
print("NER tree:", tree)

Data Cleansing Pipeline

Lowercase the text, keep only letters and spaces, normalize whitespace, remove stopwords, then stem or lemmatize.

import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^a-z\s]', ' ', text)   # keep only letters and spaces
    text = ' '.join(text.split())           # normalize spaces
    return text

def remove_stopwords(tokens):
    stop = set(stopwords.words('english'))
    return [t for t in tokens if t not in stop]

# Stemming (chopping endings)
stemmer = PorterStemmer()
print(stemmer.stem("running"))   # run

# Lemmatization (proper dictionary form)
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("better", pos='a'))  # good

Applications

Once text is cleansed and tokenized, you can build simple applications such as sentiment analysis or text classification.

These build on the NLP Basics lesson (tokenization, TF-IDF, stopwords, stem/lemma). Combine the cleansing pipeline above with TfidfVectorizer and LogisticRegression (or Naive Bayes) from the Classification module for a full text classification workflow.
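A minimal sketch of that workflow with scikit-learn (the tiny labeled dataset here is invented for illustration; real tasks need far more text):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy sentiment data, made up for this sketch.
texts = [
    "loved the movie great acting",
    "fantastic plot wonderful film",
    "terrible boring waste of time",
    "awful acting bad plot",
]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

# TfidfVectorizer lowercases and tokenizes on its own; feed it the output
# of the cleansing pipeline above for heavier preprocessing.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)
print(model.predict(["great wonderful film"]))
```

Swapping LogisticRegression for MultinomialNB gives the Naive Bayes variant mentioned above with no other changes to the pipeline.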

🚫 Common Mistakes in NLP Cleansing

πŸ’­ Short reflection

In one sentence: why is it important to remove stopwords and normalize text before building a TF-IDF matrix for classification?

βœ… CORE (Must know)

πŸ“š NON-CORE (Good to know)