Clean raw text and use it for tokenization, POS tagging, NER, and simple downstream applications like sentiment analysis or text classification.
Raw text contains capitals, punctuation, numbers, and high-frequency words like "the" and "is" that often don't help prediction. Cleansing means lowercasing, removing punctuation and numbers, stripping extra spaces, and optionally removing stopwords and reducing words to their stems or lemmas, so the model sees a cleaner, more uniform representation.
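For instance, here is a toy illustration of the core idea (the full regex-based cleaner appears later in this lesson):

import re

raw = "The 3 CEOs met in NYC! It's huge..."
text = re.sub(r'[^a-z\s]', ' ', raw.lower())   # lowercase, keep only letters and spaces
print(' '.join(text.split()))                  # the ceos met in nyc it s huge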
We use NLTK for tokenization, stopwords, POS tagging, NER, and lemmatization. Download the required data once.
import nltk

nltk.download('punkt')                        # for tokenizing
nltk.download('stopwords')                    # list of common words to drop
nltk.download('wordnet')                      # for lemmatization
nltk.download('averaged_perceptron_tagger')   # for POS tagging
nltk.download('maxent_ne_chunker')            # for NER with ne_chunk
nltk.download('words')                        # word list used by the NER chunker
Tokenize a sentence into words, tag each word with its part of speech (POS), and optionally run Named Entity Recognition (NER).
from nltk import word_tokenize, pos_tag, ne_chunk

sentence = 'Elon Musk founded SpaceX in California in 2002.'
print(sentence)

tokens = word_tokenize(sentence)
print("Tokens:", tokens)

pos = pos_tag(tokens)
print("POS tags:", pos)

# NER: find people, places, organizations
tree = ne_chunk(pos)
print("NER tree:", tree)
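ne_chunk returns an nltk.Tree whose labeled subtrees are the detected entities. As a minimal sketch, you can flatten it into (entity, label) pairs; note that extract_entities is our own helper, not an NLTK function:

# Sketch: collect (entity text, entity label) pairs from the NER tree.
def extract_entities(tree):
    entities = []
    for subtree in tree:
        if hasattr(subtree, 'label'):   # labeled chunks are named entities
            name = ' '.join(word for word, tag in subtree.leaves())
            entities.append((name, subtree.label()))
    return entities

print(extract_entities(tree))
# typically something like [('Elon Musk', 'PERSON'), ('SpaceX', 'ORGANIZATION'),
# ('California', 'GPE')]; exact chunking and labels can vary by NLTK version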
Lowercase the text, keep only letters and spaces, normalize whitespace, remove stopwords, then stem or lemmatize.
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^a-z\s]', ' ', text)   # keep only letters and spaces
    text = ' '.join(text.split())           # normalize spaces
    return text

def remove_stopwords(tokens):
    stop = set(stopwords.words('english'))
    return [t for t in tokens if t not in stop]

# Stemming (chopping endings)
stemmer = PorterStemmer()
print(stemmer.stem("running"))                    # run

# Lemmatization (proper dictionary form)
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("better", pos='a'))    # good
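Putting the pieces together, a full preprocessing pass might look like the sketch below; preprocess is our own wrapper around the helpers defined above, not a library function:

from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Sketch: clean -> tokenize -> drop stopwords -> lemmatize,
# reusing clean_text and remove_stopwords from above.
def preprocess(text):
    tokens = word_tokenize(clean_text(text))
    tokens = remove_stopwords(tokens)
    return [lemmatizer.lemmatize(t) for t in tokens]

print(preprocess("The cats are running faster than the dogs!"))
# ['cat', 'running', 'faster', 'dog']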
Once text is cleansed and tokenized, you can go further: run TfidfVectorizer on the cleaned text, then train a model (e.g. Logistic Regression) for sentiment or topic classification. These steps build on the NLP Basics lesson (tokenization, TF-IDF, stopwords, stemming/lemmatization). Combine the cleansing pipeline above with TfidfVectorizer and LogisticRegression (or Naive Bayes) from the Classification module for a full text classification workflow.
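A minimal end-to-end sketch, assuming scikit-learn is available and reusing the clean_text helper above; the texts and labels are tiny made-up examples purely for illustration:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data, purely illustrative -- replace with a real labeled corpus.
texts = ["i love this movie", "great film fantastic acting",
         "terrible boring plot", "i hated every minute"]
labels = [1, 1, 0, 0]   # 1 = positive, 0 = negative

# TF-IDF on cleaned text, then a linear classifier on top.
vectorizer = TfidfVectorizer(preprocessor=clean_text)   # reuse the cleaner above
X = vectorizer.fit_transform(texts)
clf = LogisticRegression()
clf.fit(X, labels)

test = vectorizer.transform(["what a fantastic movie"])
print(clf.predict(test))   # likely [1], i.e. positive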
In one sentence: why is it important to remove stopwords and normalize text before building a TF-IDF matrix for classification?