Teach computers to understand human language! From text messages to AI chatbots - NLP makes it possible.
NLP stands for Natural Language Processing.
It's teaching computers to understand human language - the way we talk and write!
Just like you learned to read and understand words, NLP teaches machines to do the same.
Computers understand numbers, not words!
Human: "I love pizza!" 🍕❤️
Computer: "01001001 00100000..." 🤖❓
NLP converts human language → Numbers the computer understands!
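To see the "words → numbers" idea for yourself, here is a tiny sketch (just an illustration, not how real NLP models encode text) that prints the numbers a computer stores for a piece of text:

```python
text = "I love pizza!"

# Every character is stored as a number (its Unicode code point)
numbers = [ord(ch) for ch in text]
print(numbers)
# [73, 32, 108, 111, 118, 101, 32, 112, 105, 122, 122, 97, 33]

# And each number is really binary underneath, e.g. 'I' = 01001001
print(format(ord("I"), "08b"))  # 01001001
```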
Tokenization is like cutting a sentence into individual words!
Just like cutting a pizza into slices 🍕, we cut text into pieces called "tokens."
Original Sentence: "Elon Musk founded SpaceX in California in 2002."

After Tokenization (cut into pieces):

```
┌──────┬──────┬─────────┬────────┬────┬────────────┬────┬──────┬───┐
│ Elon │ Musk │ founded │ SpaceX │ in │ California │ in │ 2002 │ . │
└──────┴──────┴─────────┴────────┴────┴────────────┴────┴──────┴───┘
  [0]    [1]      [2]      [3]    [4]      [5]      [6]   [7]   [8]
```
```python
import nltk
from nltk import word_tokenize

# Download the tokenizer data (only needed once)
nltk.download('punkt')

# Our sentence
sentence = "Elon Musk founded SpaceX in California in 2002."

# Tokenize - break into individual words
tokens = word_tokenize(sentence)

print("Original sentence:", sentence)
print("Tokens:", tokens)
print("Number of tokens:", len(tokens))

# Output:
# Original sentence: Elon Musk founded SpaceX in California in 2002.
# Tokens: ['Elon', 'Musk', 'founded', 'SpaceX', 'in', 'California', 'in', '2002', '.']
# Number of tokens: 9
```
Computers can't read sentences like humans. They need individual pieces!
After tokenization, we can count how often each word appears, label each word's part of speech, find named entities, and measure which words matter most.
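For instance, counting word frequencies takes only a couple of lines once we have tokens (a minimal sketch using Python's built-in `Counter`; the token list is the output of the example above):

```python
from collections import Counter

tokens = ['Elon', 'Musk', 'founded', 'SpaceX', 'in', 'California', 'in', '2002', '.']

# Count how often each token appears
counts = Counter(tokens)
print(counts.most_common(3))
# [('in', 2), ('Elon', 1), ('Musk', 1)]
```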
POS stands for Part of Speech.
Remember in school when you learned about nouns, verbs, adjectives?
POS tagging is teaching the computer: "This word is a noun, this one is a verb..."
| POS Tag | Meaning | Example |
|---|---|---|
| NNP | Proper Noun (names) | Elon, California, SpaceX |
| VBD | Verb, Past Tense | founded, walked, said |
| IN | Preposition | in, on, at, to |
| CD | Cardinal Number | 2002, five, 100 |
| JJ | Adjective | big, happy, fast |
| RB | Adverb | quickly, very, well |
```python
import nltk
from nltk import word_tokenize, pos_tag

# Download necessary data
nltk.download('averaged_perceptron_tagger')

sentence = "Elon Musk founded SpaceX in California in 2002."

# Step 1: Tokenize
tokens = word_tokenize(sentence)

# Step 2: POS Tag each token
tagged = pos_tag(tokens)

print("Word → Part of Speech:")
for word, tag in tagged:
    print(f"  {word} → {tag}")

# Output:
# Word → Part of Speech:
#   Elon → NNP        (Proper Noun)
#   Musk → NNP        (Proper Noun)
#   founded → VBD     (Verb, Past Tense)
#   SpaceX → NNP      (Proper Noun)
#   in → IN           (Preposition)
#   California → NNP  (Proper Noun)
#   in → IN           (Preposition)
#   2002 → CD         (Cardinal Number)
#   . → .             (Punctuation)
```
NER stands for Named Entity Recognition.
It's like a highlighter that finds IMPORTANT things in text:
Original: "Elon Musk founded SpaceX in California in 2002." After NER: ┌──────────────────┬────────────────┐ │ Entity │ Type │ ├──────────────────┼────────────────┤ │ Elon Musk │ PERSON │ │ SpaceX │ ORGANIZATION │ │ California │ GPE (Place) │ │ 2002 │ DATE │ └──────────────────┴────────────────┘
```python
import nltk
from nltk import word_tokenize, pos_tag, ne_chunk

# Download necessary data
nltk.download('maxent_ne_chunker')
nltk.download('words')

sentence = "Elon Musk founded SpaceX in California in 2002."

# Step 1: Tokenize
tokens = word_tokenize(sentence)

# Step 2: POS Tag
tagged = pos_tag(tokens)

# Step 3: Named Entity Recognition
entities = ne_chunk(tagged)

print("Named Entities Found:")
print(entities)

# Output:
# (S
#   (PERSON Elon/NNP)          ← Person's name!
#   (PERSON Musk/NNP)          ← Person's name!
#   founded/VBD
#   (ORGANIZATION SpaceX/NNP)  ← Company!
#   in/IN
#   (GPE California/NNP)       ← Place!
#   in/IN
#   2002/CD
#   ./.)
```
TF-IDF answers: "Which words in this document are REALLY important?"
Words like "the", "is", "and" appear everywhere - they're NOT important.
Words that appear a lot in ONE document but rarely in others - THOSE are important!
How often does this word appear in THIS document?
TF = (times word appears) / (total words in doc)
How RARE is this word across ALL documents?
IDF = log(total docs / docs containing word)
| Document | Text |
|---|---|
| D1 | "Data Science is fun" |
| D2 | "Python makes Data Analysis easy" |
| D3 | "AI and Data Science are related" |
"Data" appears in ALL 3 docs → IDF is LOW (common word)
"Python" appears in only 1 doc → IDF is HIGH (unique word!)
```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Our 3 documents
docs = [
    "Data Science is fun",
    "Python makes Data Analysis easy",
    "AI and Data Science are related"
]

# Create the TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Calculate TF-IDF for each word in each document
tfidf_matrix = vectorizer.fit_transform(docs)

# See the vocabulary (unique words)
print("Words found:")
print(vectorizer.get_feature_names_out())
# Output: ['ai', 'analysis', 'and', 'are', 'data', 'easy', 'fun',
#          'is', 'makes', 'python', 'related', 'science']

# See the TF-IDF scores
print("\nTF-IDF Matrix (rows=docs, cols=words):")
print(tfidf_matrix.toarray().round(2))

# Note: Higher numbers = more important in that document!
# "data" has low scores (common), "python" has a high score (unique)
```
```python
# Remove common words like "is", "and", "the"
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(docs)

print("Words after removing stop words:")
print(vectorizer.get_feature_names_out())
# Output: ['ai', 'analysis', 'data', 'easy', 'fun',
#          'makes', 'python', 'related', 'science']
# Notice: 'is', 'and', 'are' are removed!
```
Consider: "playing", "plays", "played", "player"
These are all forms of the same word "play"!
But a computer sees them as 4 DIFFERENT words.
We need to reduce them to their ROOT form!
There are two ways to do this:

- Stemming: just chops off the end of words! Fast, but can create non-words!
- Lemmatization: uses a dictionary to find the real root! Slower, but always real words!
```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('wordnet')

sentence = "The children are playing happily while their teacher watches them."

# Create stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Tokenize the sentence
tokens = word_tokenize(sentence)

# Apply stemming and lemmatization to each token
stems = [stemmer.stem(word) for word in tokens]
lemmas = [lemmatizer.lemmatize(word) for word in tokens]

print("Original:", tokens)
print("Stems:", stems)
print("Lemmas:", lemmas)

# Output:
# Original: ['The', 'children', 'are', 'playing', 'happily', 'while', 'their', 'teacher', 'watches', 'them', '.']
# Stems:    ['the', 'children', 'are', 'play', 'happili', 'while', 'their', 'teacher', 'watch', 'them', '.']
#   Note: 'happili' is not a real word! ❌
# Lemmas:   ['The', 'child', 'are', 'playing', 'happily', 'while', 'their', 'teacher', 'watch', 'them', '.']
#   Note: 'children' → 'child' (a proper word!) ✅
```
In one sentence: when would you choose lemmatization over stemming for a sentiment or search application?
The typical text-preprocessing pipeline (a combined example follows this list):

- Tokenization: break text into individual words/tokens
- Lowercasing: "Hello" and "hello" should be the same word
- Stop word removal: remove common words like "the", "is", "and"
- Stemming/Lemmatization: reduce words to their root form
- Vectorization (e.g., TF-IDF): convert text to numbers for ML
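Putting it all together, here is a minimal sketch of such a pipeline with NLTK and scikit-learn (the `preprocess` helper and the two tiny example documents are just for illustration):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

docs = ["Data Science is fun", "Python makes Data Analysis easy"]

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    tokens = word_tokenize(text.lower())                  # 1. tokenize + 2. lowercase
    tokens = [t for t in tokens if t.isalpha()]           #    keep only words
    tokens = [t for t in tokens if t not in stop_words]   # 3. remove stop words
    tokens = [lemmatizer.lemmatize(t) for t in tokens]    # 4. lemmatize
    return " ".join(tokens)

cleaned = [preprocess(d) for d in docs]
print(cleaned)  # e.g. ['data science fun', 'python make data analysis easy']

# 5. convert text to numbers
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(cleaned)
print(vectorizer.get_feature_names_out())
```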
| Concept | What It Does | Example |
|---|---|---|
| Tokenization | Splits text into words | "Hello world" → ["Hello", "world"] |
| POS Tagging | Labels word types | "run" → VB (verb) |
| NER | Finds named entities | "Elon" → PERSON |
| TF-IDF | Measures word importance | Unique words get high scores |
| Stemming | Chops word endings | "running" → "run" |
| Lemmatization | Finds proper root | "better" → "good" |
You now understand the basics of NLP!
These techniques power chatbots, search engines, and AI assistants!