👶 ABSOLUTE BEGINNER FRIENDLY

💬 Natural Language Processing (NLP)

Teach computers to understand human language! From text messages to AI chatbots - NLP makes it possible.

Chapter 1: What is NLP?

👶 Explain Like I'm 5

NLP stands for Natural Language Processing.

It's teaching computers to understand human language - the way we talk and write!

Just like you learned to read and understand words, NLP teaches machines to do the same.

🌍 You Use NLP Every Day!

  • Siri/Alexa: Understanding your voice commands
  • Gmail: Suggesting responses, detecting spam
  • Google Translate: Converting languages
  • ChatGPT: Having conversations with AI
  • Auto-correct: Fixing your typos
  • Netflix: "Similar movies" based on descriptions

🧩 The Challenge

Computers understand numbers, not words!

Human: "I love pizza!" 🍕❤️

Computer: "01001001 00100000..." 🤖❓

NLP converts human language → Numbers the computer understands!
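
You can see this mismatch for yourself. Here's a tiny sketch (plain Python, nothing to install) that prints the numbers a computer actually stores for each character:

# Every character is stored as a number - this is all the computer "sees"
text = "I love pizza!"

for char in text:
    print(char, "→", ord(char))  # ord() gives the character's numeric code

# Output (first few lines):
# I → 73
#   → 32
# l → 108
# o → 111

Notice that 73 written in binary is 01001001 - exactly the first chunk the "computer" saw above!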

Chapter 2: Tokenization (Breaking Text Into Pieces)

👶 Explain Like I'm 5

Tokenization is like cutting a sentence into individual words!

Just like cutting a pizza into slices 🍕, we cut text into pieces called "tokens."

✂️ Tokenization Example

Original Sentence:
"Elon Musk founded SpaceX in California in 2002."

After Tokenization (cut into pieces):
┌──────┬──────┬─────────┬────────┬────┬────────────┬────┬──────┬───┐
│ Elon │ Musk │ founded │ SpaceX │ in │ California │ in │ 2002 │ . │
└──────┴──────┴─────────┴────────┴────┴────────────┴────┴──────┴───┘
  [0]    [1]     [2]       [3]    [4]      [5]      [6]   [7]  [8]
import nltk
from nltk import word_tokenize

# Download the tokenizer data (only needed once)
nltk.download('punkt')  # newer NLTK versions may also ask for 'punkt_tab'

# Our sentence
sentence = "Elon Musk founded SpaceX in California in 2002."

# Tokenize - break into individual words
tokens = word_tokenize(sentence)

print("Original sentence:", sentence)
print("Tokens:", tokens)
print("Number of tokens:", len(tokens))

# Output:
# Original sentence: Elon Musk founded SpaceX in California in 2002.
# Tokens: ['Elon', 'Musk', 'founded', 'SpaceX', 'in', 'California', 'in', '2002', '.']
# Number of tokens: 9

🤔 Why is Tokenization Useful?

Computers can't read whole sentences the way humans can. They need individual pieces!

After tokenization, we can:

  • Count how often each word appears (see the quick sketch after this list)
  • Find important words
  • Look up words in dictionaries
  • Convert words to numbers for machine learning
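
For example, counting word frequencies takes just a couple of lines once the text is tokenized. A quick sketch (the sentence here is made up for illustration):

from collections import Counter
from nltk import word_tokenize

sentence = "the cat sat on the mat because the mat was warm"

# Tokenize first, then count how often each token appears
tokens = word_tokenize(sentence)
counts = Counter(tokens)

print(counts.most_common(3))
# Output: [('the', 3), ('mat', 2), ('cat', 1)]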

Chapter 3: POS Tagging (What Type of Word?)

👶 Explain Like I'm 5

POS stands for Part of Speech.

Remember in school when you learned about nouns, verbs, adjectives?

POS tagging is teaching the computer: "This word is a noun, this one is a verb..."

┌─────────┬─────────────────────┬──────────────────────────┐
│ POS Tag │ Meaning             │ Example                  │
├─────────┼─────────────────────┼──────────────────────────┤
│ NNP     │ Proper Noun (names) │ Elon, California, SpaceX │
│ VBD     │ Verb, Past Tense    │ founded, walked, said    │
│ IN      │ Preposition         │ in, on, at, to           │
│ CD      │ Cardinal Number     │ 2002, five, 100          │
│ JJ      │ Adjective           │ big, happy, fast         │
│ RB      │ Adverb              │ quickly, very, well      │
└─────────┴─────────────────────┴──────────────────────────┘
import nltk
from nltk import word_tokenize, pos_tag

# Download necessary data
nltk.download('averaged_perceptron_tagger')  # newer NLTK versions: 'averaged_perceptron_tagger_eng'

sentence = "Elon Musk founded SpaceX in California in 2002."

# Step 1: Tokenize
tokens = word_tokenize(sentence)

# Step 2: POS Tag each token
tagged = pos_tag(tokens)

print("Word → Part of Speech:")
for word, tag in tagged:
    print(f"  {word} → {tag}")

# Output (tag meanings in parentheses added here as explanation):
# Word → Part of Speech:
#   Elon → NNP        (Proper Noun)
#   Musk → NNP        (Proper Noun)
#   founded → VBD     (Verb, Past Tense)
#   SpaceX → NNP      (Proper Noun)
#   in → IN           (Preposition)
#   California → NNP  (Proper Noun)
#   in → IN           (Preposition)
#   2002 → CD         (Cardinal Number)
#   . → .             (Punctuation)
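
Once every word has a tag, you can filter by it. For example, continuing from the `tagged` list above, here's a quick sketch that pulls out just the proper nouns:

# Keep only the proper nouns (tag 'NNP')
proper_nouns = [word for word, tag in tagged if tag == "NNP"]

print(proper_nouns)
# Output: ['Elon', 'Musk', 'SpaceX', 'California']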

Chapter 4: Named Entity Recognition (Finding Important Things)

👶 Explain Like I'm 5

NER stands for Named Entity Recognition.

It's like a highlighter that finds IMPORTANT things in text:

  • People's names (PERSON)
  • Companies/Organizations (ORGANIZATION)
  • Places (GPE - Geo-Political Entity)
  • Dates (DATE)

🔍 NER in Action

Original: "Elon Musk founded SpaceX in California in 2002."

After NER:
┌──────────────────┬────────────────┐
│      Entity      │     Type       │
├──────────────────┼────────────────┤
│    Elon Musk     │   PERSON       │
│    SpaceX        │   ORGANIZATION │
│    California    │   GPE (Place)  │
│    2002          │   DATE         │
└──────────────────┴────────────────┘
import nltk
from nltk import word_tokenize, pos_tag, ne_chunk

# Download necessary data
nltk.download('maxent_ne_chunker')  # newer NLTK versions: 'maxent_ne_chunker_tab'
nltk.download('words')

sentence = "Elon Musk founded SpaceX in California in 2002."

# Step 1: Tokenize
tokens = word_tokenize(sentence)

# Step 2: POS Tag
tagged = pos_tag(tokens)

# Step 3: Named Entity Recognition
entities = ne_chunk(tagged)

print("Named Entities Found:")
print(entities)

# Output:
# (S
#   (PERSON Elon/NNP)      ← Person's name!
#   (PERSON Musk/NNP)      ← Person's name!
#   founded/VBD
#   (ORGANIZATION SpaceX/NNP)  ← Company!
#   in/IN
#   (GPE California/NNP)   ← Place!
#   in/IN
#   2002/CD
#   ./.)
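
The result above is an NLTK Tree, which is clumsy to use directly. A common next step (a sketch, continuing from the `entities` variable above) is to flatten it into simple (entity, type) pairs:

from nltk.tree import Tree

# Subtrees are named entities; plain (word, tag) tuples are ordinary words
found = []
for node in entities:
    if isinstance(node, Tree):  # an entity chunk like (GPE California/NNP)
        name = " ".join(word for word, tag in node.leaves())
        found.append((name, node.label()))

print(found)
# Output (exact grouping can vary by NLTK version):
# [('Elon', 'PERSON'), ('Musk', 'PERSON'), ('SpaceX', 'ORGANIZATION'), ('California', 'GPE')]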

🎯 Real-World Uses of NER

  • News: Automatically categorize articles by people/companies mentioned
  • Customer Support: Extract product names from complaints
  • Finance: Find company names in news for stock analysis
  • Healthcare: Extract drug names and conditions from medical records

Chapter 5: TF-IDF (Finding Important Words)

👶 Explain Like I'm 5

TF-IDF answers: "Which words in this document are REALLY important?"

Words like "the", "is", "and" appear everywhere - they're NOT important.

Words that appear a lot in ONE document but rarely in others - THOSE are important!

What Does TF-IDF Stand For?

📊 TF-IDF = TF × IDF

TF (Term Frequency): How often does this word appear in THIS document?

    TF = (times word appears in doc) / (total words in doc)

IDF (Inverse Document Frequency): How RARE is this word across ALL documents?

    IDF = log(total docs / docs containing word)

📝 Example with 3 Documents

┌──────────┬───────────────────────────────────┐
│ Document │ Text                              │
├──────────┼───────────────────────────────────┤
│ D1       │ "Data Science is fun"             │
│ D2       │ "Python makes Data Analysis easy" │
│ D3       │ "AI and Data Science are related" │
└──────────┴───────────────────────────────────┘

"Data" appears in ALL 3 docs → IDF is LOW (common word)

"Python" appears in only 1 doc → IDF is HIGH (unique word!)

from sklearn.feature_extraction.text import TfidfVectorizer

# Our 3 documents
docs = [
    "Data Science is fun",
    "Python makes Data Analysis easy",
    "AI and Data Science are related"
]

# Create the TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Calculate TF-IDF for each word in each document
tfidf_matrix = vectorizer.fit_transform(docs)

# See the vocabulary (unique words)
print("Words found:")
print(vectorizer.get_feature_names_out())

# Output: ['ai', 'analysis', 'and', 'are', 'data', 'easy', 'fun', 
#          'is', 'makes', 'python', 'related', 'science']

# See the TF-IDF scores
print("\nTF-IDF Matrix (rows=docs, cols=words):")
print(tfidf_matrix.toarray().round(2))

# Note: Higher numbers = more important in that document!
# "data" has low scores (common), "python" has high score (unique)

Remove Unimportant Words (Stop Words)

# Remove common words like "is", "and", "the"
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(docs)

print("Words after removing stop words:")
print(vectorizer.get_feature_names_out())

# Output: ['ai', 'analysis', 'data', 'easy', 'fun', 
#          'makes', 'python', 'related', 'science']
# Notice: 'is', 'and', 'are' are removed!

Chapter 6: Stemming & Lemmatization (Finding Word Roots)

🤔 The Problem

Consider: "playing", "plays", "played", "player"

These are all forms of the same word "play"!

But a computer sees them as 4 DIFFERENT words.

We need to reduce them to their ROOT form!

Two Ways to Find Roots

✂️ Stemming (Quick & Rough)

Just chops off the end of words!

  • playing → play
  • happily → happili ❌
  • running → run

Fast but can create non-words!

📚 Lemmatization (Proper & Smart)

Uses a dictionary to find the real root!

  • playing → play ✅
  • happily → happy ✅
  • better → good ✅

Slower, but the result is always a real word!

(One catch: the lemmatizer needs to be told each word's part of speech to pull this off. NLTK assumes everything is a noun by default, as you'll see below.)

from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
import nltk

nltk.download('wordnet')

sentence = "The children are playing happily while their teacher watches them."

# Create stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Tokenize the sentence
tokens = word_tokenize(sentence)

# Apply stemming and lemmatization to each token
stems = [stemmer.stem(word) for word in tokens]
lemmas = [lemmatizer.lemmatize(word) for word in tokens]

print("Original:", tokens)
print("Stems:", stems)
print("Lemmas:", lemmas)

# Output:
# Original: ['The', 'children', 'are', 'playing', 'happily', 'while', 'their', 'teacher', 'watches', 'them', '.']
# Stems:    ['the', 'children', 'are', 'play', 'happili', 'while', 'their', 'teacher', 'watch', 'them', '.']
#           Note: 'happili' is not a real word! ❌
# Lemmas:   ['The', 'child', 'are', 'playing', 'happily', 'while', 'their', 'teacher', 'watch', 'them', '.']
#           Note: 'children' → 'child' (a real word!) ✅
#           Note: 'playing' was left unchanged because the lemmatizer assumed it was a noun.
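
The fix for 'playing' is to tell the lemmatizer the part of speech. A quick sketch, continuing with the `lemmatizer` from above (WordNet uses 'v' for verb and 'a' for adjective):

# By default the lemmatizer treats every word as a noun
print(lemmatizer.lemmatize("playing"))           # playing (unchanged)

# Tell it the part of speech and it finds the true root
print(lemmatizer.lemmatize("playing", pos="v"))  # play
print(lemmatizer.lemmatize("better", pos="a"))   # good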


💭 Short reflection

In one sentence: when would you choose lemmatization over stemming for a sentiment or search application?


Chapter 7: Summary - NLP Pipeline

📋 Complete NLP Preprocessing Pipeline

1. Tokenization: break text into individual words/tokens
2. Lowercasing: "Hello" and "hello" should be the same word
3. Remove Stop Words: drop common words like "the", "is", "and"
4. Stemming/Lemmatization: reduce words to their root form
5. Vectorization (TF-IDF): convert text to numbers for ML
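
Here's how those five steps can look chained together. A minimal sketch reusing the tools from earlier chapters (the two example sentences are made up; step 4 uses lemmatization):

from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS

lemmatizer = WordNetLemmatizer()

def preprocess(text):
    tokens = word_tokenize(text)                         # 1. Tokenization
    tokens = [t.lower() for t in tokens if t.isalpha()]  # 2. Lowercasing (also drops punctuation)
    tokens = [t for t in tokens if t not in ENGLISH_STOP_WORDS]  # 3. Remove stop words
    tokens = [lemmatizer.lemmatize(t) for t in tokens]   # 4. Lemmatization
    return " ".join(tokens)

docs = ["The children are playing happily!", "A child plays in the park."]
cleaned = [preprocess(d) for d in docs]
print(cleaned)
# e.g. ['child playing happily', 'child play park']

vectorizer = TfidfVectorizer()                           # 5. Vectorization (TF-IDF)
tfidf_matrix = vectorizer.fit_transform(cleaned)
print(vectorizer.get_feature_names_out())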

┌───────────────┬──────────────────────────┬─────────────────────────────────────┐
│ Concept       │ What It Does             │ Example                             │
├───────────────┼──────────────────────────┼─────────────────────────────────────┤
│ Tokenization  │ Splits text into words   │ "Hello world" → ["Hello", "world"]  │
│ POS Tagging   │ Labels word types        │ "run" → VB (verb)                   │
│ NER           │ Finds named entities     │ "Elon" → PERSON                     │
│ TF-IDF        │ Measures word importance │ Unique words get high scores        │
│ Stemming      │ Chops word endings       │ "running" → "run"                   │
│ Lemmatization │ Finds proper root        │ "better" → "good"                   │
└───────────────┴──────────────────────────┴─────────────────────────────────────┘

🎉 Congratulations!

You now understand the basics of NLP!

These techniques power chatbots, search engines, and AI assistants!