Teach computers to understand human language! From text messages to AI chatbots - NLP makes it possible.
NLP stands for Natural Language Processing.
It's teaching computers to understand human language - the way we talk and write!
Just like you learned to read and understand words, NLP teaches machines to do the same.
Computers understand numbers, not words!
Human: "I love pizza!" 🍕❤️
Computer: "01001001 00100000..." 🤖❓
NLP converts human language → Numbers the computer understands!
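To see the "words → numbers" idea for yourself, here is a tiny sketch (just an illustration, not how real NLP models encode text) that prints the numbers a computer stores for a piece of text:

```python
text = "I love pizza!"

# Every character is stored as a number (its Unicode code point)
numbers = [ord(ch) for ch in text]
print(numbers)
# [73, 32, 108, 111, 118, 101, 32, 112, 105, 122, 122, 97, 33]

# And each number is really binary underneath, e.g. 'I' = 01001001
print(format(ord("I"), "08b"))  # 01001001
```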
Tokenization is like cutting a sentence into individual words!
Just like cutting a pizza into slices 🍕, we cut text into pieces called "tokens."
Original Sentence: "Elon Musk founded SpaceX in California in 2002."

After Tokenization (cut into pieces):

```
┌──────┬──────┬─────────┬────────┬────┬────────────┬────┬──────┬───┐
│ Elon │ Musk │ founded │ SpaceX │ in │ California │ in │ 2002 │ . │
└──────┴──────┴─────────┴────────┴────┴────────────┴────┴──────┴───┘
  [0]    [1]      [2]      [3]    [4]      [5]      [6]   [7]   [8]
```
```python
import nltk
from nltk import word_tokenize

# Download the tokenizer data (only needed once)
nltk.download('punkt')

# Our sentence
sentence = "Elon Musk founded SpaceX in California in 2002."

# Tokenize - break into individual words
tokens = word_tokenize(sentence)

print("Original sentence:", sentence)
print("Tokens:", tokens)
print("Number of tokens:", len(tokens))

# Output:
# Original sentence: Elon Musk founded SpaceX in California in 2002.
# Tokens: ['Elon', 'Musk', 'founded', 'SpaceX', 'in', 'California', 'in', '2002', '.']
# Number of tokens: 9
```
Computers can't read sentences like humans. They need individual pieces!
After tokenization, we can count how often each word appears, label each word's part of speech, find named entities, and measure which words matter most.
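For instance, counting word frequencies takes only a couple of lines once we have tokens (a minimal sketch using Python's built-in `Counter`; the token list is the output of the example above):

```python
from collections import Counter

tokens = ['Elon', 'Musk', 'founded', 'SpaceX', 'in', 'California', 'in', '2002', '.']

# Count how often each token appears
counts = Counter(tokens)
print(counts.most_common(3))
# [('in', 2), ('Elon', 1), ('Musk', 1)]
```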
POS stands for Part of Speech.
Remember in school when you learned about nouns, verbs, adjectives?
POS tagging is teaching the computer: "This word is a noun, this one is a verb..."
| POS Tag | Meaning | Example |
|---|---|---|
| NNP | Proper Noun (names) | Elon, California, SpaceX |
| VBD | Verb, Past Tense | founded, walked, said |
| IN | Preposition | in, on, at, to |
| CD | Cardinal Number | 2002, five, 100 |
| JJ | Adjective | big, happy, fast |
| RB | Adverb | quickly, very, well |
```python
import nltk
from nltk import word_tokenize, pos_tag

# Download necessary data
nltk.download('averaged_perceptron_tagger')

sentence = "Elon Musk founded SpaceX in California in 2002."

# Step 1: Tokenize
tokens = word_tokenize(sentence)

# Step 2: POS Tag each token
tagged = pos_tag(tokens)

print("Word → Part of Speech:")
for word, tag in tagged:
    print(f"  {word} → {tag}")

# Output:
# Word → Part of Speech:
#   Elon → NNP        (Proper Noun)
#   Musk → NNP        (Proper Noun)
#   founded → VBD     (Verb, Past Tense)
#   SpaceX → NNP      (Proper Noun)
#   in → IN           (Preposition)
#   California → NNP  (Proper Noun)
#   in → IN           (Preposition)
#   2002 → CD         (Cardinal Number)
#   . → .             (Punctuation)
```
NER stands for Named Entity Recognition.
It's like a highlighter that finds IMPORTANT things in text:
Original: "Elon Musk founded SpaceX in California in 2002." After NER: ┌──────────────────┬────────────────┐ │ Entity │ Type │ ├──────────────────┼────────────────┤ │ Elon Musk │ PERSON │ │ SpaceX │ ORGANIZATION │ │ California │ GPE (Place) │ │ 2002 │ DATE │ └──────────────────┴────────────────┘
```python
import nltk
from nltk import word_tokenize, pos_tag, ne_chunk

# Download necessary data
nltk.download('maxent_ne_chunker')
nltk.download('words')

sentence = "Elon Musk founded SpaceX in California in 2002."

# Step 1: Tokenize
tokens = word_tokenize(sentence)

# Step 2: POS Tag
tagged = pos_tag(tokens)

# Step 3: Named Entity Recognition
entities = ne_chunk(tagged)

print("Named Entities Found:")
print(entities)

# Output:
# (S
#   (PERSON Elon/NNP)          ← Person's name!
#   (PERSON Musk/NNP)          ← Person's name!
#   founded/VBD
#   (ORGANIZATION SpaceX/NNP)  ← Company!
#   in/IN
#   (GPE California/NNP)       ← Place!
#   in/IN
#   2002/CD
#   ./.)
```
TF-IDF answers: "Which words in this document are REALLY important?"
Words like "the", "is", "and" appear everywhere - they're NOT important.
Words that appear a lot in ONE document but rarely in others - THOSE are important!
How often does this word appear in THIS document?
TF = (times word appears) / (total words in doc)
How RARE is this word across ALL documents?
IDF = log(total docs / docs containing word)
| Document | Text |
|---|---|
| D1 | "Data Science is fun" |
| D2 | "Python makes Data Analysis easy" |
| D3 | "AI and Data Science are related" |
"Data" appears in ALL 3 docs → IDF is LOW (common word)
"Python" appears in only 1 doc → IDF is HIGH (unique word!)
```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Our 3 documents
docs = [
    "Data Science is fun",
    "Python makes Data Analysis easy",
    "AI and Data Science are related"
]

# Create the TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Calculate TF-IDF for each word in each document
tfidf_matrix = vectorizer.fit_transform(docs)

# See the vocabulary (unique words)
print("Words found:")
print(vectorizer.get_feature_names_out())
# Output: ['ai', 'analysis', 'and', 'are', 'data', 'easy', 'fun',
#          'is', 'makes', 'python', 'related', 'science']

# See the TF-IDF scores
print("\nTF-IDF Matrix (rows=docs, cols=words):")
print(tfidf_matrix.toarray().round(2))

# Note: Higher numbers = more important in that document!
# "data" has low scores (common), "python" has a high score (unique)
```
```python
# Remove common words like "is", "and", "the"
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(docs)

print("Words after removing stop words:")
print(vectorizer.get_feature_names_out())
# Output: ['ai', 'analysis', 'data', 'easy', 'fun',
#          'makes', 'python', 'related', 'science']
# Notice: 'is', 'and', 'are' are removed!
```
Consider: "playing", "plays", "played", "player"
These are all forms of the same word "play"!
But a computer sees them as 4 DIFFERENT words.
We need to reduce them to their ROOT form!
There are two ways to do this:

- Stemming: just chops off the end of words! Fast, but can create non-words!
- Lemmatization: uses a dictionary to find the real root! Slower, but always real words!
```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('wordnet')

sentence = "The children are playing happily while their teacher watches them."

# Create stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Tokenize the sentence
tokens = word_tokenize(sentence)

# Apply stemming and lemmatization to each token
stems = [stemmer.stem(word) for word in tokens]
lemmas = [lemmatizer.lemmatize(word) for word in tokens]

print("Original:", tokens)
print("Stems:", stems)
print("Lemmas:", lemmas)

# Output:
# Original: ['The', 'children', 'are', 'playing', 'happily', 'while', 'their', 'teacher', 'watches', 'them', '.']
# Stems:    ['the', 'children', 'are', 'play', 'happili', 'while', 'their', 'teacher', 'watch', 'them', '.']
#   Note: 'happili' is not a real word! ❌
# Lemmas:   ['The', 'child', 'are', 'playing', 'happily', 'while', 'their', 'teacher', 'watch', 'them', '.']
#   Note: 'children' → 'child' (a proper word!) ✅
```
In one sentence: when would you choose lemmatization over stemming for a sentiment or search application?
The typical text-preprocessing pipeline (a combined example follows this list):

- Tokenization: break text into individual words/tokens
- Lowercasing: "Hello" and "hello" should be the same word
- Stop word removal: remove common words like "the", "is", "and"
- Stemming/Lemmatization: reduce words to their root form
- Vectorization (e.g., TF-IDF): convert text to numbers for ML
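Putting it all together, here is a minimal sketch of such a pipeline with NLTK and scikit-learn (the `preprocess` helper and the two tiny example documents are just for illustration):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

docs = ["Data Science is fun", "Python makes Data Analysis easy"]

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    tokens = word_tokenize(text.lower())                  # 1. tokenize + 2. lowercase
    tokens = [t for t in tokens if t.isalpha()]           #    keep only words
    tokens = [t for t in tokens if t not in stop_words]   # 3. remove stop words
    tokens = [lemmatizer.lemmatize(t) for t in tokens]    # 4. lemmatize
    return " ".join(tokens)

cleaned = [preprocess(d) for d in docs]
print(cleaned)  # e.g. ['data science fun', 'python make data analysis easy']

# 5. convert text to numbers
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(cleaned)
print(vectorizer.get_feature_names_out())
```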
| Concept | What It Does | Example |
|---|---|---|
| Tokenization | Splits text into words | "Hello world" → ["Hello", "world"] |
| POS Tagging | Labels word types | "run" → VB (verb) |
| NER | Finds named entities | "Elon" → PERSON |
| TF-IDF | Measures word importance | Unique words get high scores |
| Stemming | Chops word endings | "running" → "run" |
| Lemmatization | Finds proper root | "better" → "good" |
You now understand the basics of NLP!
These techniques power chatbots, search engines, and AI assistants!