Learn how machines make decisions like humans do - by asking yes/no questions!
Remember playing "20 Questions"? 🎮

"Is it an animal?" → YES
"Does it have 4 legs?" → YES
"Does it bark?" → YES
"It's a DOG!" 🐶
A Decision Tree works exactly like this! It asks simple questions to make predictions.
A decision tree is a flowchart of yes/no questions: you start at the top, answer each question, follow the branch, and when you reach a leaf you get the prediction (e.g. "play tennis" or "don't play"). The algorithm learns which questions to ask and in what order from the training data.
A single tree can overfit: it memorizes the training data and gets confused on new data. A Random Forest builds many trees (each on a random subset of data and features) and combines their votes. That usually gives a more stable and accurate model, like asking many people instead of one.
```
           🌤️ What's the weather?
          /          |          \
      Sunny      Overcast      Rainy
        |            |            |
 💨 Is it windy?  ✅ PLAY!   💨 Is it windy?
    /       \                /       \
  Yes       No             Yes       No
   |         |              |         |
❌ NO PLAY ✅ PLAY      ❌ NO PLAY ✅ PLAY
```
Each question splits the data into smaller groups.
The goal? Make each group as "pure" as possible (all Yes or all No).
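To see how mechanical this flowchart really is, here's the weather tree above written out as plain Python if/else statements (a hand-built sketch, not one learned from data):

```python
def play_tennis(weather: str, windy: bool) -> str:
    """The weather tree from the diagram, as nested questions."""
    if weather == "Overcast":
        return "PLAY"        # pure leaf: every overcast day was a play day
    # Both Sunny and Rainy ask the same follow-up question
    if windy:
        return "NO PLAY"
    return "PLAY"

print(play_tennis("Overcast", windy=True))   # PLAY
print(play_tennis("Sunny", windy=True))      # NO PLAY
print(play_tennis("Rainy", windy=False))     # PLAY
```

The learning algorithm's whole job is to discover this structure, which question to ask and in what order, automatically.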
| Part | What It Is | Example |
|---|---|---|
| 🌱 Root Node | The first question (top of tree) | "What's the weather?" |
| 🌿 Internal Node | Questions in the middle | "Is it windy?" |
| 🔀 Branch | The answer paths | "Yes" or "No" |
| 🍃 Leaf Node | Final prediction (bottom) | "Play Tennis" or "Don't Play" |
"Which question separates my data best?"
Imagine a box of 50 red balls and 50 blue balls:
The algorithm uses math (Gini Impurity or Entropy) to measure "purity."
Gini = 0 → Perfectly pure (all same class) ✅
Gini = 0.5 → Completely mixed (50-50 split) ❌
The algorithm picks the question that reduces Gini the most!
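Here's the ball-box example worked out in a few lines of Python (`gini` is a small helper written for this sketch, not a library function):

```python
def gini(counts):
    """Gini impurity: 1 - sum of squared class proportions."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(gini([50, 50]))   # 0.5   -> completely mixed (50 red, 50 blue)
print(gini([75, 25]))   # 0.375 -> better, but still impure
print(gini([100, 0]))   # 0.0   -> perfectly pure (all one color)
```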
```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

# Load the famous Iris dataset (flowers!)
iris = load_iris()
X = iris.data    # Features (petal length, width, etc.)
y = iris.target  # Labels (Setosa, Versicolor, Virginica)

# Split data: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Create and train the Decision Tree
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)

# Check accuracy
accuracy = tree.score(X_test, y_test)
print(f"Accuracy: {accuracy:.2%}")  # Output: Accuracy: 100.00%
```
```python
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

# Create a beautiful tree visualization
plt.figure(figsize=(20, 10))
plot_tree(tree,
          feature_names=iris.feature_names,
          class_names=iris.target_names,
          filled=True, rounded=True, fontsize=10)
plt.title("Decision Tree for Iris Classification")
plt.show()
```
```
        🌸 Is petal_length ≤ 2.45?
               /          \
             YES           NO
              |             |
        🟡 SETOSA    Is petal_width ≤ 1.75?
         (50/0/0)        /          \
                       YES           NO
                        |             |
                🔵 VERSICOLOR    🟣 VIRGINICA
                   (0/49/1)        (0/1/45)
```
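You don't have to read that structure off a plot; scikit-learn can print the learned rules as text with `export_text` (reusing the `tree` and `iris` objects from the code above):

```python
from sklearn.tree import export_text

# Print the learned rules as indented if/else-style text
print(export_text(tree, feature_names=iris.feature_names))
```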
Imagine asking one person for restaurant advice.
They might be biased! Maybe they hate spicy food, or only know cheap places.
What if you asked 100 people and went with the majority vote? 🗳️
That's the idea behind Random Forests!
| Problem | What Happens |
|---|---|
| Overfitting | Tree memorizes training data, fails on new data |
| High Variance | Small change in data = completely different tree |
| Instability | Remove one data point, entire tree changes |
A shallow tree (depth 1) underfits. A deep tree (depth 10) overfits. As depth grows, training error always drops, but test error starts rising again past the sweet spot. That's overfitting!
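You can chart that sweet spot yourself by sweeping `max_depth` and comparing train vs. test accuracy (this reuses the `X_train`/`X_test` split from the iris code above; iris is so small and clean that the gap may be mild, so treat this as a sketch of the method):

```python
# Compare training vs. test accuracy as the tree gets deeper
for depth in range(1, 11):
    t = DecisionTreeClassifier(max_depth=depth, random_state=42)
    t.fit(X_train, y_train)
    print(f"depth={depth:2d}  "
          f"train={t.score(X_train, y_train):.2%}  "
          f"test={t.score(X_test, y_test):.2%}")
```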
It's exactly what it sounds like - a forest of decision trees!
Instead of 1 tree, we build 100+ trees. Each tree votes, majority wins! 🗳️
| 🌳 One Decision Tree | 🌲🌲🌲 Random Forest |
|---|---|
| One tree, one decision | 100+ trees voting together |
| High risk of overfitting | Much more robust |
| Can be unstable | Wisdom of the crowd! |
Each tree is built differently, using two sources of randomness:

- **Bagging:** each tree trains on a random bootstrap sample of the training rows (drawn with replacement).
- **Feature randomness:** at each split, the tree only considers a random subset of the features.

This creates diverse trees that make different mistakes! When they vote together, the mistakes tend to cancel out. 🎯
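Here's a minimal hand-rolled version of that voting idea, using the same train/test split as before. This is only a sketch to show the mechanics; the real `RandomForestClassifier` in the next snippet does all of this (and more) for you:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
trees = []
for i in range(25):
    # Bagging: a bootstrap sample of the training rows (with replacement)
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Feature randomness: each split considers only sqrt(n_features) features
    t = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    t.fit(X_train[idx], y_train[idx])
    trees.append(t)

# Every tree votes on every test sample; the majority class wins
votes = np.stack([t.predict(X_test) for t in trees])   # shape: (25, n_test)
majority = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
print(f"Hand-rolled forest accuracy: {(majority == y_test).mean():.2%}")
```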
```python
from sklearn.ensemble import RandomForestClassifier

# Create a Random Forest with 100 trees
forest = RandomForestClassifier(
    n_estimators=100,  # 100 trees in the forest
    max_depth=5,       # Max depth of each tree
    random_state=42
)

# Train the forest
forest.fit(X_train, y_train)

# Check accuracy
accuracy = forest.score(X_test, y_test)
print(f"Random Forest Accuracy: {accuracy:.2%}")
# Output: Random Forest Accuracy: 100.00%
```
```python
import pandas as pd

# See which features matter most!
importance = pd.DataFrame({
    'Feature': iris.feature_names,
    'Importance': forest.feature_importances_
}).sort_values('Importance', ascending=False)
print(importance)
#           Feature  Importance
# 2    petal length        0.44   <- Most important!
# 3     petal width        0.42
# 0    sepal length        0.10
# 1     sepal width        0.04   <- Least important
```
Petal length and petal width are the most useful features for classifying iris flowers!
This is a FREE bonus from Random Forests - you learn which features matter most! 🎁
| Situation | Use This | Why? |
|---|---|---|
| Need to explain decisions | 🌳 Decision Tree | Easy to visualize and explain to stakeholders |
| Need high accuracy | 🌲🌲🌲 Random Forest | More robust, less overfitting |
| Want to know important features | 🌲🌲🌲 Random Forest | Provides feature importance scores |
| Small dataset | 🌳 Decision Tree | Simpler, less likely to overfit |
| Large dataset | 🌲🌲🌲 Random Forest | Can capture complex patterns |
| Concept | Simple Explanation |
|---|---|
| Decision Tree | Asks yes/no questions to make predictions (like 20 Questions) |
| Root Node | First question at the top |
| Leaf Node | Final prediction at the bottom |
| Gini Impurity | Measures how "mixed" a group is (0 = pure, 0.5 = mixed) |
| Random Forest | Many trees voting together (wisdom of the crowd) |
| Feature Importance | Which features matter most for predictions |
Decision Trees and Random Forests are among the most powerful and widely-used algorithms in data science!
In one sentence: why does a Random Forest usually generalize better than a single deep decision tree on the same data?