🌲 Random Forest – Ultra-Detailed Guide & Car Evaluation

From zero to hero: what Random Forest is, why we use it, every idea explained like you know nothing, plus a full code walkthrough with the car dataset. With animations!

📥 Download the dataset first: cars_data.csv — Save it in the same folder as your script so pd.read_csv("cars_data.csv") works.

What are we building?

👶 In plain English (assume you know nothing)

Imagine you run a used-car website. For every car you have a form: buying price (very high / high / medium / low), maintenance cost, number of doors, how many people it fits, luggage size, and safety rating. You want the computer to predict the class: is this car unacc (unacceptable), acc (acceptable), good, or vgood (very good)?

Random Forest = many decision trees voting together. Each tree is like a small quiz: “Is safety high?” → Yes. “Does it fit 4+ people?” → Yes. … At the end each tree says one class. We count votes: if 70 trees say “acc” and 30 say “good”, we pick “acc”. That’s it!

We’ll also see which columns matter most (e.g. safety, number of persons) and how to measure how good the model is (confusion matrix, accuracy, precision, recall).

Part 1: Random Forest theory (ultra detail)

1.1 What is a decision tree (quick recap)

A decision tree is a flowchart of yes/no questions. You start at the top (root), answer each question, follow the branch, and when you reach a leaf you get a prediction (e.g. “unacc” or “acc”). The algorithm learns which questions to ask and in what order from the data by choosing splits that make each group as “pure” as possible (e.g. measured by Gini impurity or entropy).

👶 Layman example

“Is safety = high?” → Yes. “persons ≥ 4?” → Yes. “buying = low?” → No. → Predict acc. So the tree is just a bunch of rules you can follow by hand.
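These hand-followable rules are exactly what a tree encodes. A minimal sketch in Python (the rules and thresholds here are made up for illustration, not learned from the dataset):

```python
def tiny_tree(safety, persons, buying):
    """Toy hand-written decision tree (illustrative rules, not a trained model)."""
    if safety != "high":
        return "unacc"   # unsafe cars are rejected outright
    if persons < 4:
        return "unacc"   # must fit at least 4 people
    if buying == "low":
        return "good"    # safe, roomy, and cheap
    return "acc"         # safe and roomy, but pricey

# The example path from the text: safety=high, persons>=4, buying!=low
print(tiny_tree("high", 4, "high"))  # -> acc
```

A real tree learned by sklearn has the same if/else shape; the algorithm just chooses the questions and their order automatically.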

1.2 Why not use just one tree?

One deep tree often overfits: it memorizes the training data (including noise) and does worse on new data. Small changes in the data can change the tree a lot (high variance). So we want many trees that each see slightly different data and different features; when they vote, their individual mistakes tend to cancel out.

👶 Layman example

Asking one strict friend “should I buy this car?” might be biased. Asking 100 friends and taking the majority vote is usually more reliable. Random Forest is that: many “tree friends” voting.

1.3 Bagging (Bootstrap Aggregating)

Bagging = build many models each on a random sample of the training data, then combine their predictions (e.g. by voting). Random Forest is bagging applied to decision trees, with one extra twist: at each split inside a tree we only consider a random subset of features.

👶 What “with replacement” means

Imagine 5 rows: A, B, C, D, E. We pick 5 at random, but after each pick we put the row back. So we might get A, A, C, D, E—row A twice, B never. That’s one bootstrap sample. Another tree might get B, B, C, D, D. So each tree trains on a different “view” of the data.
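The A/B/C/D/E example above is easy to reproduce with the standard library. A minimal sketch (the seed and the resulting picks are arbitrary):

```python
import random

rows = ["A", "B", "C", "D", "E"]
random.seed(0)  # fixed seed so the picks are reproducible

# One bootstrap sample per tree: same size as the data, drawn WITH
# replacement, so some rows can repeat and others can be left out entirely.
sample_tree1 = random.choices(rows, k=len(rows))
sample_tree2 = random.choices(rows, k=len(rows))

print(sample_tree1)
print(sample_tree2)
```

Run it a few times with different seeds and you will see each "tree" getting a different view of the same five rows.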

Animation: Bootstrap sampling (dots = rows; they light up when “picked”)


Each tree gets a random subset of rows (with replacement). The animation suggests different rows being picked for different trees.

1.4 The “random” in Random Forest

Two sources of randomness:

  1. Random data (bootstrap): Each tree is trained on a bootstrap sample of the training set.
  2. Random features: At every split, the algorithm only considers a random subset of the features (e.g. 80% or √p). So Tree 1 might split first on “safety”, Tree 2 on “persons”, etc. That makes trees more diverse and reduces overfitting.
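The second source of randomness can be sketched the same way. Below, the √p rule is used just as an example (sklearn's `max_features` parameter controls this; the feature names are the car columns):

```python
import math
import random

features = ["buying", "maint", "doors", "persons", "lug_boot", "safety"]
random.seed(1)

# At each split a tree only considers a random subset of features.
# A common default for classification is sqrt(p); here p = 6, so k = 2.
k = max(1, round(math.sqrt(len(features))))
split_1 = random.sample(features, k)  # candidates for one split
split_2 = random.sample(features, k)  # a different split sees different ones

print(k, split_1, split_2)
```

Because different splits (and different trees) see different feature subsets, no single dominant feature can hijack every tree, which is what makes the trees diverse.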

1.5 How voting works

For a new car, each tree in the forest outputs one class (e.g. unacc, acc, good, vgood). We count the votes; the class with the most votes wins. (For regression we average the numeric predictions.)

Animation: Five trees voting (e.g. acc, acc, good, acc, acc → majority = acc)

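The five-tree vote from the caption takes three lines with the standard library:

```python
from collections import Counter

# Each tree's predicted class for one new car (the example from the text)
votes = ["acc", "acc", "good", "acc", "acc"]

# Majority vote: the most common class wins
winner, count = Counter(votes).most_common(1)[0]
print(winner, count)  # -> acc 4
```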

1.6 High-level flow (animated)

📊 Data → 🌲 Tree 1, 🌲 Tree 2, 🌲 …, 🌲 Tree N → 🗳️ Vote → ✅ Prediction

1.7 Glossary (every term in one place)

📖 Terms you must know

Root node
The top of the tree; the first question.
Split
Using one feature (e.g. safety = high?) to divide data into two groups.
Leaf
End of the tree; no more splits; the prediction for that path.
Bootstrap
Drawing a random sample with replacement from the training set.
Bagging
Training many models on bootstrap samples and combining their predictions (e.g. voting).
max_depth
Maximum number of levels in each tree. Deeper = more complex, risk of overfitting.
max_features
How many (or what fraction of) features to consider at each split. Smaller = more randomness, more diversity.
max_samples
Fraction (or count) of rows each tree uses. Less than 1.0 = each tree sees a subset (bootstrap).
n_estimators
Number of trees in the forest. More = often better accuracy but slower.
Confusion matrix
Table: rows = true class, columns = predicted. Diagonal = correct; off-diagonal = errors.
Precision
Of all we predicted as class X, how many were really X? (Correct X / Predicted X)
Recall
Of all real X, how many did we predict as X? (Correct X / Actual X)
F1 score
Harmonic mean of precision and recall; one number for “how good” for that class.
Feature importance
How much each feature helped when splitting (e.g. total Gini decrease); higher = more important.

Part 2: The car evaluation dataset – every column explained

The dataset has 1728 rows (cars) and 7 columns. Every column is categorical (text). There are no missing values. Here is what each column means and what values it can take.

Column     Meaning                        Possible values
buying     Buying price (initial cost)    vhigh, high, med, low
maint      Maintenance cost (yearly)      vhigh, high, med, low
doors      Number of doors                2, 3, 4, 5more
persons    Capacity (how many people)     2, 4, more
lug_boot   Luggage boot size              small, med, big
safety     Estimated safety level         low, med, high
class      Target: acceptability          unacc, acc, good, vgood

👶 Why we need to encode (turn text into numbers)

The model can’t do math on words like “vhigh” or “high”. So we turn each category into dummy columns: one column per value, with 0 or 1. For example, buying becomes buying_vhigh, buying_high, buying_med, buying_low. For a row with buying = high we put 0, 1, 0, 0. That’s what get_dummies does.
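Here is the encoding on a tiny toy DataFrame (three made-up rows, not the real dataset), so you can see exactly what get_dummies produces:

```python
import pandas as pd

# Toy column with three of the possible buying values
toy = pd.DataFrame({"buying": ["high", "low", "vhigh"]})

# One 0/1 column per category value
encoded = pd.get_dummies(toy)
print(encoded)
```

Row 0 (buying = high) gets a 1 in buying_high and 0 elsewhere, just as described above. Note that a value never seen in the data (here "med") simply gets no column.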

Step 1: Imports – what every line is for

We load the libraries we need: one for data (pandas), one for splitting and the model (sklearn), one for evaluation metrics, and we hide warnings so the output is clean.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
import warnings
warnings.filterwarnings("ignore")

Line-by-line (assume you know nothing)

  • import pandas as pd — Pandas is the library for tables (DataFrames). We use it to read CSV files and work with rows/columns. pd is just a short name.
  • train_test_split — A function that splits our data into two parts: one for training the model (e.g. 80%) and one for testing it (e.g. 20%). We never train on the test set so we can measure real performance.
  • RandomForestClassifier — The actual model: many decision trees that vote. We’ll create it with parameters like max_depth and n_estimators.
  • confusion_matrix — Builds a table: true class vs predicted class. Rows = true, columns = predicted. Diagonal = correct predictions.
  • accuracy_score — (Number of correct predictions) ÷ (Total predictions). One number, e.g. 0.97 = 97%.
  • classification_report — For each class it prints precision, recall, F1, and support (how many samples in that class).
  • warnings.filterwarnings("ignore") — Stops Python from printing warning messages so we can focus on our own prints. Optional.

Step 2: Load and explore the data

We read the CSV into a DataFrame, then look at the first rows and the shape of the data. We also count how many cars fall in each class so we know if the dataset is balanced or not.

# Predict car class: unacc / acc / good / vgood
data = pd.read_csv("cars_data.csv")
data.head(10)

What each line does

  • pd.read_csv("cars_data.csv") — Opens the file and builds a table. Column names come from the first row. Make sure the file is in the same folder as your script (or use the full path).
  • data.head(10) — Shows the first 10 rows. You’ll see values like vhigh, high, 4, more, big, high, unacc. This helps you check that the file loaded correctly and see the format.

Next we check how many rows and columns we have, and how many cars are in each class:

data.info()
data['class'].value_counts()

👶 In simple terms

  • data.info() — Prints: 1728 rows, 7 columns, all columns are “object” (text). It also shows if there are missing values (here there are none).
  • data['class'].value_counts() — Counts how many cars are in each class. Typical output: unacc 1210, acc 384, good 69, vgood 65. So most cars are “unacceptable”; the dataset is imbalanced. The model might be better at predicting unacc than vgood because there are fewer vgood examples.
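To see the imbalance as fractions rather than raw counts, value_counts takes a normalize flag. A sketch using a toy Series rebuilt from the counts quoted above (so it runs without the CSV):

```python
import pandas as pd

# Toy stand-in built from the counts in the text:
# unacc 1210, acc 384, good 69, vgood 65 (1728 rows total)
classes = pd.Series(
    ["unacc"] * 1210 + ["acc"] * 384 + ["good"] * 69 + ["vgood"] * 65
)

print(classes.value_counts())                # raw counts per class
print(classes.value_counts(normalize=True))  # fractions: unacc is ~70%
```

On the real data you would call data['class'].value_counts(normalize=True) the same way. Seeing that ~70% of cars are unacc tells you a model that always predicts unacc would already score ~70% accuracy, which is why accuracy alone is not enough here.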

Step 3: Encode categories and split into train/test

We set y = the column we want to predict (class), and X = all other columns. Then we turn X from text into numbers with get_dummies. After that we split X and y into train and test sets (e.g. 80% / 20%) so we can train on one part and evaluate on the other.

y = data["class"]
X = data.drop('class', axis=1)
X = pd.get_dummies(X)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)
print("Train size:", len(X_train), "Test size:", len(X_test))

What each part does

  • y = data["class"] — Our target: the label we want to predict (unacc, acc, good, vgood).
  • X = data.drop('class', axis=1) — All columns except class. axis=1 means “drop a column”. So X has buying, maint, doors, persons, lug_boot, safety.
  • pd.get_dummies(X) — Converts every categorical column into 0/1 columns. Example: safety becomes safety_low, safety_med, safety_high. For a row with safety = high we get 0, 0, 1. After this, X has only numbers; the number of columns increases (one per category value).
  • train_test_split(X, y, test_size=0.20, random_state=0) — Randomly splits rows: 80% go to X_train and y_train, 20% to X_test and y_test. random_state=0 makes the split reproducible (same every run). We use the train part to fit the model and the test part only to evaluate.
  • len(X_train) — Number of training rows (e.g. 1382). len(X_test) — number of test rows (e.g. 346).

Step 4: Build and train the Random Forest – every parameter explained

We create a RandomForestClassifier and set its hyperparameters. Then we call fit so it learns from the training data. Below we explain every parameter in plain English and what happens if you change it.

model = RandomForestClassifier(
    random_state=0,
    n_estimators=100,       # 100 trees (default)
    max_depth=5,           # Each tree max 5 levels deep
    min_samples_split=0.01, # Min fraction of samples to split a node
    max_features=0.8,      # Use 80% of features per split
    max_samples=0.8        # Each tree sees 80% of rows (bootstrap)
)
model.fit(X_train, y_train)

Parameter guide (ultra detail)

  • random_state=0 — Seeds the random number generator so bootstrap samples and feature subsets are the same every run. Good for reproducibility.
  • n_estimators=100 — Number of trees. More trees usually improve accuracy up to a point, then plateau. More = slower to train and predict. 100 is a safe default.
  • max_depth=5 — No tree can grow deeper than 5 levels. Shallow trees underfit; very deep trees overfit. 5 is a reasonable balance. If you don’t set it, trees can grow until pure (high overfitting risk).
  • min_samples_split=0.01 — A node is only split if it has at least 1% of the (training) samples. So we don’t split tiny groups. Can be an integer (e.g. 2) or a float (fraction).
  • max_features=0.8 — At each split we only consider 80% of the features (chosen at random). So each tree gets different “views” and we add diversity. Smaller = more randomness, often better generalization.
  • max_samples=0.8 — Each tree is trained on 80% of the training rows (drawn with replacement). So each tree sees a different bootstrap sample. This is the “bagging” part.
  • model.fit(X_train, y_train) — Trains all trees on the training data. After this, the model can predict the class for any new row (e.g. from X_test).
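After fit, the trained trees live in model.estimators_, so you can watch the voting yourself. A minimal sketch on synthetic data (make_classification stands in for the encoded car features; the parameter values are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data standing in for the encoded car features (not the real dataset)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

model = RandomForestClassifier(n_estimators=5, max_depth=3, random_state=0)
model.fit(X, y)

# model.estimators_ holds the individual trees; each predicts on its own.
# (A tree returns an index into model.classes_; with labels 0/1 they match.)
tree_votes = [int(tree.predict(X[:1])[0]) for tree in model.estimators_]
forest_pred = int(model.predict(X[:1])[0])
print(tree_votes, "->", forest_pred)
```

One nuance: sklearn's forest actually averages the trees' class probabilities rather than counting hard votes, but for intuition the result usually coincides with the majority vote.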

Step 5: Evaluate – confusion matrix, accuracy, precision, recall

We predict on the test set, then compare predictions to the true labels. We use a confusion matrix to see where we’re right and wrong, and we use accuracy and the classification report (precision, recall, F1) to summarize performance.

y_pred = model.predict(X_test)

print("Confusion matrix:")
print(confusion_matrix(y_test, y_pred))

print("\nAccuracy:", accuracy_score(y_test, y_pred))

print("\nClassification report:")
print(classification_report(y_test, y_pred))

What each part does

  • model.predict(X_test) — For each row in X_test, every tree votes; we take the majority class. Result is a list of predicted classes (one per test row).
  • confusion_matrix(y_test, y_pred) — Rows = true class, columns = predicted class. So cell (i, j) = “how many true class i were predicted as class j”. Diagonal = correct (true i, predicted i). Off-diagonal = errors. Example: (unacc, acc) = “how many truly unacc were predicted as acc”.
  • accuracy_score(y_test, y_pred) — (Total correct) ÷ (Total test rows). Single number, e.g. 0.97. Easy to understand but can be misleading if classes are imbalanced (e.g. predicting “unacc” for everyone can still give high accuracy).
  • classification_report(y_test, y_pred) — For each class it computes:
    • Precision: Of all we predicted as that class, how many were correct? (e.g. “When we said acc, we were right 90% of the time.”)
    • Recall: Of all that are truly that class, how many did we predict? (e.g. “We found 85% of all acc cars.”)
    • F1-score: Harmonic mean of precision and recall; balances both.
    • Support: Number of test samples in that class.
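These definitions are easy to verify by hand. A minimal sketch with six toy labels (illustrative, not the real test set):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Toy true/predicted labels
y_true = ["acc", "acc", "acc", "unacc", "unacc", "good"]
y_pred = ["acc", "acc", "unacc", "unacc", "unacc", "acc"]

# Accuracy = correct / total = 4/6
print(accuracy_score(y_true, y_pred))

# Precision for "acc": predicted acc 3 times, 2 were right -> 2/3
print(precision_score(y_true, y_pred, labels=["acc"], average=None)[0])

# Recall for "acc": 3 cars are truly acc, we found 2 -> 2/3
print(recall_score(y_true, y_pred, labels=["acc"], average=None)[0])
```

Counting the same numbers on paper and matching them against sklearn's output is the quickest way to internalize precision vs recall.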

Step 6: Which features matter most? (Feature importance)

Random Forest gives us feature_importances_: a number per feature saying how much that feature was used to make splits (e.g. total decrease in Gini impurity). Higher = more important. We put these in a table and sort so we can see which columns (e.g. safety_high, persons_4) drive the prediction most.

importance = pd.DataFrame({
    'feature': X_train.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)
print(importance.head(10))

👶 In simple terms

  • model.feature_importances_ — After training, the model has one number per feature. Bigger number = that feature was chosen more often for splits and reduced impurity more. So “safety_high” might be 0.25, “persons_4” 0.18, etc.
  • We build a small table (feature name + importance), sort by importance descending, and print the top 10. That tells you “what the model really uses” to decide unacc/acc/good/vgood. Useful for explaining the model to others and for feature selection.

Optional: Tuning with GridSearchCV

Instead of picking hyperparameters by hand, we can let sklearn try many combinations and keep the one with the best cross-validation score. GridSearchCV does that: you give a list of values for each parameter (e.g. max_depth: 3, 5, 7, 10), and it trains and evaluates every combination (in the grid below, 4 × 3 × 3 × 2 = 72 combinations, each fit 5 times under 5-fold CV). Slower, but often finds better settings.

from sklearn.model_selection import GridSearchCV

param_grid = [
    {'max_depth': [3, 5, 7, 10],
     'min_samples_split': [0.01, 0.05, 0.1],
     'max_features': [0.7, 0.8, 1.0],
     'max_samples': [0.8, None]}  # None = use all rows (1.0 errors on older sklearn)
]
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5, verbose=1)
search.fit(X_train, y_train)
print("Best params:", search.best_params_)
print("Best score:", search.best_score_)

👶 In simple terms

  • param_grid — A dictionary (or list of dicts) mapping parameter names to lists of values. GridSearchCV tries every combination: max_depth=3 with max_features=0.7, then 3 with 0.8, … then 5 with 0.7, and so on.
  • cv=5 — 5-fold cross-validation: the training set is split into 5 parts; each combination is trained 5 times (each time using 4 parts for training, 1 for validation) and the average score is kept.
  • best_params_ — The combination with the best average score; best_score_ is that score.
  • Final step — Build a model with RandomForestClassifier(**search.best_params_), fit it on the full training set, then evaluate on the held-out test set.
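The "refit the winner" step looks like this end to end. A minimal sketch on synthetic data with a deliberately tiny grid so it runs in seconds (make_classification stands in for the encoded car features):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Toy data standing in for the encoded car features (not the real dataset)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Tiny grid for speed; a real grid would be larger, like the one above
search = GridSearchCV(
    RandomForestClassifier(random_state=0, n_estimators=50),
    {"max_depth": [3, 5]},
    cv=5,
)
search.fit(X_tr, y_tr)

# Rebuild a final model from the winning combination, then test it once
final = RandomForestClassifier(random_state=0, n_estimators=50,
                               **search.best_params_)
final.fit(X_tr, y_tr)
print("best:", search.best_params_, "test accuracy:", final.score(X_te, y_te))
```

Note the discipline: the grid search only ever sees the training rows; the held-out test set is touched exactly once, at the very end.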

Summary

Step      What we did
1         Imported pandas, sklearn (train_test_split, RandomForestClassifier, confusion_matrix, accuracy_score, classification_report), and hid warnings.
2         Loaded cars_data.csv, used head(10), info(), and value_counts() to explore rows, columns, and class balance.
3         Set y = class, X = other columns; encoded X with get_dummies(); split into X_train, X_test, y_train, y_test (80/20) with train_test_split.
4         Built RandomForestClassifier (n_estimators=100, max_depth=5, max_features=0.8, max_samples=0.8, etc.) and called fit(X_train, y_train).
5         Computed y_pred = model.predict(X_test); printed confusion_matrix, accuracy_score, and classification_report.
6         Built a DataFrame of feature names and feature_importances_, sorted by importance, and printed the top 10.
Optional  Used GridSearchCV with a param_grid and cv=5 to find the best hyperparameters; printed best_params_ and best_score_.

You now have a full Random Forest pipeline with theory, dataset explanation, and every line and parameter explained. Next: try the same steps on the Decision Trees & Random Forests lesson with the Iris dataset, or move on to Boosting!