🌲 Random Forest – Ultra-Detailed Guide & Car Evaluation

From zero to hero: what Random Forest is, why we use it, every idea explained like you know nothing, plus a full code walkthrough with the car dataset. With animations!

📥 Download the dataset first: cars_data.csv — Save it in the same folder as your script so pd.read_csv("cars_data.csv") works.

What are we building?

👶 In plain English (assume you know nothing)

Imagine you run a used-car website. For every car you have a form: buying price (very high / high / medium / low), maintenance cost, number of doors, how many people it fits, luggage size, and safety rating. You want the computer to predict the class: is this car unacc (unacceptable), acc (acceptable), good, or vgood (very good)?

Random Forest = many decision trees voting together. Each tree is like a small quiz: “Is safety high?” → Yes. “Does it fit 4+ people?” → Yes. … At the end each tree says one class. We count votes: if 70 trees say “acc” and 30 say “good”, we pick “acc”. That’s it!

We’ll also see which columns matter most (e.g. safety, number of persons) and how to measure how good the model is (confusion matrix, accuracy, precision, recall).

Part 1: Random Forest theory (ultra detail)

1.1 What is a decision tree (quick recap)

A decision tree is a flowchart of yes/no questions. You start at the top (root), answer each question, follow the branch, and when you reach a leaf you get a prediction (e.g. “unacc” or “acc”). The algorithm learns which questions to ask and in what order from the data by choosing splits that make each group as “pure” as possible (e.g. measured by Gini impurity or entropy).

👶 Layman example

“Is safety = high?” → Yes. “persons ≥ 4?” → Yes. “buying = low?” → No. → Predict acc. So the tree is just a bunch of rules you can follow by hand.
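These hand-followable rules are exactly what a tree encodes. A minimal sketch in Python (the rules and thresholds here are made up for illustration, not learned from the dataset):

```python
def tiny_tree(safety, persons, buying):
    """Toy hand-written decision tree (illustrative rules, not a trained model)."""
    if safety != "high":
        return "unacc"   # unsafe cars are rejected outright
    if persons < 4:
        return "unacc"   # must fit at least 4 people
    if buying == "low":
        return "good"    # safe, roomy, and cheap
    return "acc"         # safe and roomy, but pricey

# The example path from the text: safety=high, persons>=4, buying!=low
print(tiny_tree("high", 4, "high"))  # -> acc
```

A real tree learned by sklearn has the same if/else shape; the algorithm just chooses the questions and their order automatically.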

1.2 Why not use just one tree?

One deep tree often overfits: it memorizes the training data (including noise) and does worse on new data. Small changes in the data can change the tree a lot (high variance). So we want many trees that each see slightly different data and different features; when they vote, their individual mistakes tend to cancel out.

👶 Layman example

Asking one strict friend “should I buy this car?” might be biased. Asking 100 friends and taking the majority vote is usually more reliable. Random Forest is that: many “tree friends” voting.

1.3 Bagging (Bootstrap Aggregating)

Bagging = build many models each on a random sample of the training data, then combine their predictions (e.g. by voting). Random Forest is bagging applied to decision trees, with one extra twist: at each split inside a tree we only consider a random subset of features.

👶 What “with replacement” means

Imagine 5 rows: A, B, C, D, E. We pick 5 at random, but after each pick we put the row back. So we might get A, A, C, D, E—row A twice, B never. That’s one bootstrap sample. Another tree might get B, B, C, D, D. So each tree trains on a different “view” of the data.
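The A/B/C/D/E example above is easy to reproduce with the standard library. A minimal sketch (the seed and the resulting picks are arbitrary):

```python
import random

rows = ["A", "B", "C", "D", "E"]
random.seed(0)  # fixed seed so the picks are reproducible

# One bootstrap sample per tree: same size as the data, drawn WITH
# replacement, so some rows can repeat and others can be left out entirely.
sample_tree1 = random.choices(rows, k=len(rows))
sample_tree2 = random.choices(rows, k=len(rows))

print(sample_tree1)
print(sample_tree2)
```

Run it a few times with different seeds and you will see each "tree" getting a different view of the same five rows.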

Animation: Bootstrap sampling (dots = rows; they light up when “picked”)


Each tree gets a random subset of rows (with replacement). The animation suggests different rows being picked for different trees.

1.4 The “random” in Random Forest

Two sources of randomness:

  1. Random data (bootstrap): Each tree is trained on a bootstrap sample of the training set.
  2. Random features: At every split, the algorithm only considers a random subset of the features (e.g. 80% or √p). So Tree 1 might split first on “safety”, Tree 2 on “persons”, etc. That makes trees more diverse and reduces overfitting.
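The second source of randomness can be sketched the same way. Below, the √p rule is used just as an example (sklearn's `max_features` parameter controls this; the feature names are the car columns):

```python
import math
import random

features = ["buying", "maint", "doors", "persons", "lug_boot", "safety"]
random.seed(1)

# At each split a tree only considers a random subset of features.
# A common default for classification is sqrt(p); here p = 6, so k = 2.
k = max(1, round(math.sqrt(len(features))))
split_1 = random.sample(features, k)  # candidates for one split
split_2 = random.sample(features, k)  # a different split sees different ones

print(k, split_1, split_2)
```

Because different splits (and different trees) see different feature subsets, no single dominant feature can hijack every tree, which is what makes the trees diverse.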

1.5 How voting works

For a new car, each tree in the forest outputs one class (e.g. unacc, acc, good, vgood). We count the votes; the class with the most votes wins. (For regression we average the numeric predictions.)

Animation: Five trees voting (e.g. acc, acc, good, acc, acc → majority = acc)

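The five-tree vote from the caption takes three lines with the standard library:

```python
from collections import Counter

# Each tree's predicted class for one new car (the example from the text)
votes = ["acc", "acc", "good", "acc", "acc"]

# Majority vote: the most common class wins
winner, count = Counter(votes).most_common(1)[0]
print(winner, count)  # -> acc 4
```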

1.6 High-level flow (animated)

📊 Data → 🌲 Tree 1, 🌲 Tree 2, 🌲 …, 🌲 Tree N → 🗳️ Vote → ✅ Prediction

1.7 Glossary (every term in one place)

📖 Terms you must know

Root node
The top of the tree; the first question.
Split
Using one feature (e.g. safety = high?) to divide data into two groups.
Leaf
End of the tree; no more splits; the prediction for that path.
Bootstrap
Drawing a random sample with replacement from the training set.
Bagging
Training many models on bootstrap samples and combining their predictions (e.g. voting).
max_depth
Maximum number of levels in each tree. Deeper = more complex, risk of overfitting.
max_features
How many (or what fraction of) features to consider at each split. Smaller = more randomness, more diversity.
max_samples
Fraction (or count) of rows each tree uses. Less than 1.0 = each tree sees a subset (bootstrap).
n_estimators
Number of trees in the forest. More = often better accuracy but slower.
Confusion matrix
Table: rows = true class, columns = predicted. Diagonal = correct; off-diagonal = errors.
Precision
Of all we predicted as class X, how many were really X? (Correct X / Predicted X)
Recall
Of all real X, how many did we predict as X? (Correct X / Actual X)
F1 score
Harmonic mean of precision and recall; one number for “how good” for that class.
Feature importance
How much each feature helped when splitting (e.g. total Gini decrease); higher = more important.

Part 2: The car evaluation dataset – every column explained

The dataset has 1728 rows (cars) and 7 columns. Every column is categorical (text). There are no missing values. Here is what each column means and what values it can take.

Column     Meaning                        Possible values
buying     Buying price (initial cost)    vhigh, high, med, low
maint      Maintenance cost (yearly)      vhigh, high, med, low
doors      Number of doors                2, 3, 4, 5more
persons    Capacity (how many people)     2, 4, more
lug_boot   Luggage boot size              small, med, big
safety     Estimated safety level         low, med, high
class      Target: acceptability          unacc, acc, good, vgood

👶 Why we need to encode (turn text into numbers)

The model can’t do math on words like “vhigh” or “high”. So we turn each category into dummy columns: one column per value, with 0 or 1. For example, buying becomes buying_vhigh, buying_high, buying_med, buying_low. For a row with buying = high we put 0, 1, 0, 0. That’s what get_dummies does.
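Here is the encoding on a tiny toy DataFrame (three made-up rows, not the real dataset), so you can see exactly what get_dummies produces:

```python
import pandas as pd

# Toy column with three of the possible buying values
toy = pd.DataFrame({"buying": ["high", "low", "vhigh"]})

# One 0/1 column per category value
encoded = pd.get_dummies(toy)
print(encoded)
```

Row 0 (buying = high) gets a 1 in buying_high and 0 elsewhere, just as described above. Note that a value never seen in the data (here "med") simply gets no column.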

Step 1: Imports – what every line is for

We load the libraries we need: one for data (pandas), one for splitting and the model (sklearn), one for evaluation metrics, and we hide warnings so the output is clean.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
import warnings
warnings.filterwarnings("ignore")

Line-by-line (assume you know nothing)

  • import pandas as pd — Pandas is the library for tables (DataFrames). We use it to read CSV files and work with rows/columns. pd is just a short name.
  • train_test_split — A function that splits our data into two parts: one for training the model (e.g. 80%) and one for testing it (e.g. 20%). We never train on the test set so we can measure real performance.
  • RandomForestClassifier — The actual model: many decision trees that vote. We’ll create it with parameters like max_depth and n_estimators.
  • confusion_matrix — Builds a table: true class vs predicted class. Rows = true, columns = predicted. Diagonal = correct predictions.
  • accuracy_score — (Number of correct predictions) ÷ (Total predictions). One number, e.g. 0.97 = 97%.
  • classification_report — For each class it prints precision, recall, F1, and support (how many samples in that class).
  • warnings.filterwarnings("ignore") — Stops Python from printing warning messages so we can focus on our own prints. Optional.

Step 2: Load and explore the data

We read the CSV into a DataFrame, then look at the first rows and the shape of the data. We also count how many cars fall in each class so we know if the dataset is balanced or not.

# Predict car class: unacc / acc / good / vgood
data = pd.read_csv("cars_data.csv")
data.head(10)

What each line does

  • pd.read_csv("cars_data.csv") — Opens the file and builds a table. Column names come from the first row. Make sure the file is in the same folder as your script (or use the full path).
  • data.head(10) — Shows the first 10 rows. You’ll see values like vhigh, high, 4, more, big, high, unacc. This helps you check that the file loaded correctly and see the format.

Next we check how many rows and columns we have, and how many cars are in each class:

data.info()
data['class'].value_counts()

👶 In simple terms

  • data.info() — Prints: 1728 rows, 7 columns, all columns are “object” (text). It also shows if there are missing values (here there are none).
  • data['class'].value_counts() — Counts how many cars are in each class. Typical output: unacc 1210, acc 384, good 69, vgood 65. So most cars are “unacceptable”; the dataset is imbalanced. The model might be better at predicting unacc than vgood because there are fewer vgood examples.
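To see the imbalance as fractions rather than raw counts, value_counts takes a normalize flag. A sketch using a toy Series rebuilt from the counts quoted above (so it runs without the CSV):

```python
import pandas as pd

# Toy stand-in built from the counts in the text:
# unacc 1210, acc 384, good 69, vgood 65 (1728 rows total)
classes = pd.Series(
    ["unacc"] * 1210 + ["acc"] * 384 + ["good"] * 69 + ["vgood"] * 65
)

print(classes.value_counts())                # raw counts per class
print(classes.value_counts(normalize=True))  # fractions: unacc is ~70%
```

On the real data you would call data['class'].value_counts(normalize=True) the same way. Seeing that ~70% of cars are unacc tells you a model that always predicts unacc would already score ~70% accuracy, which is why accuracy alone is not enough here.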

Step 3: Encode categories and split into train/test

We set y = the column we want to predict (class), and X = all other columns. Then we turn X from text into numbers with get_dummies. After that we split X and y into train and test sets (e.g. 80% / 20%) so we can train on one part and evaluate on the other.

y = data["class"]
X = data.drop('class', axis=1)
X = pd.get_dummies(X)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)
print("Train size:", len(X_train), "Test size:", len(X_test))

What each part does

  • y = data["class"] — Our target: the label we want to predict (unacc, acc, good, vgood).
  • X = data.drop('class', axis=1) — All columns except class. axis=1 means “drop a column”. So X has buying, maint, doors, persons, lug_boot, safety.
  • pd.get_dummies(X) — Converts every categorical column into 0/1 columns. Example: safety becomes safety_low, safety_med, safety_high. For a row with safety = high we get 0, 0, 1. After this, X has only numbers; the number of columns increases (one per category value).
  • train_test_split(X, y, test_size=0.20, random_state=0) — Randomly splits rows: 80% go to X_train and y_train, 20% to X_test and y_test. random_state=0 makes the split reproducible (same every run). We use the train part to fit the model and the test part only to evaluate.
  • len(X_train) — Number of training rows (e.g. 1382). len(X_test) — number of test rows (e.g. 346).

Step 4: Build and train the Random Forest – every parameter explained

We create a RandomForestClassifier and set its hyperparameters. Then we call fit so it learns from the training data. Below we explain every parameter in plain English and what happens if you change it.

model = RandomForestClassifier(
    random_state=0,
    n_estimators=100,       # 100 trees (default)
    max_depth=5,           # Each tree max 5 levels deep
    min_samples_split=0.01, # Min fraction of samples to split a node
    max_features=0.8,      # Use 80% of features per split
    max_samples=0.8        # Each tree sees 80% of rows (bootstrap)
)
model.fit(X_train, y_train)

Parameter guide (ultra detail)

  • random_state=0 — Seeds the random number generator so bootstrap samples and feature subsets are the same every run. Good for reproducibility.
  • n_estimators=100 — Number of trees. More trees usually improve accuracy up to a point, then plateau. More = slower to train and predict. 100 is a safe default.
  • max_depth=5 — No tree can grow deeper than 5 levels. Shallow trees underfit; very deep trees overfit. 5 is a reasonable balance. If you don’t set it, trees can grow until pure (high overfitting risk).
  • min_samples_split=0.01 — A node is only split if it has at least 1% of the (training) samples. So we don’t split tiny groups. Can be an integer (e.g. 2) or a float (fraction).
  • max_features=0.8 — At each split we only consider 80% of the features (chosen at random). So each tree gets different “views” and we add diversity. Smaller = more randomness, often better generalization.
  • max_samples=0.8 — Each tree is trained on 80% of the training rows (drawn with replacement). So each tree sees a different bootstrap sample. This is the “bagging” part.
  • model.fit(X_train, y_train) — Trains all trees on the training data. After this, the model can predict the class for any new row (e.g. from X_test).
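After fit, the trained trees live in model.estimators_, so you can watch the voting yourself. A minimal sketch on synthetic data (make_classification stands in for the encoded car features; the parameter values are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data standing in for the encoded car features (not the real dataset)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

model = RandomForestClassifier(n_estimators=5, max_depth=3, random_state=0)
model.fit(X, y)

# model.estimators_ holds the individual trees; each predicts on its own.
# (A tree returns an index into model.classes_; with labels 0/1 they match.)
tree_votes = [int(tree.predict(X[:1])[0]) for tree in model.estimators_]
forest_pred = int(model.predict(X[:1])[0])
print(tree_votes, "->", forest_pred)
```

One nuance: sklearn's forest actually averages the trees' class probabilities rather than counting hard votes, but for intuition the result usually coincides with the majority vote.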

Step 5: Evaluate – confusion matrix, accuracy, precision, recall

We predict on the test set, then compare predictions to the true labels. We use a confusion matrix to see where we’re right and wrong, and we use accuracy and the classification report (precision, recall, F1) to summarize performance.

y_pred = model.predict(X_test)

print("Confusion matrix:")
print(confusion_matrix(y_test, y_pred))

print("\nAccuracy:", accuracy_score(y_test, y_pred))

print("\nClassification report:")
print(classification_report(y_test, y_pred))

What each part does

  • model.predict(X_test) — For each row in X_test, every tree votes; we take the majority class. Result is a list of predicted classes (one per test row).
  • confusion_matrix(y_test, y_pred) — Rows = true class, columns = predicted class. So cell (i, j) = “how many true class i were predicted as class j”. Diagonal = correct (true i, predicted i). Off-diagonal = errors. Example: (unacc, acc) = “how many truly unacc were predicted as acc”.
  • accuracy_score(y_test, y_pred) — (Total correct) ÷ (Total test rows). Single number, e.g. 0.97. Easy to understand but can be misleading if classes are imbalanced (e.g. predicting “unacc” for everyone can still give high accuracy).
  • classification_report(y_test, y_pred) — For each class it computes:
    • Precision: Of all we predicted as that class, how many were correct? (e.g. “When we said acc, we were right 90% of the time.”)
    • Recall: Of all that are truly that class, how many did we predict? (e.g. “We found 85% of all acc cars.”)
    • F1-score: Harmonic mean of precision and recall; balances both.
    • Support: Number of test samples in that class.
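These definitions are easy to verify by hand. A minimal sketch with six toy labels (illustrative, not the real test set):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Toy true/predicted labels
y_true = ["acc", "acc", "acc", "unacc", "unacc", "good"]
y_pred = ["acc", "acc", "unacc", "unacc", "unacc", "acc"]

# Accuracy = correct / total = 4/6
print(accuracy_score(y_true, y_pred))

# Precision for "acc": predicted acc 3 times, 2 were right -> 2/3
print(precision_score(y_true, y_pred, labels=["acc"], average=None)[0])

# Recall for "acc": 3 cars are truly acc, we found 2 -> 2/3
print(recall_score(y_true, y_pred, labels=["acc"], average=None)[0])
```

Counting the same numbers on paper and matching them against sklearn's output is the quickest way to internalize precision vs recall.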

Step 6: Which features matter most? (Feature importance)

Random Forest gives us feature_importances_: a number per feature saying how much that feature was used to make splits (e.g. total decrease in Gini impurity). Higher = more important. We put these in a table and sort so we can see which columns (e.g. safety_high, persons_4) drive the prediction most.

importance = pd.DataFrame({
    'feature': X_train.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)
print(importance.head(10))

👶 In simple terms

  • model.feature_importances_ — After training, the model has one number per feature. Bigger number = that feature was chosen more often for splits and reduced impurity more. So “safety_high” might be 0.25, “persons_4” 0.18, etc.
  • We build a small table (feature name + importance), sort by importance descending, and print the top 10. That tells you “what the model really uses” to decide unacc/acc/good/vgood. Useful for explaining the model to others and for feature selection.

Optional: Tuning with GridSearchCV

Instead of picking hyperparameters by hand, we can let sklearn try many combinations and keep the one with the best cross-validation score. GridSearchCV does that: you give a list of values for each parameter (e.g. max_depth: 3, 5, 7, 10), and it trains and evaluates every combination (in the grid below, 4 × 3 × 3 × 2 = 72 combinations, each fit 5 times under 5-fold CV). Slower, but often finds better settings.

from sklearn.model_selection import GridSearchCV

param_grid = [
    {'max_depth': [3, 5, 7, 10],
     'min_samples_split': [0.01, 0.05, 0.1],
     'max_features': [0.7, 0.8, 1.0],
     'max_samples': [0.8, None]}  # None = use all rows (1.0 errors on older sklearn)
]
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5, verbose=1)
search.fit(X_train, y_train)
print("Best params:", search.best_params_)
print("Best score:", search.best_score_)

👶 In simple terms

  • param_grid — A dictionary (or list of dicts) mapping parameter names to lists of values. GridSearchCV tries every combination: max_depth=3 with max_features=0.7, then 3 with 0.8, … then 5 with 0.7, and so on.
  • cv=5 — 5-fold cross-validation: the training set is split into 5 parts; each combination is trained 5 times (each time using 4 parts for training, 1 for validation) and the average score is kept.
  • best_params_ — The combination with the best average score; best_score_ is that score.
  • Final step — Build a model with RandomForestClassifier(**search.best_params_), fit it on the full training set, then evaluate on the held-out test set.
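The "refit the winner" step looks like this end to end. A minimal sketch on synthetic data with a deliberately tiny grid so it runs in seconds (make_classification stands in for the encoded car features):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Toy data standing in for the encoded car features (not the real dataset)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Tiny grid for speed; a real grid would be larger, like the one above
search = GridSearchCV(
    RandomForestClassifier(random_state=0, n_estimators=50),
    {"max_depth": [3, 5]},
    cv=5,
)
search.fit(X_tr, y_tr)

# Rebuild a final model from the winning combination, then test it once
final = RandomForestClassifier(random_state=0, n_estimators=50,
                               **search.best_params_)
final.fit(X_tr, y_tr)
print("best:", search.best_params_, "test accuracy:", final.score(X_te, y_te))
```

Note the discipline: the grid search only ever sees the training rows; the held-out test set is touched exactly once, at the very end.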

Summary

Step      What we did
1         Imported pandas, sklearn (train_test_split, RandomForestClassifier, confusion_matrix, accuracy_score, classification_report), and hid warnings.
2         Loaded cars_data.csv, used head(10), info(), and value_counts() to explore rows, columns, and class balance.
3         Set y = class, X = other columns; encoded X with get_dummies(); split into X_train, X_test, y_train, y_test (80/20) with train_test_split.
4         Built RandomForestClassifier (n_estimators=100, max_depth=5, max_features=0.8, max_samples=0.8, etc.) and called fit(X_train, y_train).
5         Computed y_pred = model.predict(X_test); printed confusion_matrix, accuracy_score, and classification_report.
6         Built a DataFrame of feature names and feature_importances_, sorted by importance, and printed the top 10.
Optional  Used GridSearchCV with a param_grid and cv=5 to find the best hyperparameters; printed best_params_ and best_score_.

You now have a full Random Forest pipeline with theory, dataset explanation, and every line and parameter explained. Next: try the same steps on the Decision Trees & Random Forests lesson with the Iris dataset, or move on to Boosting!