From zero to hero: what Random Forest is, why we use it, every idea explained like you know nothing, plus a full code walkthrough with the car dataset. With animations!
Imagine you run a used-car website. For every car you have a form: buying price (very high / high / medium / low), maintenance cost, number of doors, how many people it fits, luggage size, and safety rating. You want the computer to predict the class: is this car unacc (unacceptable), acc (acceptable), good, or vgood (very good)?
Random Forest = many decision trees voting together. Each tree is like a small quiz: “Is safety high?” → Yes. “Does it fit 4+ people?” → Yes. … At the end each tree says one class. We count votes: if 70 trees say “acc” and 30 say “good”, we pick “acc”. That’s it!
We’ll also see which columns matter most (e.g. safety, number of persons) and how to measure how good the model is (confusion matrix, accuracy, precision, recall).
A decision tree is a flowchart of yes/no questions. You start at the top (root), answer each question, follow the branch, and when you reach a leaf you get a prediction (e.g. “unacc” or “acc”). The algorithm learns which questions to ask and in what order from the data by choosing splits that make each group as “pure” as possible (e.g. measured by Gini impurity or entropy).
“Is safety = high?” → Yes. “persons ≥ 4?” → Yes. “buying = low?” → No. → Predict acc. So the tree is just a bunch of rules you can follow by hand.
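A minimal sketch of those hand-followable rules as plain if/else code (the `vgood` leaf is invented here for illustration; a real tree learns its own questions and thresholds from the data):

```python
# The three example questions written as plain rules.
# The vgood branch is hypothetical, added only to show a second leaf.
def toy_tree(safety, persons, buying):
    if safety == "high":
        if persons >= 4:
            if buying == "low":
                return "vgood"  # hypothetical leaf
            return "acc"        # the path followed in the example above
    return "unacc"

print(toy_tree(safety="high", persons=4, buying="high"))  # → acc
```

Following the branches by hand gives the same answer the function returns, which is the whole point: a decision tree is just nested if/else.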
One deep tree often overfits: it memorizes the training data (including noise) and does worse on new data. Small changes in the data can change the tree a lot (high variance). So we want many trees that each see slightly different data and different features; when they vote, their individual mistakes tend to cancel out.
Asking one strict friend “should I buy this car?” might be biased. Asking 100 friends and taking the majority vote is usually more reliable. Random Forest is that: many “tree friends” voting.
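To see the overfitting claim concretely, here is a small experiment on synthetic data (not the car dataset): with noisy labels, a single fully-grown tree memorizes the training set, and a forest usually generalizes better.

```python
# Sketch: one deep tree vs. a forest on noisy synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# flip_y=0.1 flips 10% of the labels to simulate noise
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)       # fully grown
forest = RandomForestClassifier(n_estimators=100,
                                random_state=0).fit(X_tr, y_tr)

print("Tree   train/test:", tree.score(X_tr, y_tr), tree.score(X_te, y_te))
print("Forest train/test:", forest.score(X_tr, y_tr), forest.score(X_te, y_te))
```

The single tree typically hits 100% on the training set (it memorized the noise) while its test score drops; the forest's train/test gap is usually smaller.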
Bagging = build many models each on a random sample of the training data, then combine their predictions (e.g. by voting). Random Forest is bagging applied to decision trees, with one extra twist: at each split inside a tree we only consider a random subset of features.
Imagine 5 rows: A, B, C, D, E. We pick 5 at random, but after each pick we put the row back. So we might get A, A, C, D, E—row A twice, B never. That’s one bootstrap sample. Another tree might get B, B, C, D, D. So each tree trains on a different “view” of the data.
Animation: Bootstrap sampling (dots = rows; they light up when “picked”)
Each tree gets a random subset of rows (with replacement). The animation suggests different rows being picked for different trees.
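The bootstrap picture above can be sketched in a few lines of plain Python, using rows A–E as in the example:

```python
# Bootstrap sampling: draw 5 rows from [A..E] WITH replacement,
# so some rows repeat and others are left out entirely.
import random

random.seed(0)  # reproducible example
rows = ["A", "B", "C", "D", "E"]
sample = [random.choice(rows) for _ in range(len(rows))]
print(sample)  # some letters appear twice, some not at all
```

Run it a few times with different seeds and you will see each "tree" getting a different view of the same five rows.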
Two sources of randomness:
- **Rows:** each tree is trained on its own bootstrap sample of the training rows (drawn with replacement).
- **Features:** at each split inside a tree, only a random subset of the features is considered.
For a new car, each tree in the forest outputs one class (e.g. unacc, acc, good, vgood). We count the votes; the class with the most votes wins. (For regression we average the numeric predictions.)
Animation: Five trees voting (e.g. acc, acc, good, acc, acc → majority = acc)
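The vote count from the animation, in code:

```python
# Majority vote over five tree predictions.
from collections import Counter

votes = ["acc", "acc", "good", "acc", "acc"]
winner = Counter(votes).most_common(1)[0][0]
print(winner)  # → acc (4 votes vs. 1)
```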
The dataset has 1728 rows (cars) and 7 columns. Every column is categorical (text). There are no missing values. Here is what each column means and what values it can take.
| Column | Meaning | Possible values |
|---|---|---|
| buying | Buying price (initial cost) | vhigh, high, med, low |
| maint | Maintenance cost | vhigh, high, med, low |
| doors | Number of doors | 2, 3, 4, 5more |
| persons | Capacity (how many people) | 2, 4, more |
| lug_boot | Luggage boot size | small, med, big |
| safety | Estimated safety level | low, med, high |
| class | Target: acceptability | unacc, acc, good, vgood |
The model can’t do math on words like “vhigh” or “high”. So we turn each category into dummy columns: one column per value, with 0 or 1. For example, buying becomes buying_vhigh, buying_high, buying_med, buying_low. For a row with buying = high we put 0, 1, 0, 0. That’s what get_dummies does.
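A tiny demo of what `get_dummies` produces on a three-row `buying` column (recent pandas versions return True/False rather than 1/0, but the meaning is the same):

```python
# One-hot encoding with get_dummies: one 0/1 column per category value.
import pandas as pd

df = pd.DataFrame({"buying": ["high", "low", "vhigh"]})
dummies = pd.get_dummies(df)
print(dummies)  # columns: buying_high, buying_low, buying_vhigh
```

Each row has exactly one "on" value per original column, so no information is lost.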
We load the libraries we need: one for data (pandas), one for splitting and the model (sklearn), one for evaluation metrics, and we hide warnings so the output is clean.
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
import warnings

warnings.filterwarnings("ignore")
```
pd is just a short name. We read the CSV into a DataFrame, then look at the first rows and the shape of the data. We also count how many cars fall in each class so we know whether the dataset is balanced.
```python
# Predict car class: unacc / acc / good / vgood
data = pd.read_csv("cars_data.csv")
data.head(10)
```
Next we check how many rows and columns we have, and how many cars are in each class:
```python
data.info()
data['class'].value_counts()
```
We set y = the column we want to predict (class), and X = all other columns. Then we turn X from text into numbers with get_dummies. After that we split X and y into train and test sets (e.g. 80% / 20%) so we can train on one part and evaluate on the other.
```python
y = data["class"]
X = data.drop('class', axis=1)
X = pd.get_dummies(X)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)
print("Train size:", len(X_train), "Test size:", len(X_test))
```
We drop class from X (axis=1 means "drop a column"), so X keeps buying, maint, doors, persons, lug_boot, and safety. After get_dummies, safety becomes safety_low, safety_med, safety_high; a row with safety = high gets 0, 0, 1. X now contains only numbers, and the number of columns grows (one per category value). random_state=0 makes the split reproducible (the same every run). We use the train part to fit the model and the test part only to evaluate.

Next we create a RandomForestClassifier, set its hyperparameters, and call fit so it learns from the training data. Below we explain every parameter in plain English and what happens if you change it.
```python
model = RandomForestClassifier(
    random_state=0,
    n_estimators=100,        # 100 trees (default)
    max_depth=5,             # each tree at most 5 levels deep
    min_samples_split=0.01,  # min fraction of samples needed to split a node
    max_features=0.8,        # consider 80% of features per split
    max_samples=0.8          # each tree sees 80% of rows (bootstrap)
)
model.fit(X_train, y_train)
```
We predict on the test set, then compare predictions to the true labels. We use a confusion matrix to see where we’re right and wrong, and we use accuracy and the classification report (precision, recall, F1) to summarize performance.
```python
y_pred = model.predict(X_test)

print("Confusion matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nAccuracy:", accuracy_score(y_test, y_pred))
print("\nClassification report:")
print(classification_report(y_test, y_pred))
```
Random Forest gives us feature_importances_: a number per feature saying how much that feature was used to make splits (e.g. total decrease in Gini impurity). Higher = more important. We put these in a table and sort so we can see which columns (e.g. safety_high, persons_4) drive the prediction most.
```python
importance = pd.DataFrame({
    'feature': X_train.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)
print(importance.head(10))
```
Instead of picking hyperparameters by hand, we can let sklearn try many combinations and pick the one with the best cross-validation score. GridSearchCV does that: you give a list of values for each parameter (e.g. max_depth: 3, 5, 7), and it trains and evaluates every combination (e.g. 3×3×3×2 = 54 models with 5-fold CV). Slower, but often finds better settings.
```python
from sklearn.model_selection import GridSearchCV

param_grid = [{
    'max_depth': [3, 5, 7, 10],
    'min_samples_split': [0.01, 0.05, 0.1],
    'max_features': [0.7, 0.8, 1.0],
    'max_samples': [0.8, 1.0]
}]
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, verbose=1)
search.fit(X_train, y_train)

print("Best params:", search.best_params_)
print("Best score:", search.best_score_)
```
param_grid — A dictionary (or list of dicts) of parameter names and lists of values. GridSearchCV will try every combination: max_depth=3 with max_features=0.7, then 3 with 0.8, … then 5 with 0.7, etc. cv=5 means 5-fold cross-validation: the training set is split into 5 parts; each combination is trained 5 times (each time using 4 parts for train, 1 for validation) and the average score is taken. best_params_ is the combination that had the best average score; best_score_ is that score. You can then build a final model with RandomForestClassifier(**search.best_params_) and fit on the full training set, then evaluate on the held-out test set.
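That last step (refit with `**search.best_params_`, evaluate on the held-out test set) looks like this. The sketch is self-contained on sklearn's built-in iris data, standing in for the car data, since the steps are identical; the grid is deliberately tiny to keep it fast:

```python
# Self-contained sketch: grid-search, refit with the best params, score on test.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

search = GridSearchCV(RandomForestClassifier(random_state=0),
                      {"max_depth": [3, 5], "max_features": [0.7, 1.0]}, cv=5)
search.fit(X_tr, y_tr)

# Rebuild a fresh model with the winning combination and fit on all of train.
final = RandomForestClassifier(random_state=0, **search.best_params_)
final.fit(X_tr, y_tr)
acc = final.score(X_te, y_te)
print("Held-out accuracy:", acc)
```

The same pattern works unchanged with the car data: just swap in your X_train/X_test split and the larger param_grid from above.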
| Step | What we did |
|---|---|
| 1 | Imported pandas, sklearn (train_test_split, RandomForestClassifier, confusion_matrix, accuracy_score, classification_report), and hid warnings. |
| 2 | Loaded cars_data.csv, used head(10), info(), and value_counts() to explore rows, columns, and class balance. |
| 3 | Set y = class, X = other columns; encoded X with get_dummies(); split into X_train, X_test, y_train, y_test (80/20) with train_test_split. |
| 4 | Built RandomForestClassifier (n_estimators=100, max_depth=5, max_features=0.8, max_samples=0.8, etc.) and called fit(X_train, y_train). |
| 5 | Computed y_pred = model.predict(X_test); printed confusion_matrix, accuracy_score, and classification_report. |
| 6 | Built a DataFrame of feature names and feature_importances_, sorted by importance, and printed the top 10. |
| Optional | Used GridSearchCV with a param_grid and cv=5 to find best hyperparameters; printed best_params_ and best_score_. |
You now have a full Random Forest pipeline with theory, dataset explanation, and every line and parameter explained. Next: try the same steps on the Decision Trees & Random Forests lesson with the Iris dataset, or move on to Boosting!