👶 ABSOLUTE BEGINNER FRIENDLY

👥 k-Nearest Neighbors (kNN)

Classify by asking: "Who are my nearest neighbors?" No fancy math—just look around and vote!

Part 1: What is k-Nearest Neighbors?

k-Nearest Neighbors (kNN) is a classification (and sometimes regression) algorithm. To predict the class of a new point, it looks at the K closest training points and lets them vote. Majority wins!

👶 In One Sentence (Like You're 5)

kNN means: "When someone new moves into the neighborhood, we don't ask them who they are—we look at their K closest neighbors. If 4 out of 5 neighbors have a dog, we guess they probably have a dog too."

👶 Explain Like I'm 5

Imagine you're at a party and you don't know if the music is "cool" or "uncool." You look at the 5 people standing closest to you. If 4 of them are dancing and 1 is on their phone, you guess the music is probably cool! 🎵

kNN does exactly this: for any new data point, it finds the K nearest points it already knows, checks their labels, and picks the label that appears most often.

🏘️ The Neighborhood Analogy

"Tell me who your friends are, and I'll tell you who you are."

If most of your 5 nearest neighbors like pizza, you probably like pizza too. If most of them are "Spam" in an email dataset, the new email is probably Spam. kNN doesn't learn a formula—it just remembers all the training data and uses it at prediction time.

Where kNN Is Used

  • Recommendation: "Users who liked this also liked…" — find the K nearest users and suggest what they bought
  • Image recognition: compare a new image to the K most similar stored images; assign the most common label
  • Credit / fraud: is this transaction similar to past frauds? Look at the K nearest past transactions
  • Medical: for a patient with these symptoms, find K similar past patients and see their outcomes

Part 2: How kNN Works (Step by Step)

kNN is lazy: it doesn't "train" in the usual sense. At prediction time it does all the work.

  1. Store all training data: keep every (features, label) pair in memory. There is no "model" except this dataset.
  2. When a new point arrives, compute distances: measure the distance from the new point to every training point (e.g. Euclidean distance).
  3. Pick the K smallest distances: those K points are the "K nearest neighbors."
  4. Vote: count how many of those K neighbors have each label. The label with the most votes wins. (For regression, you'd take the average of their target values.)
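The four steps above fit in a few lines of plain Python. This is a minimal from-scratch sketch on made-up toy points, not a production implementation:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Step 2: distance from the new point to every training point (Euclidean)
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 3: indices of the K smallest distances
    nearest = np.argsort(dists)[:k]
    # Step 4: majority vote among the K neighbors' labels
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data: two features per point, labels "red" / "blue"
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0], [3.5, 4.5]])
y_train = ["red", "red", "blue", "blue", "blue"]

print(knn_predict(X_train, y_train, np.array([1.2, 1.5]), k=3))  # → red
```

In practice you'd use scikit-learn's KNeighborsClassifier (Part 8), which does the same thing but with fast neighbor-search structures like KD-trees.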

📊 Idea in 2D

Imagine 2 features (e.g. height, weight). Red dots = "Like pizza." Blue dots = "Don't like pizza." The ? is the new person.

🔴 🔴 🔵 🔴 🔵 → ? ← 🔴 🔴 🔴 🔵 🔴

If K=5, we take the 5 closest dots to ?. Suppose 3 are red and 2 are blue. kNN predicts: Like pizza (red).

Part 3: Distance Metrics

"Nearest" means we need a way to measure distance between two points. The choice of distance changes who the "neighbors" are.

Euclidean Distance (Default)

Straight-line distance. For two points \((x_1, y_1)\) and \((x_2, y_2)\) in 2D:

distance = √[(x₂ − x₁)² + (y₂ − y₁)²]

With more features, you add more squared differences under the square root. This is the default in sklearn and works well when features are on similar scales.

Manhattan Distance

Sum of absolute differences along each axis—like walking on a grid (city blocks):

distance = |x₂ − x₁| + |y₂ − y₁| + …

Useful when movement is along axes (e.g. streets) or when outliers should matter less than in Euclidean.
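Both distance formulas above are one line of NumPy each (the two points here are made up for illustration):

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

euclidean = np.sqrt(((a - b) ** 2).sum())  # √[(4−1)² + (6−2)²] = √25 = 5.0
manhattan = np.abs(a - b).sum()            # |4−1| + |6−2| = 7.0

print(euclidean, manhattan)  # 5.0 7.0
```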

Hamming Distance

Hamming distance counts how many positions are different between two vectors of the same length. Each position is compared: if they match, add 0; if they differ, add 1. It's the number of "flips" needed to turn one into the other.

distance = (number of positions where value₁ ≠ value₂)

Often we use the normalized Hamming distance: (number of differing positions) / (total number of positions), so the result is between 0 (identical) and 1 (all different).

👶 Example (Like You're 5)

Two binary strings: A = 1 0 1 1 0 and B = 1 1 1 0 0.

Compare position by position: same, different, same, different, same. So 2 positions differ → Hamming distance = 2. Normalized: 2/5 = 0.4.

Use Hamming when your features are binary (0/1) or categorical encoded as 0/1 (e.g. one-hot). Then "distance" means "how many features disagree?"
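The worked example above in NumPy, with A and B as the same two binary strings:

```python
import numpy as np

A = np.array([1, 0, 1, 1, 0])
B = np.array([1, 1, 1, 0, 0])

hamming = (A != B).sum()      # number of positions that differ
normalized = (A != B).mean()  # fraction of positions that differ

print(hamming, normalized)  # 2 0.4
```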

🔤 When to Use Hamming

Binary or categorical data: Survey answers (Yes=1, No=0), DNA bases, binary flags (has_fraud, is_premium), or one-hot encoded categories. Euclidean doesn't make sense for "different category" the same way—Hamming counts "how many attributes disagree."

In sklearn: Use metric='hamming' when your feature matrix is binary or multi-label (0/1 per position). The classifier will use Hamming distance to find nearest neighbors.

  • euclidean (default): most cases, especially after scaling
  • manhattan: high-dimensional or grid-like structure; less sensitive to a large difference in one feature
  • hamming: binary (0/1) or categorical data; counts how many positions differ
  • minkowski: general form; you can tune the power (p=2 → Euclidean, p=1 → Manhattan)

Part 4: Choosing K

K is the only main hyperparameter: how many neighbors get to vote?

💡 Rules of Thumb

  • K too small (e.g. K=1): Very sensitive to noise. One weird neighbor can flip the prediction → overfitting.
  • K too large: You include far-away points that aren't really "neighbors." The decision boundary gets smooth and you lose local structure → underfitting.
  • Odd K (3, 5, 7…): For binary classification, odd K avoids ties. For multi-class problems ties can still happen; sklearn then simply picks one of the tied classes (in practice, the first in its class ordering).
  • Typical range: Try K from 3 to about 20 and use cross-validation to pick the best.

🎯 Finding the Right K

Plot accuracy (or your metric) vs K. Often you'll see accuracy improve as K increases from 1, then flatten or drop. The "sweet spot" is usually in that flat or peak region. Use GridSearchCV or a simple loop with cross_val_score to find it.
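A minimal GridSearchCV sketch for tuning K; the synthetic dataset from make_classification stands in for your own X and y:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic data in place of your own dataset
X, y = make_classification(n_samples=300, n_features=5, random_state=42)
X_scaled = StandardScaler().fit_transform(X)  # scale first: kNN uses distance

# Try K from 3 to 20 with 5-fold cross-validation
grid = GridSearchCV(KNeighborsClassifier(),
                    {"n_neighbors": list(range(3, 21))}, cv=5)
grid.fit(X_scaled, y)
print("Best K:", grid.best_params_["n_neighbors"])
```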

In practice: K=1 is noisy (one neighbor decides everything), a large K like 9 may already include too-far points, and K=3 or K=5 is often the sweet spot.

Part 5: Weighted kNN

So far we assumed every one of the K neighbors gets one vote. In weighted kNN, closer neighbors get more influence than farther ones. A neighbor right next to the new point should matter more than one that barely made it into the "top K."

👶 In One Sentence

Weighted kNN means: "Not all neighbors are equal. The closer a neighbor is, the more we trust their vote." We assign a weight to each neighbor (usually based on distance) and use those weights when voting or averaging.

👶 Example (Like You're 5)

You're asking 5 people whether to get pizza or pasta. Two live next door (very close), three live down the street (farther). In uniform kNN, all 5 get one vote. In weighted kNN, the two next-door neighbors' votes count more—so if both say "pizza," their opinion can outweigh the three who said "pasta."

How the Weights Work

The most common choice is inverse distance: weight = 1 / distance. Closer neighbors get a larger weight and therefore more say in the vote. For classification we sum the weights per class (instead of counting votes) and pick the class with the highest total weight; for regression we take the weighted average of the neighbors' target values.

📐 Inverse Distance Formula

If neighbor i is at distance dᵢ, its weight is wᵢ = 1 / dᵢ. This blows up when a neighbor is at distance 0 (an exact duplicate of the query point), so implementations handle that case specially: scikit-learn, for example, gives zero-distance neighbors all of the voting weight.
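A tiny worked example of inverse-distance voting, with made-up neighbors and distances in the spirit of the pizza/pasta story above:

```python
# Five neighbors as (label, distance) pairs: two very close, three farther away
neighbors = [("pizza", 0.5), ("pizza", 0.8),
             ("pasta", 2.0), ("pasta", 2.5), ("pasta", 3.0)]

# Sum the inverse-distance weights per class
totals = {}
for label, d in neighbors:
    totals[label] = totals.get(label, 0.0) + 1.0 / d

# pizza: 1/0.5 + 1/0.8 = 3.25; pasta: 1/2 + 1/2.5 + 1/3 ≈ 1.23
# Uniform voting would say "pasta" (3 votes vs 2); weighted voting says "pizza"
print(max(totals, key=totals.get))  # → pizza
```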

💡 When to Use Weighted kNN

  • Use weights='distance' when your K neighbors can be at very different distances—e.g. one is very close and four are farther. Letting the close one(s) dominate often improves predictions.
  • Use weights='uniform' (default) when you want simplicity or when all K neighbors tend to be at similar distances. Uniform is also more stable if distances are noisy.
  • You can try both and compare with cross-validation; on many datasets weights='distance' gives a small accuracy boost.
from sklearn.neighbors import KNeighborsClassifier

# Weighted kNN: closer neighbors count more
knn_weighted = KNeighborsClassifier(n_neighbors=5, weights='distance')
knn_weighted.fit(X_train_scaled, y_train)
y_pred = knn_weighted.predict(X_test_scaled)

Part 6: Anomaly Detection with kNN

Anomaly detection means finding points that don't "belong"—they're unusual compared to the rest. Fraudulent transactions, faulty sensors, rare diseases, or defective products are examples. kNN is a natural fit: points that are far from their neighbors are potential anomalies.

👶 In One Sentence

kNN for anomaly detection means: "If a point has no close neighbors, it's suspicious. Normal points sit near others; anomalies sit alone." We use the distance to the Kth nearest neighbor (or the average distance to K neighbors) as an anomaly score: the larger the distance, the more likely the point is an anomaly.

👶 Example (Like You're 5)

Imagine a playground. Most kids are in clusters—near friends. One kid is standing far from everyone. That kid is an anomaly. We don't need labels (who's "normal" or "weird"); we just notice: "This point's nearest neighbors are very far away." In kNN terms: for that kid, the distance to the 5th nearest kid is huge compared to other kids. So we flag them as different.

How It Works

  1. For each point, find its K nearest neighbors and get their distances.
  2. Use the distance to the Kth neighbor (or the mean distance to all K neighbors) as that point's anomaly score.
  3. Points with a high score are far from others → flag them as anomalies. You can set a threshold (e.g. "score > 10") or use a percentile (e.g. top 5% of scores are anomalies).

📐 Why Distance to Kth Neighbor?

Using the Kth neighbor (not the 1st) makes the score more stable: one random nearby point doesn't hide an otherwise isolated point. So we ask: "How far do I have to go to find K friends?" If you have to go very far, you're an outlier.

💡 When to Use kNN for Anomaly Detection

  • Unsupervised: You don't have labels like "normal" vs "fraud." You just have data and want to find rare or unusual points.
  • Fraud, intrusion, quality: Transactions, network traffic, sensor readings, manufacturing—anything where "different" means "worth checking."
  • Scaling matters: Same as for classification—scale features so distance is meaningful.
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler
import numpy as np

# Scale the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit a nearest-neighbors model: we only need distances, not labels.
# Use K + 1 because when querying the training set itself, each point's
# closest "neighbor" is the point itself, at distance 0.
K = 5
nn = NearestNeighbors(n_neighbors=K + 1)
nn.fit(X_scaled)

# Distance to the Kth real neighbor for each point (anomaly score)
distances, indices = nn.kneighbors(X_scaled)
# Column 0 is the point itself (distance 0); the last column is the Kth real neighbor
anomaly_score = distances[:, -1]

# Flag top 5% as anomalies (e.g. threshold = 95th percentile)
threshold = np.percentile(anomaly_score, 95)
is_anomaly = anomaly_score > threshold
print(f"Number of anomalies: {is_anomaly.sum()}")

Alternative: scikit-learn's LocalOutlierFactor (LOF) is a kNN-based anomaly detector that compares each point's local density to its neighbors'. Use from sklearn.neighbors import LocalOutlierFactor and fit_predict(X) for a -1 (anomaly) / 1 (normal) label per point.
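A minimal LocalOutlierFactor sketch along those lines; the synthetic cluster plus one far-away point are made up for illustration:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),  # one normal cluster
               [[8.0, 8.0]]])                    # one obvious outlier

lof = LocalOutlierFactor(n_neighbors=5)
labels = lof.fit_predict(X)  # -1 = anomaly, 1 = normal
print("Anomalies at indices:", np.where(labels == -1)[0])
```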

Part 7: Why Scaling Matters

kNN uses distance. If one feature is in thousands (e.g. income) and another is 0–10 (e.g. satisfaction score), the big numbers will dominate and the small feature will barely matter.

⚠️ Always Scale for kNN

Use StandardScaler or MinMaxScaler so every feature is on a similar scale (e.g. mean 0, std 1, or 0–1). Otherwise "nearest" is decided almost only by the feature with the largest range.

👶 Analogy

Comparing "height in cm" (e.g. 170) and "number of siblings" (e.g. 2). Without scaling, a difference of 10 in height would swamp a difference of 2 in siblings. After scaling, both contribute fairly to distance.

Part 8: kNN in Python

Using scikit-learn: load data, split, scale, fit, predict, and (optionally) find the best K.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler

# 1. Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Scale (important for kNN!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 3. Create and fit kNN with K=5
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)

# 4. Predict and score
y_pred = knn.predict(X_test_scaled)
print("Accuracy:", knn.score(X_test_scaled, y_test))

Finding the Best K with Cross-Validation

from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt

scores = []
k_range = range(1, 25)

for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    cv_scores = cross_val_score(knn, X_train_scaled, y_train, cv=5)
    scores.append(cv_scores.mean())

# Plot: often best K is in the middle
plt.plot(k_range, scores)
plt.xlabel('K')
plt.ylabel('CV Accuracy')
plt.title('Finding Optimal K')
plt.show()

Weighted kNN: Use weights='distance' so closer neighbors count more (see Part 5: Weighted kNN). Default is weights='uniform' (one vote per neighbor).

Using Hamming Distance (Binary / Categorical Data)

When your features are binary (0/1) or one-hot encoded, use metric='hamming' so "nearest" means "fewest disagreeing positions":

# For binary or one-hot encoded features
knn_hamming = KNeighborsClassifier(n_neighbors=5, metric='hamming')
knn_hamming.fit(X_binary_train, y_train)
y_pred = knn_hamming.predict(X_binary_test)

Part 9: When to Use kNN — Pros & Cons

Pros:
  • Simple: no "training" step, easy to explain
  • No assumptions about the shape of the data
  • Works for classification and regression
  • Good baseline; often works well for small–medium data

Cons:
  • Slow at prediction when the dataset is huge (must compute distance to every point)
  • Curse of dimensionality: in many dimensions, "nearest" neighbors can be far away; accuracy drops
  • Needs scaling; sensitive to irrelevant features
  • You must store the whole training set (memory)

💡 When kNN Shines

  • Small to medium datasets (thousands to tens of thousands of points)
  • Not too many features (or after dimensionality reduction)
  • When you want a simple, interpretable baseline ("we predict based on similar past cases")
  • When decision boundaries are irregular (non-linear) and you don't need a compact model

Part 10: Summary

Next, learn about Support Vector Machines (SVM), the maximum-margin classifier with the kernel trick! Or jump to Decision Trees & Random Forests.