Classify by asking: "Who are my nearest neighbors?" No fancy math—just look around and vote!
k-Nearest Neighbors (kNN) is a classification (and sometimes regression) algorithm. To predict the class of a new point, it looks at the K closest training points and lets them vote. Majority wins!
kNN means: "When someone new moves into the neighborhood, we don't ask them who they are—we look at their K closest neighbors. If 4 out of 5 neighbors have a dog, we guess they probably have a dog too."
Imagine you're at a party and you don't know if the music is "cool" or "uncool." You look at the 5 people standing closest to you. If 4 of them are dancing and 1 is on their phone, you guess the music is probably cool! 🎵
kNN does exactly this: for any new data point, it finds the K nearest points it already knows, checks their labels, and picks the label that appears most often.
"Tell me who your friends are, and I'll tell you who you are."
If most of your 5 nearest neighbors like pizza, you probably like pizza too. If most of them are "Spam" in an email dataset, the new email is probably Spam. kNN doesn't learn a formula—it just remembers all the training data and uses it at prediction time.
| Use Case | How kNN Helps |
|---|---|
| Recommendation | "Users who liked this also liked…" — find K nearest users and suggest what they bought |
| Image recognition | Compare a new image to K most similar stored images; assign the most common label |
| Credit / fraud | Is this transaction similar to past frauds? Look at K nearest past transactions |
| Medical | Patient with these symptoms — find K similar past patients and see their outcomes |
kNN is lazy: it doesn't "train" in the usual sense. At prediction time it does all the work.
1. Keep every (features, label) pair in memory. There is no "model" except this dataset.
2. Measure the distance from the new point to every training point (e.g. Euclidean distance).
3. Pick the K training points with the smallest distances. Those K points are the "K nearest neighbors."
4. Count how many of those K neighbors have each label. The label with the most votes wins. (For regression, you'd take the average of their target values.)
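To make the four steps concrete, here's a tiny from-scratch sketch. The toy data and the predict_knn helper are made up for illustration; in practice you'd use scikit-learn, shown later.

```python
import numpy as np
from collections import Counter

def predict_knn(X_train, y_train, x_new, k=5):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    # Step 2: Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Step 3: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Step 4: majority vote among those neighbors' labels
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# Toy data: [height, weight], label 1 = "likes pizza", 0 = "doesn't"
X_train = np.array([[170, 65], [175, 70], [160, 55], [180, 80], [165, 60], [172, 68]])
y_train = np.array([1, 1, 0, 1, 0, 1])
print(predict_knn(X_train, y_train, np.array([171, 66]), k=5))  # -> 1
```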
Imagine 2 features (e.g. height, weight). Red dots = "Like pizza." Blue dots = "Don't like pizza." The ? is the new person.
🔴 🔴 🔵 🔴 🔵 → ? ← 🔴 🔴 🔴 🔵 🔴
If K=5, we take the 5 closest dots to ?. Suppose 3 are red and 2 are blue. kNN predicts: Like pizza (red).
"Nearest" means we need a way to measure distance between two points. The choice of distance changes who the "neighbors" are.
Straight-line distance. For two points \((x_1, y_1)\) and \((x_2, y_2)\) in 2D:
distance = √[(x₂ − x₁)² + (y₂ − y₁)²]
With more features, you add more squared differences under the square root. This is the default in sklearn and works well when features are on similar scales.
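Quick sanity check in NumPy, with two made-up 2D points:

```python
import numpy as np

p1 = np.array([1.0, 2.0])
p2 = np.array([4.0, 6.0])

# sqrt((4-1)^2 + (6-2)^2) = sqrt(9 + 16) = 5.0
euclidean = np.sqrt(np.sum((p2 - p1) ** 2))
print(euclidean)                  # 5.0
print(np.linalg.norm(p2 - p1))    # same thing via np.linalg.norm
```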
Sum of absolute differences along each axis—like walking on a grid (city blocks):
distance = |x₂ − x₁| + |y₂ − y₁| + …
Useful when movement is along axes (e.g. streets) or when outliers should matter less than in Euclidean.
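Same made-up points, Manhattan instead of Euclidean:

```python
import numpy as np

p1 = np.array([1.0, 2.0])
p2 = np.array([4.0, 6.0])

# |4 - 1| + |6 - 2| = 3 + 4 = 7 (compare 5.0 for Euclidean)
manhattan = np.sum(np.abs(p2 - p1))
print(manhattan)  # 7.0
```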
Hamming distance counts how many positions are different between two vectors of the same length. Each position is compared: if they match, add 0; if they differ, add 1. It's the number of "flips" needed to turn one into the other.
distance = (number of positions where value₁ ≠ value₂)
Often we use the normalized Hamming distance: (number of differing positions) / (total number of positions), so the result is between 0 (identical) and 1 (all different).
Two binary strings: A = 1 0 1 1 0 and B = 1 1 1 0 0.
Compare position by position: same, different, same, different, same. So 2 positions differ → Hamming distance = 2. Normalized: 2/5 = 0.4.
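The same example in code, just to verify the count (A and B as above):

```python
import numpy as np

A = np.array([1, 0, 1, 1, 0])
B = np.array([1, 1, 1, 0, 0])

hamming = np.sum(A != B)              # number of differing positions
hamming_normalized = np.mean(A != B)  # fraction of differing positions
print(hamming, hamming_normalized)    # 2 0.4
```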
Use Hamming when your features are binary (0/1) or categorical encoded as 0/1 (e.g. one-hot). Then "distance" means "how many features disagree?"
Binary or categorical data: Survey answers (Yes=1, No=0), DNA bases, binary flags (has_fraud, is_premium), or one-hot encoded categories. Euclidean doesn't make sense for "different category" the same way—Hamming counts "how many attributes disagree."
In sklearn: Use metric='hamming' when your feature matrix is binary or multi-label (0/1 per position). The classifier will use Hamming distance to find nearest neighbors.
| Metric | When to Use |
|---|---|
| euclidean (default) | Most cases, especially after scaling |
| manhattan | High-dimensional or grid-like structure; less sensitive to large differences in one feature |
| hamming | Binary (0/1) or categorical data; counts how many positions differ |
| minkowski | General form; you can tune the power (p=2 → Euclidean, p=1 → Manhattan) |
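In scikit-learn you choose the metric via the metric parameter of KNeighborsClassifier (and p for Minkowski). A quick sketch of the options in the table:

```python
from sklearn.neighbors import KNeighborsClassifier

# Default: Minkowski with p=2, i.e. Euclidean
knn_euclidean = KNeighborsClassifier(n_neighbors=5)

# Minkowski with p=1 is Manhattan distance
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=1)

# Hamming for binary / one-hot features
knn_hamming = KNeighborsClassifier(n_neighbors=5, metric='hamming')
```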
K is the only main hyperparameter: how many neighbors get to vote?
Plot accuracy (or your metric) vs K. Often you'll see accuracy improve as K increases from 1, then flatten or drop. The "sweet spot" is usually in that flat or peak region. Use GridSearchCV or a simple loop with cross_val_score to find it.
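For example, a minimal GridSearchCV sketch, assuming the scaled training data X_train_scaled and y_train from the pipeline shown later:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

param_grid = {'n_neighbors': range(1, 26)}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X_train_scaled, y_train)

print("Best K:", grid.best_params_['n_neighbors'])
print("Best CV accuracy:", grid.best_score_)
```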
Picture the new point as a star ★ with a circle drawn around it: the circle contains the K neighbors that get to vote. As K grows, more (and farther) neighbors fall inside the circle, and the prediction can change.
K=1 is noisy (one neighbor decides everything). K=9 may include too-far points. K=3 or K=5 is usually the sweet spot!
So far we assumed every one of the K neighbors gets one vote. In weighted kNN, closer neighbors get more influence than farther ones. A neighbor right next to the new point should matter more than one that barely made it into the "top K."
Weighted kNN means: "Not all neighbors are equal. The closer a neighbor is, the more we trust their vote." We assign a weight to each neighbor (usually based on distance) and use those weights when voting or averaging.
You're asking 5 people whether to get pizza or pasta. Two live next door (very close), three live down the street (farther). In uniform kNN, all 5 get one vote. In weighted kNN, the two next-door neighbors' votes count more—so if both say "pizza," their opinion can outweigh the three who said "pasta."
The most common choice is inverse distance: weight = 1 / distance. So:
Closer neighbors get a larger weight and therefore more say in the vote. For classification we sum the weights per class (instead of counting votes) and pick the class with the highest total weight. For regression we take the weighted average of the neighbors' target values.
If neighbor \(i\) is at distance \(d_i\), its weight is \(w_i = 1 / d_i\). To avoid division by zero when a neighbor is at distance 0 (same point), sklearn and others often use a small epsilon or treat that point as having the largest weight.
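Here's a tiny by-hand illustration with made-up distances, mirroring the pizza/pasta story above:

```python
# Five neighbors: (label, distance to the new point) - made-up numbers.
# Two very close neighbors say "pizza", three farther ones say "pasta".
neighbors = [('pizza', 0.5), ('pizza', 0.8), ('pasta', 2.0), ('pasta', 2.5), ('pasta', 3.0)]

# Sum the inverse-distance weights per class
totals = {}
for label, d in neighbors:
    totals[label] = totals.get(label, 0) + 1 / d

print(totals)                        # {'pizza': 3.25, 'pasta': 1.23...}
print(max(totals, key=totals.get))   # pizza wins despite being outnumbered 2 to 3
```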
Use weights='distance' when your K neighbors can be at very different distances, e.g. one is very close and four are farther. Letting the close one(s) dominate often improves predictions. Use weights='uniform' (the default) when you want simplicity or when all K neighbors tend to be at similar distances; uniform is also more stable if distances are noisy. In many cases weights='distance' gives a small accuracy boost.

```python
# Weighted kNN: closer neighbors count more
knn_weighted = KNeighborsClassifier(n_neighbors=5, weights='distance')
knn_weighted.fit(X_train_scaled, y_train)
y_pred = knn_weighted.predict(X_test_scaled)
```
Anomaly detection means finding points that don't "belong"—they're unusual compared to the rest. Fraudulent transactions, faulty sensors, rare diseases, or defective products are examples. kNN is a natural fit: points that are far from their neighbors are potential anomalies.
kNN for anomaly detection means: "If a point has no close neighbors, it's suspicious. Normal points sit near others; anomalies sit alone." We use the distance to the Kth nearest neighbor (or the average distance to K neighbors) as an anomaly score: the larger the distance, the more likely the point is an anomaly.
Imagine a playground. Most kids are in clusters—near friends. One kid is standing far from everyone. That kid is an anomaly. We don't need labels (who's "normal" or "weird"); we just notice: "This point's nearest neighbors are very far away." In kNN terms: for that kid, the distance to the 5th nearest kid is huge compared to other kids. So we flag them as different.
Using the Kth neighbor (not the 1st) makes the score more stable: one random nearby point doesn't hide an otherwise isolated point. So we ask: "How far do I have to go to find K friends?" If you have to go very far, you're an outlier.
```python
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler
import numpy as np

# Scale the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit kNN: we only need distances, not labels.
# Ask for K+1 neighbors because each point is its own nearest neighbor (distance 0).
K = 5
nn = NearestNeighbors(n_neighbors=K + 1)
nn.fit(X_scaled)

# Distance to the Kth real neighbor for each point (anomaly score)
distances, indices = nn.kneighbors(X_scaled)
# Column 0 is the point itself; the last column is the Kth neighbor
anomaly_score = distances[:, -1]

# Flag top 5% as anomalies (e.g. threshold = 95th percentile)
threshold = np.percentile(anomaly_score, 95)
is_anomaly = anomaly_score > threshold
print(f"Number of anomalies: {is_anomaly.sum()}")
```
Alternative: scikit-learn's LocalOutlierFactor (LOF) is a kNN-based anomaly detector that compares each point's local density to its neighbors'. Use from sklearn.neighbors import LocalOutlierFactor and fit_predict(X) for a -1 (anomaly) / 1 (normal) label per point.
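A minimal usage sketch, reusing X_scaled from the snippet above:

```python
from sklearn.neighbors import LocalOutlierFactor

lof = LocalOutlierFactor(n_neighbors=20)   # n_neighbors plays the role of K
labels = lof.fit_predict(X_scaled)         # -1 = anomaly, 1 = normal
print("Number of anomalies:", (labels == -1).sum())
```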
kNN uses distance. If one feature is in thousands (e.g. income) and another is 0–10 (e.g. satisfaction score), the big numbers will dominate and the small feature will barely matter.
Use StandardScaler or MinMaxScaler so every feature is on a similar scale (e.g. mean 0, std 1, or 0–1). Otherwise "nearest" is decided almost only by the feature with the largest range.
Comparing "height in cm" (e.g. 170) and "number of siblings" (e.g. 2). Without scaling, a difference of 10 in height would swamp a difference of 2 in siblings. After scaling, both contribute fairly to distance.
Using scikit-learn: load data, split, scale, fit, predict, and (optionally) find the best K.
```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler

# 1. Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Scale (important for kNN!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 3. Create and fit kNN with K=5
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)

# 4. Predict and score
y_pred = knn.predict(X_test_scaled)
print("Accuracy:", knn.score(X_test_scaled, y_test))
```
```python
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt

scores = []
k_range = range(1, 25)
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    cv_scores = cross_val_score(knn, X_train_scaled, y_train, cv=5)
    scores.append(cv_scores.mean())

# Plot: often best K is in the middle
plt.plot(k_range, scores)
plt.xlabel('K')
plt.ylabel('CV Accuracy')
plt.title('Finding Optimal K')
plt.show()
```
Weighted kNN: Use weights='distance' so closer neighbors count more (see Part 5: Weighted kNN). Default is weights='uniform' (one vote per neighbor).
When your features are binary (0/1) or one-hot encoded, use metric='hamming' so "nearest" means "fewest disagreeing positions":
```python
# For binary or one-hot encoded features
knn_hamming = KNeighborsClassifier(n_neighbors=5, metric='hamming')
knn_hamming.fit(X_binary_train, y_train)
y_pred = knn_hamming.predict(X_binary_test)
```
| Pros | Cons |
|---|---|
| Simple: no "training" step, easy to explain | Slow at prediction when dataset is huge (must compute distance to every point) |
| No assumptions about the shape of the data | Curse of dimensionality: in many dimensions, "nearest" neighbors can be far; accuracy drops |
| Works for classification and regression | Needs scaling; sensitive to irrelevant features |
| Good baseline; often works well for small–medium data | You must store the whole training set (memory) |
- Weighted kNN: use weights='distance' so closer neighbors have more influence (inverse distance); the default is uniform (one vote each).
- Anomaly detection: use NearestNeighbors and the distance to the Kth neighbor (or LocalOutlierFactor) to flag outliers.

Next, learn about Support Vector Machines (SVM), the maximum margin classifier with the kernel trick! Or jump to Decision Trees & Random Forests.