Math for Data Science (For Complete Beginners) | Fakhruddin Khambaty's Learning Hub

Chapter 1: What Even Is Data?

👶 Explain Like I'm 5

Data is just a fancy word for information written down as numbers or words.

Think of it like a list. When your mom writes a grocery list, that's data!

🏠 Real Life Example: Your Class Grades

Imagine you have test scores for your class:

Student	Math Score	English Score	Science Score
Ali	85	78	92
Sara	72	88	76
Ahmed	90	85	88
Fatima	65	70	75
Omar	78	82	80

That table above? That's DATA! Each row is one student. Each column is one type of score. Simple!

🤔 Why Do We Need Math for Data?

Imagine the teacher asks: "Who is the best student?"

You can't just look at the table and know immediately. You need to calculate something!

That's where math comes in - it helps us understand and summarize data!
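Here is a small sketch of how the class table could be stored in Python and used to answer the teacher's question. It assumes "best" means the highest total across the three subjects, which is only one possible definition. The names scores, totals, and best_student are made up for this illustration.

```python
# Scores from the class table, stored as name -> (Math, English, Science).
scores = {
    "Ali": (85, 78, 92),
    "Sara": (72, 88, 76),
    "Ahmed": (90, 85, 88),
    "Fatima": (65, 70, 75),
    "Omar": (78, 82, 80),
}

# Assumption: "best" means the highest total across all three subjects.
totals = {name: sum(marks) for name, marks in scores.items()}
best_student = max(totals, key=totals.get)

print("Totals:", totals)
print("Highest total:", best_student, "with", totals[best_student])
# Output: Highest total: Ahmed with 263
```

A different definition of "best" (for example, highest Math score only) could give a different answer, which is why we need clear calculations.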

📘 Course source

The course includes Maths for Data Science.pdf for a slide-style overview and a Practice Questions on Maths notebook for extra exercises.

Chapter 2: The MEAN (Average)

👶 Explain Like I'm 5

The mean (we also call it "average") answers this question:

"If everyone got the SAME score, what would that score be?"

It's like splitting a pizza equally among friends!

🍕 Pizza Analogy

You have 3 friends. One has 2 slices, one has 4 slices, one has 6 slices.

2 + 4 + 6 = 12 slices total

If you share equally among 3 friends:

12 ÷ 3 = 4 slices each

The AVERAGE is 4 slices per person!

📐 The Formula

Mean = Sum of all values ÷ Count of values

Just add everything up, then divide by how many things you added!

Let's Calculate Step by Step!

Let's find the average Math score from our class data:

📝 Follow These Steps

Write down all the Math scores

Ali: 85, Sara: 72, Ahmed: 90, Fatima: 65, Omar: 78

Add them all together

85 + 72 + 90 + 65 + 78 = 390

Count how many students

We have 5 students

Divide the sum by the count

390 ÷ 5 = 78

✅ Answer

The average Math score is: 78

This means if everyone scored the same, they'd all have 78!

Now Let's Do It in Python!

📝 Python Code - Copy and try it yourself!

# Step 1: Create a list of Math scores
math_scores = [85, 72, 90, 65, 78]

# Step 2: Add all scores together using sum()
total = sum(math_scores)
print("Sum of all scores:", total)
# Output: Sum of all scores: 390

# Step 3: Count how many scores using len()
count = len(math_scores)
print("Number of students:", count)
# Output: Number of students: 5

# Step 4: Divide to get the average
average = total / count
print("Average Math score:", average)
# Output: Average Math score: 78.0

# OR use the shortcut with NumPy library
import numpy as np
average = np.mean(math_scores)
print("Average (using NumPy):", average)
# Output: Average (using NumPy): 78.0

🔍 What does each line mean?

math_scores = [85, 72, 90, 65, 78] → Creates a list (like a container) holding all 5 scores
sum() → A built-in Python function that adds up everything in a list
len() → A built-in Python function that counts how many items are in a list
np.mean() → NumPy's shortcut function that calculates average in one step
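The same sum-then-divide idea can be wrapped in a small reusable function and tried on the other two columns of the class table. This is only a sketch: the function name mean and the empty-list check are choices made for this example, not part of the course code.

```python
def mean(values):
    """Return the sum of the values divided by how many there are."""
    if len(values) == 0:
        raise ValueError("mean() needs at least one value")
    return sum(values) / len(values)

# English and Science columns from the class table
english_scores = [78, 88, 85, 70, 82]
science_scores = [92, 76, 88, 75, 80]

print("Average English score:", mean(english_scores))
print("Average Science score:", mean(science_scores))
# Output: Average English score: 80.6
# Output: Average Science score: 82.2
```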

Chapter 3: The MEDIAN (Middle Value)

👶 Explain Like I'm 5

The median is the middle person when everyone stands in a line from shortest to tallest!

If 5 kids stand in order of height, the median is the 3rd kid - the one in the exact middle!

👥 Standing in Line

Let's sort the Math scores from lowest to highest:

65, 72, 78, 85, 90

The middle number is: 78

That's the MEDIAN!

📝 How to Find the Median

Sort the numbers from smallest to largest

65, 72, 78, 85, 90 (we already sorted them!)

If ODD count: Pick the middle one

We have 5 numbers (odd). Middle position = (5+1)/2 = 3rd number = 78

If EVEN count: Average the two middle numbers

Example: If we had 4 numbers [65, 72, 78, 85], median = (72+78)/2 = 75
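The three steps above can be sketched as one small function that handles both the odd and the even case. The function name median is just a label chosen for this example.

```python
def median(values):
    """Sort the values, then pick the middle one (or average the two middle ones)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd count: there is exactly one middle value
        return ordered[middle]
    # Even count: average the two values around the middle
    return (ordered[middle - 1] + ordered[middle]) / 2

print(median([85, 72, 90, 65, 78]))  # Output: 78
print(median([65, 72, 78, 85]))      # Output: 75.0
```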

⚠️ Why Do We Need BOTH Mean and Median?

Mean can be tricked by extreme values!

Example: If Bill Gates walks into a room of 10 regular people, the AVERAGE wealth becomes billions! But the MEDIAN stays normal.
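To see this effect with numbers, here is a rough sketch using Python's built-in statistics module. All wealth figures below are invented for illustration, including the 100 billion dollar visitor.

```python
import statistics

# Made-up wealth values (in dollars) for 10 regular people in a room.
room = [20_000, 35_000, 40_000, 50_000, 55_000,
        60_000, 70_000, 80_000, 90_000, 100_000]

mean_before = statistics.mean(room)
median_before = statistics.median(room)

# A billionaire walks in (100 billion dollars is an illustrative figure).
room_with_billionaire = room + [100_000_000_000]
mean_after = statistics.mean(room_with_billionaire)
median_after = statistics.median(room_with_billionaire)

print("Mean before:", mean_before, "| Median before:", median_before)
print("Mean after:", round(mean_after), "| Median after:", median_after)
```

The mean jumps from tens of thousands into the billions, while the median only moves from 57,500 to 60,000.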

Python Code for Median

import numpy as np

math_scores = [85, 72, 90, 65, 78]

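# Method 1: Manual way
sorted_scores = sorted(math_scores)  # Sort the list
print("Sorted:", sorted_scores)
# Output: Sorted: [65, 72, 78, 85, 90]
middle_index = len(sorted_scores) // 2  # Works here because the count (5) is odd
print("Median (manual):", sorted_scores[middle_index])
# Output: Median (manual): 78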

# Method 2: Using NumPy (easier!)
median = np.median(math_scores)
print("Median:", median)
# Output: Median: 78.0

Chapter 4: The MODE (Most Popular)

👶 Explain Like I'm 5

The mode is the most popular value - the one that appears the MOST times!

Like: What's the most popular ice cream flavor in your class?

🍦 Ice Cream Survey

Ask 10 kids their favorite flavor:

Chocolate, Vanilla, Chocolate, Strawberry, Chocolate, Vanilla, Chocolate, Mango, Vanilla, Chocolate

Chocolate appears 5 times (most!) → Chocolate is the MODE!
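The survey can be counted in Python too. This sketch uses collections.Counter from the standard library, which works with words as well as numbers. The variable names are chosen just for this example.

```python
from collections import Counter

# The 10 answers from the ice cream survey
flavors = ["Chocolate", "Vanilla", "Chocolate", "Strawberry", "Chocolate",
           "Vanilla", "Chocolate", "Mango", "Vanilla", "Chocolate"]

counts = Counter(flavors)

# most_common(1) returns a list with the single most frequent (value, count) pair
mode_flavor, mode_count = counts.most_common(1)[0]
print("Mode:", mode_flavor, "appears", mode_count, "times")
# Output: Mode: Chocolate appears 5 times
```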

from scipy import stats

# Example: Shoe sizes sold today
shoe_sizes = [7, 8, 8, 9, 8, 10, 8, 7, 9, 8]

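# Find the mode (most common value)
# keepdims=True (SciPy 1.9+) keeps the results as arrays, so [0] picks out the value
mode_result = stats.mode(shoe_sizes, keepdims=True)
print("Most common shoe size:", mode_result.mode[0])
print("How many times it appears:", mode_result.count[0])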
# Output: Most common shoe size: 8
# Output: How many times it appears: 5

📊 When to Use What?

Measure	What It Tells You	Best For
Mean	The "typical" value if distributed equally	Normal data without extreme values
Median	The actual middle value	Data with some extreme values (like salaries)
Mode	The most common value	Categories (like favorite color, shoe size)

Chapter 5: Variance & Standard Deviation (How Spread Out?)

👶 Explain Like I'm 5

Imagine two classes both have an average score of 75.

Class A: Everyone scored between 70-80 (very similar!)

Class B: Some scored 40, some scored 100 (very different!)

Variance and Standard Deviation tell us "how spread out are the scores?"
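As a quick sketch of the two-classes idea, the scores below are invented so that both classes average exactly 75. The spread is measured with statistics.pstdev, the standard library's population standard deviation, a measure this chapter explains step by step.

```python
import statistics

# Hypothetical scores: both classes have the same average of 75
class_a = [70, 72, 75, 78, 80]     # everyone between 70 and 80
class_b = [40, 60, 75, 100, 100]   # scores from 40 all the way to 100

for name, scores in [("Class A", class_a), ("Class B", class_b)]:
    average = sum(scores) / len(scores)
    spread = statistics.pstdev(scores)  # population standard deviation
    print(name, "average:", average, "spread:", round(spread, 2))

# Output: Class A average: 75.0 spread: 3.69
# Output: Class B average: 75.0 spread: 23.24
```

Same average, very different spread: that difference is what variance and standard deviation capture.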

🎯 Two Dart Players

Both players' throws average out to the same spot (the bullseye), but their darts land very differently:

🎯 Player A: All darts close together → Low Variance ✓

💥 Player B: Darts scattered everywhere → High Variance ✗

What is Variance?

Variance measures how far the numbers are from the mean, on average, using squared distances.

📐 Variance Formula (Don't Panic!)

Variance = Average of (each value - mean)²

We square the differences so negative numbers don't cancel out positive ones!

📝 Calculate Variance Step by Step

Using our Math scores: 85, 72, 90, 65, 78 (Mean = 78)

Find how far each score is from the mean (78)

85-78=7, 72-78=-6, 90-78=12, 65-78=-13, 78-78=0

Square each difference (multiply by itself)

7²=49, (-6)²=36, 12²=144, (-13)²=169, 0²=0

Add all squared differences

49 + 36 + 144 + 169 + 0 = 398

Divide by count (5) to get average

398 ÷ 5 = 79.6 → This is the VARIANCE!
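The four steps above can be followed line by line in plain Python. This is a sketch that mirrors the hand calculation, with variable names chosen for this example. Like the steps above, it divides by the full count (5).

```python
math_scores = [85, 72, 90, 65, 78]
mean = sum(math_scores) / len(math_scores)  # 78.0

# Step 1: how far is each score from the mean?
differences = [score - mean for score in math_scores]
print("Differences:", differences)
# Output: Differences: [7.0, -6.0, 12.0, -13.0, 0.0]

# Step 2: square each difference
squared = [d ** 2 for d in differences]
print("Squared:", squared)
# Output: Squared: [49.0, 36.0, 144.0, 169.0, 0.0]

# Step 3: add all squared differences
total = sum(squared)
print("Sum of squares:", total)
# Output: Sum of squares: 398.0

# Step 4: divide by the count
variance = total / len(math_scores)
print("Variance:", variance)
# Output: Variance: 79.6
```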

What is Standard Deviation?

👶 Explain Like I'm 5

Standard Deviation is just the square root of variance!

Why? Because variance is in "squared units" which is weird. Standard deviation brings it back to normal units.

If scores are measured in points, standard deviation is also in points!

📐 Standard Deviation

Standard Deviation = √Variance = √79.6 ≈ 8.9

This means scores typically differ from the average by about 9 points!

🤔 Why Square Root? (Plain English)

Variance uses squared differences (so positives and negatives don't cancel). That gives a number in "squared units" (e.g. points²), which is hard to interpret. Taking the square root puts the spread back in the same units as your data (e.g. points). So we use standard deviation when we want to say things like "scores typically vary by about 9 points from the average."

📌 When to Use Mean vs Median: Quick Guide

Use the mean when your data doesn't have extreme outliers and you want the "typical" value in the sense of "total split equally." Use the median when you have outliers (e.g. income, house prices) or skewed data—the median is the "middle" and isn't pulled by a few extreme values. For reporting "average" in real life, if in doubt, show both!

import numpy as np

math_scores = [85, 72, 90, 65, 78]

# Calculate Variance
variance = np.var(math_scores)
print("Variance:", variance)
# Output: Variance: 79.6

# Calculate Standard Deviation
std_dev = np.std(math_scores)
print("Standard Deviation:", std_dev)
# Output: Standard Deviation: 8.92...

# Or calculate it manually
std_dev_manual = np.sqrt(variance)
print("Standard Deviation (manual):", std_dev_manual)
# Output: Standard Deviation (manual): 8.92...

Chapter 6: Percentiles & Quartiles (Where Do You Rank?)

👶 Explain Like I'm 5

Imagine 100 kids take a test, and your score is at the 85th percentile.

That means you scored higher than about 85 of the 100 kids!

Only about 15 kids scored as high as you or higher. Pretty good!
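One simple way to turn this ranking idea into code is to count what percentage of scores fall below a given score. This is a simplified sketch: percent_below is a hypothetical helper written for this example, and statistics libraries use several slightly different percentile definitions.

```python
def percent_below(values, score):
    """Return the percentage of values that are strictly below the given score."""
    below = sum(1 for v in values if v < score)
    return 100 * below / len(values)

math_scores = [85, 72, 90, 65, 78]

# Ali scored 85: three of the five scores (65, 72, 78) are lower
print("Percent of scores below 85:", percent_below(math_scores, 85))
# Output: Percent of scores below 85: 60.0
```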

📊 Important Percentiles (Quartiles)

Minimum | 25% | 50% (MEDIAN, Q2) | 75% | 100% (Maximum)

Quartile	Percentile	Meaning
Q1 (First Quartile)	25th	25% of data is below this value
Q2 (Second Quartile)	50th	50% of data is below this value (same as MEDIAN!)
Q3 (Third Quartile)	75th	75% of data is below this value

import numpy as np

math_scores = [85, 72, 90, 65, 78]

# Calculate percentiles
q1 = np.percentile(math_scores, 25)  # 25th percentile
q2 = np.percentile(math_scores, 50)  # 50th percentile (median)
q3 = np.percentile(math_scores, 75)  # 75th percentile

print("Q1 (25th percentile):", q1)
print("Q2 (50th percentile):", q2)
print("Q3 (75th percentile):", q3)

# Output:
# Q1 (25th percentile): 72.0
# Q2 (50th percentile): 78.0
# Q3 (75th percentile): 85.0

Chapter 7: Outliers (The Weird Ones)

👶 Explain Like I'm 5

An outlier is a value that is WAY different from the others.

Like if everyone's age is 10, 11, 12, 10, 11... and then suddenly someone is 97!

That 97 is an outlier - it doesn't belong with the rest!

💰 Real Example: House Prices

Houses in a neighborhood: $200K, $220K, $180K, $210K, $190K, $5,000,000

That $5 million mansion is an OUTLIER! It will mess up our average badly.
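To see how badly one mansion distorts the average, here is a short sketch comparing the mean and median with and without it. Prices are written in thousands of dollars, and the variable names are chosen for this example.

```python
import statistics

# House prices in thousands of dollars; 5000 is the $5,000,000 mansion
prices = [200, 220, 180, 210, 190, 5000]
prices_without_mansion = [200, 220, 180, 210, 190]

mean_with = sum(prices) / len(prices)
mean_without = sum(prices_without_mansion) / len(prices_without_mansion)

print("Mean with mansion:", mean_with)
print("Mean without mansion:", mean_without)
# Output: Mean with mansion: 1000.0
# Output: Mean without mansion: 200.0

print("Median with mansion:", statistics.median(prices))
# Output: Median with mansion: 205.0
```

One house pushes the average from $200K to $1,000K, while the median barely moves.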

How to Find Outliers (IQR Method)

📐 The IQR Method

IQR = Q3 - Q1

IQR = Interquartile Range (the width of the middle 50% of the data)

Lower Limit = Q1 - 1.5 × IQR

Upper Limit = Q3 + 1.5 × IQR

Anything below Lower Limit or above Upper Limit is an OUTLIER!

import numpy as np

# Example: House prices (in thousands)
prices = [200, 220, 180, 210, 190, 5000]  # Note: 5000 looks suspicious!

# Step 1: Calculate Q1 and Q3
Q1 = np.percentile(prices, 25)
Q3 = np.percentile(prices, 75)
print("Q1:", Q1, "Q3:", Q3)

# Step 2: Calculate IQR
IQR = Q3 - Q1
print("IQR:", IQR)

# Step 3: Calculate limits
lower_limit = Q1 - 1.5 * IQR
upper_limit = Q3 + 1.5 * IQR
print("Lower limit:", lower_limit)
print("Upper limit:", upper_limit)

# Step 4: Find outliers
for price in prices:
    if price < lower_limit or price > upper_limit:
        print(f"OUTLIER FOUND: {price}")

# Output:
# Q1: 192.5 Q3: 217.5
# IQR: 25.0
# Lower limit: 155.0
# Upper limit: 255.0
# OUTLIER FOUND: 5000
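The same IQR steps can be packed into a reusable function. This sketch swaps NumPy for Python's built-in statistics module; find_outliers is a hypothetical helper name, and the ages list is an illustrative sample based on the ages example from this chapter.

```python
import statistics

def find_outliers(values):
    """Return the values that fall outside the 1.5 x IQR limits."""
    # method="inclusive" is chosen because it interpolates the same way as
    # np.percentile's default; other quartile methods can give other limits.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    return [v for v in values if v < lower_limit or v > upper_limit]

ages = [10, 11, 12, 10, 11, 97]  # illustrative sample of ages
print("Outliers in ages:", find_outliers(ages))
# Output: Outliers in ages: [97]
```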

Chapter 8: Normalization (Making Things Fair)

👶 Explain Like I'm 5

Imagine comparing a person's height (in cm) and weight (in kg).

Height: 180 cm. Weight: 75 kg.

180 is bigger than 75, but does that mean height is "more" than weight? NO!

Normalization puts everything on the SAME scale so we can compare fairly!

Method 1: Min-Max Normalization (Scale 0 to 1)

📐 Min-Max Formula

Scaled Value = (Value - Min) / (Max - Min)

This squishes ALL values to be between 0 and 1!
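Before reaching for a library, the formula can be tried by hand in plain Python. This is a minimal sketch; min_max_scale is a name invented for this example, and the check for identical values is an added safeguard against dividing by zero.

```python
def min_max_scale(values):
    """Apply (value - min) / (max - min) to every value."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        raise ValueError("All values are identical, so max - min is zero")
    return [(v - lowest) / (highest - lowest) for v in values]

heights = [150, 160, 170, 180, 190]
print("Scaled heights:", min_max_scale(heights))
# Output: Scaled heights: [0.0, 0.25, 0.5, 0.75, 1.0]
```

The shortest height becomes 0, the tallest becomes 1, and everything else lands in between.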

from sklearn.preprocessing import MinMaxScaler
import numpy as np

# Original data: heights (very different scale from weights!)
heights = [[150], [160], [170], [180], [190]]

# Create the scaler
scaler = MinMaxScaler()

# Fit and transform
heights_scaled = scaler.fit_transform(heights)

print("Original heights:", [h[0] for h in heights])
# float() turns NumPy numbers into plain Python floats so the list prints cleanly
print("Scaled heights:", [round(float(h[0]), 2) for h in heights_scaled])

# Output:
# Original heights: [150, 160, 170, 180, 190]
# Scaled heights: [0.0, 0.25, 0.5, 0.75, 1.0]

Method 2: Z-Score Normalization (Mean = 0, Std = 1)

📐 Z-Score Formula

Z = (Value - Mean) / Standard Deviation

This tells you "how many standard deviations away from average?"
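The formula can also be applied by hand. This sketch uses statistics.pstdev (the population standard deviation, which divides by the count) from the standard library; z_scores is a name invented for this example.

```python
import statistics

def z_scores(values):
    """Apply (value - mean) / standard deviation to every value."""
    mean = sum(values) / len(values)
    std_dev = statistics.pstdev(values)  # population SD (divide by count)
    if std_dev == 0:
        raise ValueError("All values are identical, so the standard deviation is zero")
    return [(v - mean) / std_dev for v in values]

heights = [150, 160, 170, 180, 190]
print("Z-scores:", [round(z, 2) for z in z_scores(heights)])
# Output: Z-scores: [-1.41, -0.71, 0.0, 0.71, 1.41]
```

A height of 190 gets a z-score of about 1.41, meaning it sits about 1.41 standard deviations above the average of 170.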

from sklearn.preprocessing import StandardScaler

# Same heights data
heights = [[150], [160], [170], [180], [190]]

# Create the Z-score scaler
scaler = StandardScaler()

# Fit and transform
heights_zscore = scaler.fit_transform(heights)

print("Original heights:", [h[0] for h in heights])
# float() turns NumPy numbers into plain Python floats so the list prints cleanly
print("Z-scores:", [round(float(h[0]), 2) for h in heights_zscore])

# Output:
# Original heights: [150, 160, 170, 180, 190]
# Z-scores: [-1.41, -0.71, 0.0, 0.71, 1.41]
# Notice: 170 (the average) becomes 0!

🚫 Common Mistakes in Descriptive Math

Using only the mean when you have outliers — One extreme value can drag the mean; use the median for a more typical "center."
Confusing variance with standard deviation — Variance is in squared units; SD is in the same units as your data (we use SD to interpret "spread").
Forgetting to scale before comparing features — If one column is in thousands and another in 0–10, normalize (e.g. z-score or min-max) so comparisons are fair.

💭 Short reflection

In one sentence: why is standard deviation more interpretable than variance when describing “how spread out” data is?

✅ CORE (Must know)

Mean, median, mode: center of data; median robust to outliers.
Variance & standard deviation: spread; SD in same units as data.
Percentiles/quartiles: where a value sits in the distribution.
Normalization: scale features (e.g. min-max, z-score) for fair comparison and ML.

📚 NON-CORE (Good to know)

Skewness, kurtosis; correlation vs causation.

Chapter 9: Summary - What Did We Learn?

Concept	What It Answers	Simple Explanation
Mean	"What's the typical value?"	Add everything up, divide by count
Median	"What's in the middle?"	Sort the values, pick the middle one
Mode	"What's most popular?"	The value that appears most often
Variance	"How spread out is the data?"	Average of squared differences from mean
Standard Deviation	"Typical distance from average?"	Square root of variance
Percentile/Quartile	"Where do I rank?"	What % of data is below this value
Outlier	"Is this value weird?"	Values way outside the normal range
Normalization	"How to compare fairly?"	Put all features on the same scale

🎉 Congratulations!

You now know the basic math needed for Data Science!

These concepts will be used in EVERY machine learning project!
