Don't worry if you hate math! We'll explain everything like you're 5 years old. No prior knowledge needed - we start from ZERO!
Data is just a fancy word for information written down as numbers or words.
Think of it like a list. When your mom writes a grocery list, that's data!
Imagine you have test scores for your class:
| Student | Math Score | English Score | Science Score |
|---|---|---|---|
| Ali | 85 | 78 | 92 |
| Sara | 72 | 88 | 76 |
| Ahmed | 90 | 85 | 88 |
| Fatima | 65 | 70 | 75 |
| Omar | 78 | 82 | 80 |
That table above? That's DATA! Each row is one student. Each column is one type of score. Simple!
Imagine the teacher asks: "Who is the best student?"
You can't just look at the table and know immediately. You need to calculate something!
That's where math comes in - it helps us understand and summarize data!
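As a rough sketch, the table above can be written in Python as a dictionary, and one possible way to answer the teacher's question is to add up each student's three scores. Treating "best" as "highest total" is an assumption made for this example, and the names `scores` and `totals` are only illustrative.

```python
# Each student maps to their [Math, English, Science] scores from the table
scores = {
    "Ali": [85, 78, 92],
    "Sara": [72, 88, 76],
    "Ahmed": [90, 85, 88],
    "Fatima": [65, 70, 75],
    "Omar": [78, 82, 80],
}

# Add up the three scores for each student
totals = {name: sum(marks) for name, marks in scores.items()}

for name, total in totals.items():
    print(name, total)

# Assumption: "best" means the highest total across all three subjects
best_student = max(totals, key=totals.get)
print("Highest total:", best_student)
# Output (last line): Highest total: Ahmed
```

A different definition of "best" (for example, the highest Science score) would give a different answer, which is why we need clear calculations.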
The course includes Maths for Data Science.pdf for a slide-style overview and a Practice Questions on Maths notebook for extra exercises.
The mean (we also call it "average") answers this question:
"If everyone got the SAME score, what would that score be?"
It's like splitting a pizza equally among friends!
You have 3 friends. One has 2 slices, one has 4 slices, one has 6 slices.
If you share equally among 3 friends: (2 + 4 + 6) ÷ 3 = 12 ÷ 3 = 4
The AVERAGE is 4 slices per person!
Just add everything up, then divide by how many things you added!
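The pizza example can be sketched in a few lines of Python. The variable names here are just for illustration.

```python
# Slices each friend starts with
slices = [2, 4, 6]

total_slices = sum(slices)   # 2 + 4 + 6 = 12
friends = len(slices)        # 3 friends

# Add everything up, then divide by how many things you added
average_slices = total_slices / friends

print("Average slices per friend:", average_slices)
# Output: Average slices per friend: 4.0
```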
Let's find the average Math score from our class data:
Ali: 85, Sara: 72, Ahmed: 90, Fatima: 65, Omar: 78
85 + 72 + 90 + 65 + 78 = 390
We have 5 students
390 ÷ 5 = 78
The average Math score is: 78
This means if everyone scored the same, they'd all have 78!
```python
# Step 1: Create a list of Math scores
math_scores = [85, 72, 90, 65, 78]

# Step 2: Add all scores together using sum()
total = sum(math_scores)
print("Sum of all scores:", total)
# Output: Sum of all scores: 390

# Step 3: Count how many scores using len()
count = len(math_scores)
print("Number of students:", count)
# Output: Number of students: 5

# Step 4: Divide to get the average
average = total / count
print("Average Math score:", average)
# Output: Average Math score: 78.0

# OR use the shortcut with NumPy library
import numpy as np
average = np.mean(math_scores)
print("Average (using NumPy):", average)
# Output: Average (using NumPy): 78.0
```
- math_scores = [85, 72, 90, 65, 78] → Creates a list (like a container) holding all 5 scores
- sum() → A built-in Python function that adds up everything in a list
- len() → A built-in Python function that counts how many items are in a list
- np.mean() → NumPy's shortcut function that calculates the average in one step

The median is the middle person when everyone stands in a line from shortest to tallest!
If 5 kids stand in order of height, the median is the 3rd kid - the one in the exact middle!
Let's sort the Math scores from lowest to highest: 65, 72, 78, 85, 90
The middle number is: 78
That's the MEDIAN!
65, 72, 78, 85, 90 (we already sorted them!)
We have 5 numbers (odd). Middle position = (5+1)/2 = 3rd number = 78
Example: If we had 4 numbers [65, 72, 78, 85], median = (72+78)/2 = 75
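Both cases (odd and even counts) can be handled by one small function. This is a minimal sketch; the function name `median` is a hypothetical helper written for this lesson, not a built-in.

```python
def median(values):
    """Return the middle value of a list (hypothetical helper for this lesson)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd count: one value sits exactly in the middle
        return ordered[middle]
    # Even count: average the two middle values
    return (ordered[middle - 1] + ordered[middle]) / 2

print(median([85, 72, 90, 65, 78]))  # Output: 78
print(median([65, 72, 78, 85]))      # Output: 75.0
```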
Mean can be tricked by extreme values!
Example: If Bill Gates walks into a room of 10 regular people, the AVERAGE wealth becomes billions! But the MEDIAN stays normal.
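Here is a small sketch of that idea using Python's built-in `statistics` module. All the wealth figures are made up for illustration, including the billionaire's.

```python
from statistics import mean, median

# Hypothetical wealth (in dollars) of 10 regular people
room = [40_000, 45_000, 50_000, 52_000, 55_000,
        58_000, 60_000, 62_000, 65_000, 70_000]

print("Mean before:", mean(room))
print("Median before:", median(room))

# A billionaire walks in (made-up figure, for illustration only)
room_with_billionaire = room + [100_000_000_000]

print("Mean after:", mean(room_with_billionaire))
print("Median after:", median(room_with_billionaire))
```

When you run it, the mean jumps into the billions while the median only moves to the next person in line.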
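```python
import numpy as np

math_scores = [85, 72, 90, 65, 78]

# Method 1: Manual way
sorted_scores = sorted(math_scores)  # Sort the list
print("Sorted:", sorted_scores)
# Output: Sorted: [65, 72, 78, 85, 90]
middle = len(sorted_scores) // 2  # Middle index (works for an odd count)
print("Median (manual):", sorted_scores[middle])
# Output: Median (manual): 78

# Method 2: Using NumPy (easier!)
median = np.median(math_scores)
print("Median:", median)
# Output: Median: 78.0
```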
The mode is the most popular value - the one that appears the MOST times!
Like: What's the most popular ice cream flavor in your class?
Ask 10 kids their favorite flavor:
Chocolate, Vanilla, Chocolate, Strawberry, Chocolate, Vanilla, Chocolate, Mango, Vanilla, Chocolate
Chocolate appears 5 times (most!) → Chocolate is the MODE!
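Counting by hand gets tiring, so here is a minimal sketch that counts the flavors with `Counter` from Python's standard library. The variable names are only illustrative.

```python
from collections import Counter

flavors = ["Chocolate", "Vanilla", "Chocolate", "Strawberry", "Chocolate",
           "Vanilla", "Chocolate", "Mango", "Vanilla", "Chocolate"]

# Count how many times each flavor appears
counts = Counter(flavors)

# most_common(1) returns a list holding the single most frequent (value, count) pair
mode_flavor, mode_count = counts.most_common(1)[0]

print("Mode:", mode_flavor)         # Output: Mode: Chocolate
print("Times chosen:", mode_count)  # Output: Times chosen: 5
```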
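```python
from scipy import stats

# Example: Shoe sizes sold today
shoe_sizes = [7, 8, 8, 9, 8, 10, 8, 7, 9, 8]

# Find the mode (most common value)
mode_result = stats.mode(shoe_sizes)

# Note: in SciPy 1.11 and later, .mode and .count are single numbers for a
# simple list like this. Older SciPy versions return arrays, so there you
# would write mode_result.mode[0] and mode_result.count[0] instead.
print("Most common shoe size:", mode_result.mode)
print("How many times it appears:", mode_result.count)
# Output: Most common shoe size: 8
# Output: How many times it appears: 5
```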
| Measure | What It Tells You | Best For |
|---|---|---|
| Mean | The "typical" value if distributed equally | Normal data without extreme values |
| Median | The actual middle value | Data with some extreme values (like salaries) |
| Mode | The most common value | Categories (like favorite color, shoe size) |
Imagine two classes both have an average score of 75.
Class A: Everyone scored between 70-80 (very similar!)
Class B: Some scored 40, some scored 100 (very different!)
Variance and Standard Deviation tell us "how spread out are the scores?"
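To make this concrete, here is a small sketch with two invented classes that share the same average. The scores are hypothetical, and the range (highest minus lowest) is used only as a quick, crude first look at spread.

```python
# Hypothetical scores, invented for this example (both classes average 75)
class_a = [70, 72, 75, 78, 80]
class_b = [40, 60, 75, 100, 100]

for name, scores in [("Class A", class_a), ("Class B", class_b)]:
    average = sum(scores) / len(scores)
    score_range = max(scores) - min(scores)  # highest minus lowest
    print(name, "average:", average, "range:", score_range)

# Output:
# Class A average: 75.0 range: 10
# Class B average: 75.0 range: 60
```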
Both players have the same AVERAGE distance from bullseye...
🎯 Player A: All darts close together (Low Variance)
💥 Player B: Darts scattered everywhere (High Variance)
Variance measures how far the numbers are from the mean, on average, using squared distances.
We square the differences so negative numbers don't cancel out positive ones!
Using our Math scores: 85, 72, 90, 65, 78 (Mean = 78)
85-78=7, 72-78=-6, 90-78=12, 65-78=-13, 78-78=0
7²=49, (-6)²=36, 12²=144, (-13)²=169, 0²=0
49 + 36 + 144 + 169 + 0 = 398
398 ÷ 5 = 79.6 → This is the VARIANCE!
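The same steps can be followed one by one in plain Python. This is a minimal sketch that divides by the number of scores, exactly as in the hand calculation; the variable names are just for illustration.

```python
math_scores = [85, 72, 90, 65, 78]

# Step 1: Find the mean
mean = sum(math_scores) / len(math_scores)             # 78.0

# Step 2: Subtract the mean from every score
differences = [score - mean for score in math_scores]  # [7.0, -6.0, 12.0, -13.0, 0.0]

# Step 3: Square each difference
squared = [d ** 2 for d in differences]                # [49.0, 36.0, 144.0, 169.0, 0.0]

# Step 4: Average the squared differences
variance = sum(squared) / len(squared)

print("Variance:", variance)
# Output: Variance: 79.6
```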
Standard Deviation is just the square root of variance!
Why? Because variance is in "squared units" which is weird. Standard deviation brings it back to normal units.
If scores are measured in points, standard deviation is also in points!
For our Math scores: √79.6 ≈ 8.92. This means scores typically differ from the average by about 9 points!
Variance uses squared differences (so positives and negatives don't cancel). That gives a number in "squared units" (e.g. points²), which is hard to interpret. Taking the square root puts the spread back in the same units as your data (e.g. points). So we use standard deviation when we want to say things like "scores typically vary by about 9 points from the average."
Use the mean when your data doesn't have extreme outliers and you want the "typical" value in the sense of "total split equally." Use the median when you have outliers (e.g. income, house prices) or skewed data: the median is the "middle" and isn't pulled by a few extreme values. For reporting "average" in real life, if in doubt, show both!
```python
import numpy as np

math_scores = [85, 72, 90, 65, 78]

# Calculate Variance
variance = np.var(math_scores)
print("Variance:", variance)
# Output: Variance: 79.6

# Calculate Standard Deviation
std_dev = np.std(math_scores)
print("Standard Deviation:", std_dev)
# Output: Standard Deviation: 8.92...

# Or calculate it manually
std_dev_manual = np.sqrt(variance)
print("Standard Deviation (manual):", std_dev_manual)
# Output: Standard Deviation (manual): 8.92...
```
Imagine 100 kids take a test. You scored in the 85th percentile.
That means you did better than 85 out of 100 kids!
Only 15 kids did better than you. Pretty good!
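A minimal sketch of the idea, using the five Math scores from earlier: count how many scores are below yours and turn that into a percentage. The helper name `percent_below` is hypothetical, and statistics libraries use several slightly different percentile definitions, so treat this as an illustration only.

```python
def percent_below(scores, my_score):
    """Percentage of scores strictly below my_score (hypothetical helper)."""
    below = sum(1 for s in scores if s < my_score)
    return 100 * below / len(scores)

math_scores = [85, 72, 90, 65, 78]

# Ali scored 85: three of the five scores (72, 65, 78) are lower
print(percent_below(math_scores, 85))  # Output: 60.0
```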
0% (Minimum) → 25% (Q1) → 50% (MEDIAN, Q2) → 75% (Q3) → 100% (Maximum)
| Quartile | Percentile | Meaning |
|---|---|---|
| Q1 (First Quartile) | 25th | 25% of data is below this value |
| Q2 (Second Quartile) | 50th | 50% of data is below this value (same as MEDIAN!) |
| Q3 (Third Quartile) | 75th | 75% of data is below this value |
```python
import numpy as np

math_scores = [85, 72, 90, 65, 78]

# Calculate percentiles
q1 = np.percentile(math_scores, 25)  # 25th percentile
q2 = np.percentile(math_scores, 50)  # 50th percentile (median)
q3 = np.percentile(math_scores, 75)  # 75th percentile

print("Q1 (25th percentile):", q1)
print("Q2 (50th percentile):", q2)
print("Q3 (75th percentile):", q3)
# Output:
# Q1 (25th percentile): 72.0
# Q2 (50th percentile): 78.0
# Q3 (75th percentile): 85.0
```
An outlier is a value that is WAY different from the others.
Like if everyone's age is 10, 11, 12, 10, 11... and then suddenly someone is 97!
That 97 is an outlier - it doesn't belong with the rest!
Houses in a neighborhood: $200K, $220K, $180K, $210K, $190K, $5,000,000
That $5 million mansion is an OUTLIER! It will mess up our average badly.
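A quick sketch shows how much damage one mansion does to the mean, while the median barely notices. Prices are written in thousands of dollars, so 5000 stands for $5,000,000.

```python
from statistics import median

# House prices in thousands of dollars (5000 = the $5,000,000 mansion)
prices = [200, 220, 180, 210, 190, 5000]
prices_without_mansion = [200, 220, 180, 210, 190]

mean_with = sum(prices) / len(prices)
mean_without = sum(prices_without_mansion) / len(prices_without_mansion)

print("Mean without mansion:", mean_without)  # Output: Mean without mansion: 200.0
print("Mean with mansion:", mean_with)        # Output: Mean with mansion: 1000.0
print("Median with mansion:", median(prices)) # Output: Median with mansion: 205.0
```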
IQR = Interquartile Range (the middle 50% of data) = Q3 - Q1
Lower Limit = Q1 - 1.5 × IQR
Upper Limit = Q3 + 1.5 × IQR
Anything below Lower Limit or above Upper Limit is an OUTLIER!
```python
import numpy as np

# Example: House prices (in thousands)
prices = [200, 220, 180, 210, 190, 5000]  # Note: 5000 looks suspicious!

# Step 1: Calculate Q1 and Q3
Q1 = np.percentile(prices, 25)
Q3 = np.percentile(prices, 75)
print("Q1:", Q1, "Q3:", Q3)

# Step 2: Calculate IQR
IQR = Q3 - Q1
print("IQR:", IQR)

# Step 3: Calculate limits
lower_limit = Q1 - 1.5 * IQR
upper_limit = Q3 + 1.5 * IQR
print("Lower limit:", lower_limit)
print("Upper limit:", upper_limit)

# Step 4: Find outliers
for price in prices:
    if price < lower_limit or price > upper_limit:
        print(f"OUTLIER FOUND: {price}")

# Output:
# Q1: 192.5 Q3: 217.5
# IQR: 25.0
# Lower limit: 155.0
# Upper limit: 255.0
# OUTLIER FOUND: 5000
```
Imagine comparing a person's height (in cm) and weight (in kg).
Height: 180 cm. Weight: 75 kg.
180 is bigger than 75, but does that mean height is "more" than weight? NO!
Normalization puts everything on the SAME scale so we can compare fairly!
Min-Max scaling squishes ALL values to be between 0 and 1: scaled value = (value - min) ÷ (max - min)
```python
from sklearn.preprocessing import MinMaxScaler

# Original data: heights (very different scale from weights!)
# The scaler expects a 2-D table, so each height sits in its own row
heights = [[150], [160], [170], [180], [190]]

# Create the scaler
scaler = MinMaxScaler()

# Fit and transform
heights_scaled = scaler.fit_transform(heights)

print("Original heights:", [h[0] for h in heights])
# float() turns NumPy numbers into plain Python numbers so they print cleanly
print("Scaled heights:", [round(float(h[0]), 2) for h in heights_scaled])

# Output:
# Original heights: [150, 160, 170, 180, 190]
# Scaled heights: [0.0, 0.25, 0.5, 0.75, 1.0]
```
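To see what the scaler is doing, the same result can be worked out by hand in plain Python. This is a minimal sketch of the usual min-max formula, with illustrative variable names.

```python
heights = [150, 160, 170, 180, 190]

lowest = min(heights)    # 150
highest = max(heights)   # 190

# Min-max formula: (value - min) / (max - min)
scaled = [(h - lowest) / (highest - lowest) for h in heights]

print(scaled)
# Output: [0.0, 0.25, 0.5, 0.75, 1.0]
```

The smallest height becomes 0, the largest becomes 1, and everything else lands in between.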
A Z-score tells you "how many standard deviations away from average?": z = (value - mean) ÷ standard deviation
```python
from sklearn.preprocessing import StandardScaler

# Same heights data
heights = [[150], [160], [170], [180], [190]]

# Create the Z-score scaler
scaler = StandardScaler()

# Fit and transform
heights_zscore = scaler.fit_transform(heights)

print("Original heights:", [h[0] for h in heights])
# float() turns NumPy numbers into plain Python numbers so they print cleanly
print("Z-scores:", [round(float(h[0]), 2) for h in heights_zscore])

# Output:
# Original heights: [150, 160, 170, 180, 190]
# Z-scores: [-1.41, -0.71, 0.0, 0.71, 1.41]
# Notice: 170 (the average) becomes 0!
```
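The same Z-scores can be worked out by hand. This sketch divides by the number of values when computing the standard deviation, which matches what StandardScaler does; the variable names are illustrative.

```python
import math

heights = [150, 160, 170, 180, 190]

mean = sum(heights) / len(heights)  # 170.0

# Standard deviation: square root of the average squared difference
variance = sum((h - mean) ** 2 for h in heights) / len(heights)  # 200.0
std_dev = math.sqrt(variance)

# Z-score formula: (value - mean) / standard deviation
z_scores = [round((h - mean) / std_dev, 2) for h in heights]

print(z_scores)
# Output: [-1.41, -0.71, 0.0, 0.71, 1.41]
```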
In one sentence: why is standard deviation more interpretable than variance when describing "how spread out" data is?
| Concept | What It Answers | Simple Explanation |
|---|---|---|
| Mean | "What's the typical value?" | Add everything up, divide by count |
| Median | "What's in the middle?" | Sort the values, pick the middle one |
| Mode | "What's most popular?" | The value that appears most often |
| Variance | "How spread out is the data?" | Average of squared differences from mean |
| Standard Deviation | "Typical distance from average?" | Square root of variance |
| Percentile/Quartile | "Where do I rank?" | What % of data is below this value |
| Outlier | "Is this value weird?" | Values way outside the normal range |
| Normalization | "How to compare fairly?" | Put all features on the same scale |
You now know the basic math needed for Data Science!
These concepts show up in almost every machine learning project!