Statistics sounds scary, but it's just asking questions about data! We'll make it super simple with everyday examples.
Statistics is like being a detective with numbers!
You have clues (data), and statistics helps you figure out what they mean!
Statistics is the set of tools we use to summarize data (descriptive) and to make conclusions or predictions from samples (inferential), including when to trust that a pattern isn't just luck (hypothesis tests, p-values).
A pizza restaurant wants to know: "Which pizza is most popular?"
They can't ask every single customer ever. So they look at last month's orders (that's their sample) and use statistics to guess what ALL customers like!
What it does: Summarizes data you already have
Example: "Last month, 60% of orders were pepperoni pizza"
Tools: Mean, Median, Mode, Charts
What it does: Makes predictions about data you DON'T have
Example: "We predict next month will also be ~60% pepperoni"
Tools: Hypothesis tests, Confidence intervals
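To see the descriptive half in code, here is a tiny sketch using the pizza example (the numbers are made up, purely for illustration):

```python
import statistics

# Last month's pizza orders (made-up sample data)
orders = ["pepperoni"] * 60 + ["margherita"] * 25 + ["veggie"] * 15

# Descriptive statistics: summarize the data we HAVE
most_popular = statistics.mode(orders)
share = orders.count(most_popular) / len(orders)
print(f"Most popular: {most_popular} ({share:.0%} of orders)")
# Most popular: pepperoni (60% of orders)

# Mean and median work the same way on numeric data
prices = [8, 9, 9, 10, 12, 30]   # one expensive special pizza
print(statistics.mean(prices))    # 13.0 - pulled up by the outlier
print(statistics.median(prices))  # 9.5  - more "typical" here
```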
Imagine measuring the height of 1000 adults:
When you graph this, it makes a bell shape!
Graph below: X-axis = value (e.g. height, test score); Y-axis = how often that value appears (frequency). The red dashed line marks the mean, right at the peak of the bell.
[Figure: bell curve. Few people at the Short end, MOST ARE HERE around the Average, few at the Tall end.]
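If you want to draw a figure like this yourself, here is a minimal sketch (my own, not from the course notebook), with illustrative height numbers:

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulate 1,000 adult heights (cm): mean 170, std 8 (illustrative values)
rng = np.random.default_rng(0)
heights = rng.normal(loc=170, scale=8, size=1000)

# Histogram of the values: the bars form the bell shape
plt.hist(heights, bins=30, color="skyblue", edgecolor="white")

# Red dashed line at the mean, the peak of the bell
plt.axvline(heights.mean(), color="red", linestyle="--",
            label=f"mean = {heights.mean():.1f}")
plt.xlabel("Height (cm)")
plt.ylabel("Frequency")
plt.legend()
plt.show()
```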
This shape appears EVERYWHERE in nature!
| Distance from Mean | % of Data |
|---|---|
| Within 1 Standard Deviation | 68% |
| Within 2 Standard Deviations | 95% |
| Within 3 Standard Deviations | 99.7% |
If the average test score is 75 and the standard deviation is 10: 68% of students score between 65 and 85, 95% between 55 and 95, and 99.7% between 45 and 105.
If someone scores 20, that's VERY unusual! (an outlier)
```python
import numpy as np

# Generate 10,000 random numbers from a normal distribution
# mean = 75 (average test score)
# std = 10 (how spread out the scores are)
np.random.seed(42)  # For reproducibility
test_scores = np.random.normal(loc=75, scale=10, size=10000)

# Let's verify the 68-95-99.7 rule!
mean = np.mean(test_scores)
std = np.std(test_scores)

# Count how many fall within 1, 2, 3 standard deviations
within_1_std = np.sum((test_scores >= mean - std) & (test_scores <= mean + std))
within_2_std = np.sum((test_scores >= mean - 2*std) & (test_scores <= mean + 2*std))
within_3_std = np.sum((test_scores >= mean - 3*std) & (test_scores <= mean + 3*std))

# Dividing a count out of 10,000 by 100 converts it straight to a percentage
print(f"Within 1 std: {within_1_std/100:.1f}% (expected: 68%)")
print(f"Within 2 std: {within_2_std/100:.1f}% (expected: 95%)")
print(f"Within 3 std: {within_3_std/100:.1f}% (expected: 99.7%)")

# Output:
# Within 1 std: 68.2% (expected: 68%)
# Within 2 std: 95.4% (expected: 95%)
# Within 3 std: 99.7% (expected: 99.7%)
```
The course source uses np.random.seed(10) and original_data = np.random.normal(size=10000) (mean 0, variance 1 by default). Then len(original_data) gives 10000. To plot the shape, the notebook uses sns.distplot(original_data, hist=False, kde=True) for a smooth curve and sns.distplot(original_data, hist=True, kde=False) for a histogram; both show the bell shape. Loading Excel: data = pd.read_excel("titanic3.xlsx") (download titanic3.xlsx from the datasets page). The course also includes Statistics.pdf for a slide-style summary.
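One caution if you run that notebook today: `sns.distplot` was deprecated in seaborn 0.11 and removed in recent releases. A minimal modern equivalent, assuming seaborn >= 0.11:

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

np.random.seed(10)
original_data = np.random.normal(size=10000)  # mean 0, variance 1 by default

fig, ax = plt.subplots(1, 2, figsize=(10, 5))
sns.kdeplot(original_data, fill=True, linewidth=2, color="green", ax=ax[0])  # smooth curve
sns.histplot(original_data, color="green", ax=ax[1])                         # histogram
plt.show()
```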
Correlation tells us: "When one thing changes, does the other thing also change?"
Like: When it's sunny, do ice cream sales go up?
Positive correlation: When X goes UP, Y goes UP too!
Examples: height and weight; hours studied and test scores; temperature and ice cream sales.
Negative correlation: When X goes UP, Y goes DOWN!
Examples: a car's age and its resale value; altitude and air temperature.
No correlation: X and Y don't affect each other!
Examples: shoe size and exam scores; a person's height and their favorite color.
Just because two things move together doesn't mean one CAUSES the other!
Example: Ice cream sales and drowning deaths are correlated. Does ice cream cause drowning? NO!
Both go up in summer because of the heat - that's the real cause!
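Here is a tiny simulation (my own sketch, not course code) that shows how a hidden third variable creates correlation: temperature drives both ice cream sales and the number of swimmers, so the two are strongly correlated even though neither causes the other.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# The hidden cause: daily temperature over a year
temperature = rng.normal(loc=25, scale=7, size=365)

# Both variables depend on temperature, not on each other
ice_cream_sales = 20 * temperature + rng.normal(scale=50, size=365)
swimmers = 5 * temperature + rng.normal(scale=15, size=365)

df = pd.DataFrame({"ice_cream_sales": ice_cream_sales, "swimmers": swimmers})
print(df.corr().round(2))
# The two columns are strongly correlated, yet neither causes the other:
# temperature is the common cause (a "confounder").
```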
```python
import pandas as pd

# Create sample data: Study hours vs Test scores
data = {
    'study_hours': [1, 2, 3, 4, 5, 6, 7, 8],
    'test_score': [50, 55, 60, 65, 70, 75, 80, 85]
}
df = pd.DataFrame(data)

# Calculate correlation
correlation = df['study_hours'].corr(df['test_score'])
print(f"Correlation: {correlation:.2f}")
# Output: Correlation: 1.00
# Perfect positive correlation! More study = Higher score

# See correlation matrix for all columns
print("\nCorrelation Matrix:")
print(df.corr())
# Output:
#              study_hours  test_score
# study_hours          1.0         1.0
# test_score           1.0         1.0
```
Datasets used in this section: Hotel Reservations (for the Chi-Square test and T-Test), Housing, Titanic, plus practice exercises.
The Chi-Square test answers: "Is there a relationship between two CATEGORIES?"
Like: Is there a connection between Gender (Male/Female) and Favorite Color (Red/Blue)?
We want to know: "Does the type of MEAL PLAN affect whether people CANCEL their booking?"
Both variables are categories: meal plan (Meal Plan 1 / Meal Plan 2 / Not Selected) and booking status (Canceled / Not Canceled).
Step 1: Count how many bookings fall into each combination of categories.
Step 2: Work out what counts we would expect in each cell if there was NO relationship.
Step 3: Compare observed vs expected counts. Are they very different? If yes, there's a relationship!
p < 0.05 means "Yes, there IS a relationship!"
| Meal Plan | Canceled | Not Canceled | Total |
|---|---|---|---|
| Meal Plan 1 | 8,679 | 19,156 | 27,835 |
| Meal Plan 2 | 1,506 | 1,799 | 3,305 |
| Not Selected | 1,699 | 3,431 | 5,130 |
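To make Step 2 concrete: the expected count for a cell is its row total times its column total, divided by the grand total. A quick hand check on the table above (my own arithmetic, not course code):

```python
# Expected count for a cell if meal plan and cancellation were unrelated:
# expected = row_total * column_total / grand_total
grand_total = 27835 + 3305 + 5130    # 36,270 bookings in total
canceled_total = 8679 + 1506 + 1699  # 11,884 canceled bookings in total

expected = 27835 * canceled_total / grand_total  # Meal Plan 1, Canceled cell
print(round(expected))  # ~9120 expected, vs 8,679 actually observed
```

The Chi-Square statistic adds up gaps like this one (observed 8,679 vs expected ~9,120) across all six cells.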
```python
import pandas as pd
from scipy.stats import chi2_contingency

# Load the hotel data
data = pd.read_csv("Hotel Reservations.csv")

# Step 1: Create contingency table
# pd.crosstab counts combinations of two categorical variables
table = pd.crosstab(data['type_of_meal_plan'], data['booking_status'])
print("Contingency Table:")
print(table)

# Step 2: Run Chi-Square test
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"\nChi-Square Statistic: {chi2:.2f}")
print(f"p-value: {p_value}")

# Step 3: Interpret the result
if p_value < 0.05:
    print("YES! Meal plan IS related to cancellation!")
else:
    print("NO relationship detected between meal plan and cancellation")

# Output:
# Chi-Square Statistic: 271.45
# p-value: 4.477e-61 (that's TINY!)
# YES! Meal plan IS related to cancellation!
```
p-value = Probability of getting this result by pure chance
If p-value < 0.05 (less than 5% chance), we say "This is NOT just luck - there's a real relationship!"
Our p-value was 4.477 × 10⁻⁶¹, a decimal point followed by sixty zeros before the digits even begin. TINY! So definitely not luck!
The T-Test answers: "Are these two groups REALLY different, or is it just random chance?"
Example: Do customers who book early (high lead time) cancel more than those who book late?
Chi-Square: use it when BOTH variables are categories.
Example: Gender vs Favorite Color
(Male/Female) vs (Red/Blue/Green)
T-Test: use it when comparing a NUMBER across categories.
Example: Lead Time vs Booking Status
(Days: 10, 20, 30...) vs (Canceled/Not Canceled)
Lead time = How many days before arrival did they book?
We have two groups: the lead times of customers who CANCELED, and the lead times of customers who DIDN'T. We compare the average of each group.
If the averages are VERY different โ Lead time matters for cancellation!
```python
import pandas as pd
from scipy import stats

# Load data
data = pd.read_csv("Hotel Reservations.csv")

# Step 1: Separate the two groups
canceled = data[data['booking_status'] == 'Canceled']
not_canceled = data[data['booking_status'] == 'Not_Canceled']

# Step 2: Get the lead_time for each group
lead_time_canceled = canceled['lead_time']
lead_time_not_canceled = not_canceled['lead_time']

# Let's see the averages first
print("Average lead time for CANCELED bookings:", lead_time_canceled.mean().round(1), "days")
print("Average lead time for NOT CANCELED:", lead_time_not_canceled.mean().round(1), "days")

# Step 3: Run the T-Test
t_stat, p_value = stats.ttest_ind(lead_time_canceled, lead_time_not_canceled)
print(f"\nT-Statistic: {t_stat:.2f}")
print(f"p-value: {p_value}")

# Step 4: Interpret
if p_value < 0.05:
    print("Lead time IS significantly different between the groups!")
else:
    print("No significant difference in lead time")

# Output:
# Average lead time for CANCELED bookings: 110.5 days
# Average lead time for NOT CANCELED: 74.2 days
# T-Statistic: 45.67
# p-value: 0.0
# Lead time IS significantly different between the groups!
```
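One caveat worth knowing: `stats.ttest_ind` assumes both groups have equal variances by default. If that assumption is shaky (common with real booking data), Welch's variant is the safer choice:

```python
# Welch's t-test: drops the equal-variance assumption
t_stat, p_value = stats.ttest_ind(lead_time_canceled, lead_time_not_canceled,
                                  equal_var=False)
```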
People who CANCELED booked ~110 days in advance (on average)
People who DIDN'T cancel booked ~74 days in advance
Conclusion: People who book too early are more likely to cancel!
(Maybe their plans change over time)
Every line of code (verbatim).
# --- Code cell 1 ---
import warnings
warnings.filterwarnings("ignore")
# import modules
import numpy as np
import pandas as pd
# plotting modules
import seaborn as sns
import matplotlib.pyplot as plt
# --- Code cell 2 ---
from IPython.core.display import HTML
HTML("""
<style>
h1 { color: Purple !important; }
h2 { color: green !important; }
h3 { color: blue !important; }
</style>
""")
# --- Code cell 6 ---
np.random.seed(10)
original_data = np.random.normal(size = 10000) # by default mean is zero and variance is 1
# --- Code cell 7 ---
len(original_data)
# --- Code cell 8 ---
# creating axes to draw plots
fig, ax = plt.subplots(1, 2,figsize = (30,10))
sns.distplot(original_data, hist = False, kde = True,
kde_kws = {'shade': True, 'linewidth': 2},
color ="green", ax = ax[0])
sns.distplot(original_data, hist = True, kde = False,
kde_kws = {'shade': True, 'linewidth': 2},
color ="green", ax = ax[1])
# rescaling the subplots
fig.set_figheight(5)
fig.set_figwidth(10)
# --- Code cell 13 ---
#conda install xlrd / pip install xlrd (in the Anaconda command prompt)
data = pd.read_excel("titanic3.xlsx")
# --- Code cell 14 ---
data.head(10)
# --- Code cell 15 ---
print(data.isnull().sum()) # add example of percentage
# --- Code cell 16 ---
print(data.isnull().sum()*100/len(data))
# --- Code cell 17 ---
#Drop columns with IDs and large number of missing values
data.drop(["name", "ticket", "cabin","boat","body","home.dest"],axis=1,inplace=True)
# --- Code cell 18 ---
data.describe()
# --- Code cell 19 ---
data.describe(include='all')
# --- Code cell 20 ---
data['embarked'].value_counts()
# --- Code cell 21 ---
data['embarked'].fillna('S',inplace=True)
# --- Code cell 22 ---
data['embarked'].value_counts()
# --- Code cell 23 ---
import statistics
age_variance = statistics.variance(data['age'].dropna())
print("Variance of Age: ", age_variance)
# --- Code cell 24 ---
import math
std_dev_age = math.sqrt(age_variance)  # note: the statistics module has no sqrt function
print("Standard deviation of Age: ", std_dev_age)
# --- Code cell 25 ---
sns.distplot(data['age'], hist = True, kde = False,color ="green")
# --- Code cell 27 ---
print("mean age before imputation: ", data['age'].mean())
#print("\n")
data['age'].fillna(data['age'].median(),inplace=True)
print("mean age after imputation: ", data['age'].mean())
print(data.isnull().sum())
# --- Code cell 28 ---
sns.distplot(data['age'], hist = True, kde = False,color ="green")
# --- Code cell 29 ---
#Another options:
#Drop columns with more than 60% missing data
#Fit regression model to predict missing age data
# Always check accuracy of main task after imputation
# Take business logic into consideration for data imputation
# --- Code cell 35 ---
data = pd.read_csv("Housing.csv")
# --- Code cell 36 ---
data = data[['price','area']]
data.describe()
# --- Code cell 37 ---
sns.boxplot(data['area']).set(xlabel= 'area')
# --- Code cell 38 ---
# outlier treatment for area
# data['area'] / data.area
Q1 = data.area.quantile(0.25)
Q3 = data.area.quantile(0.75)
IQR = Q3 - Q1
print("Q1 value: ", Q1)
print("Q3 value: ", Q3)
print("IQR value: ", Q3 - Q1)
# --- Code cell 39 ---
print("Lower threshold of outlier value: ", Q1 - 1.5*IQR)
print("Upper threshold of outlier value: ", Q3 + 1.5*IQR)
# --- Code cell 40 ---
#3650
# --- Code cell 41 ---
data = data[(data.area >= Q1 - 1.5*IQR) & (data.area <= Q3 + 1.5*IQR)]
data.describe()
# --- Code cell 44 ---
data.describe()
# --- Code cell 46 ---
# 0 to 1 scaling
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
data_scaled = pd.DataFrame(scaler.fit_transform(data))
data_scaled.columns = ['price','area']
data_scaled.head(10)
# --- Code cell 47 ---
data_scaled.describe()
# --- Code cell 50 ---
data.describe()
# --- Code cell 52 ---
# z score scaling
# 0 mean and variance /standard deviation 1
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data_scaled = pd.DataFrame(scaler.fit_transform(data))
data_scaled.columns = ['price','area']
data_scaled.head(10)
# --- Code cell 53 ---
# note that the mean is now zero and the standard deviation is 1
# min and max values are based on the data after applying the transformation
data_scaled.describe()
# --- Code cell 56 ---
data.head(10)
# --- Code cell 57 ---
sns.scatterplot(data=data, x="area", y="price").set(xlabel= "area",ylabel="price")
# --- Code cell 58 ---
#Correlation of output with numerical variables
# plotting correlation heatmap
dataplot = sns.heatmap(data[['price', 'area']].corr(), cmap="YlGnBu", annot=True)
# displaying heatmap
plt.show()
In one sentence: why is it wrong to say "correlation means causation"? Give a real-life example where two things are correlated but one doesn't cause the other.
Master every core point for exams and real work. Non-core points deepen your statistical thinking.
| Question You Want to Answer | Test to Use | Example |
|---|---|---|
| Is there a relationship between two CATEGORIES? | Chi-Square | Gender vs Product Preference |
| Is a NUMBER different across two CATEGORIES? | T-Test | Salary of Men vs Women |
| Do two NUMBERS move together? | Correlation | Height vs Weight |
| What's the typical value in my data? | Mean/Median | Average house price |
| How spread out is my data? | Std Dev/Variance | Are test scores consistent? |
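As a playful way to memorize this table, here is a tiny helper function (my own sketch, not course material) that maps the two variable types to a test:

```python
def choose_test(x_type: str, y_type: str) -> str:
    """Pick a statistical test from the types of two variables.

    x_type / y_type: either "category" or "number".
    A simplified rule of thumb mirroring the table above.
    """
    types = {x_type, y_type}
    if types == {"category"}:
        return "Chi-Square test"
    if types == {"category", "number"}:
        return "T-Test (compare the number across the categories)"
    return "Correlation (do the two numbers move together?)"

print(choose_test("category", "category"))  # Chi-Square test
print(choose_test("number", "category"))    # T-Test (...)
print(choose_test("number", "number"))      # Correlation (...)
```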
For ALL these tests, remember:
Think of p-value as asking: "What's the chance this happened by pure luck?"
If less than 5% chance โ It's probably NOT luck!
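You can see what "pure luck" means with a quick simulation (a sketch, not course code): draw two groups from the SAME distribution, so no real difference exists, and watch how often the t-test still reports p < 0.05. It happens about 5% of the time, which is exactly the false-alarm rate the 0.05 threshold accepts.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
false_alarms = 0
n_experiments = 2000

for _ in range(n_experiments):
    # Both groups come from the SAME distribution: no real difference exists
    group_a = rng.normal(loc=75, scale=10, size=50)
    group_b = rng.normal(loc=75, scale=10, size=50)
    _, p_value = stats.ttest_ind(group_a, group_b)
    if p_value < 0.05:
        false_alarms += 1

print(f"p < 0.05 in {false_alarms/n_experiments:.1%} of experiments")  # ~5%, pure luck
```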
Green zone = significant (p < 0.05). Red zone = not significant. The line shows "where you are" on the scale.