Session 11
Pandas Part II
Advanced data analysis and manipulation
8 Topics
55 min read
Advanced Level
What You'll Learn
Filtering Data
GroupBy Operations
Sorting & Ranking
Aggregations
Apply & Lambda
Handling Missing Data
Same topic in the course notebook
Session_11 Pandas Part II covers loc, iloc, filtering, describe(), and groupby, the same topics as here. Use the notebook to practice.
Filtering Data
11.1
Boolean Indexing
Python
# Filtering from Session 11
import pandas as pd
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Age': [25, 30, 35, 28, 32],
'Salary': [50000, 60000, 70000, 55000, 65000],
'Department': ['HR', 'IT', 'IT', 'HR', 'Finance']
})
print("Original DataFrame:")
print(df)
# Filter by single condition
print("\nAge > 28:")
print(df[df['Age'] > 28])
# Multiple conditions with & (and)
print("\nAge > 25 AND Salary > 55000:")
print(df[(df['Age'] > 25) & (df['Salary'] > 55000)])
# Multiple conditions with | (or)
print("\nDepartment is HR OR Salary > 65000:")
print(df[(df['Department'] == 'HR') | (df['Salary'] > 65000)])
# Using isin() for multiple values
print("\nDepartment in ['HR', 'Finance']:")
print(df[df['Department'].isin(['HR', 'Finance'])])
Output
Original DataFrame:
Name Age Salary Department
0 Alice 25 50000 HR
1 Bob 30 60000 IT
2 Charlie 35 70000 IT
3 David 28 55000 HR
4 Eve 32 65000 Finance
Age > 28:
Name Age Salary Department
1 Bob 30 60000 IT
2 Charlie 35 70000 IT
4 Eve 32 65000 Finance
Age > 25 AND Salary > 55000:
Name Age Salary Department
1 Bob 30 60000 IT
2 Charlie 35 70000 IT
4 Eve 32 65000 Finance
Department is HR OR Salary > 65000:
Name Age Salary Department
0 Alice 25 50000 HR
2 Charlie 35 70000 IT
3 David 28 55000 HR
Department in ['HR', 'Finance']:
Name Age Salary Department
0 Alice 25 50000 HR
3 David 28 55000 HR
4 Eve 32 65000 Finance
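One common filtering idiom the examples above don't show is negation with `~`. A minimal sketch (not from the course notebook), reusing the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 28, 32],
    'Department': ['HR', 'IT', 'IT', 'HR', 'Finance']
})

# ~ inverts a boolean mask: rows whose Department is NOT in the list
not_hr_fin = df[~df['Department'].isin(['HR', 'Finance'])]
print(not_hr_fin['Name'].tolist())  # ['Bob', 'Charlie']
```

The same `~` works in front of any boolean condition, so you never need a separate "not in" method.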
GroupBy Operations
11.2
Split-Apply-Combine
GroupBy Process
Original Data
      |
      v
+-----------+
|   SPLIT   |  Group by category
+-----------+
    |    |
    v    v
   HR    IT
    |    |
    v    v
+-----------+
|   APPLY   |  Calculate (mean, sum, count...)
+-----------+
      |
      v
+-----------+
|  COMBINE  |  Merge results
+-----------+
Python
# GroupBy from Session 11
import pandas as pd
df = pd.DataFrame({
'Department': ['HR', 'IT', 'IT', 'HR', 'Finance'],
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Salary': [50000, 60000, 70000, 55000, 65000]
})
# Group by Department and calculate mean salary
print("Mean salary by Department:")
print(df.groupby('Department')['Salary'].mean())
# Multiple aggregations
print("\nMultiple aggregations:")
print(df.groupby('Department')['Salary'].agg(['mean', 'sum', 'count', 'min', 'max']))
# Count employees per department
print("\nEmployee count by Department:")
print(df.groupby('Department').size())
Output
Mean salary by Department:
Department
Finance 65000.0
HR 52500.0
IT 65000.0
Name: Salary, dtype: float64
Multiple aggregations:
mean sum count min max
Department
Finance 65000.0 65000 1 65000 65000
HR 52500.0 105000 2 50000 55000
IT 65000.0 130000 2 60000 70000
Employee count by Department:
Department
Finance 1
HR 2
IT 2
dtype: int64
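The list passed to `agg()` above keeps the function names as column labels. Named aggregation lets you choose the output column names yourself; a minimal sketch (not from the course notebook) with the same data:

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['HR', 'IT', 'IT', 'HR', 'Finance'],
    'Salary': [50000, 60000, 70000, 55000, 65000]
})

# Named aggregation: output columns get the names you choose
summary = df.groupby('Department').agg(
    avg_salary=('Salary', 'mean'),
    headcount=('Salary', 'size'),
)
print(summary)
```

Each keyword argument pairs a source column with an aggregation function, which keeps multi-statistic results readable.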
Sorting & Ranking
11.3
Ordering Data
Python
# Sorting from Session 11
import pandas as pd
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 35, 28],
'Salary': [50000, 60000, 70000, 55000]
})
# Sort by single column
print("Sorted by Age (ascending):")
print(df.sort_values('Age'))
print("\nSorted by Salary (descending):")
print(df.sort_values('Salary', ascending=False))
# nlargest and nsmallest
print("\nTop 2 highest salaries:")
print(df.nlargest(2, 'Salary'))
print("\n2 youngest employees:")
print(df.nsmallest(2, 'Age'))
Output
Sorted by Age (ascending):
Name Age Salary
0 Alice 25 50000
3 David 28 55000
1 Bob 30 60000
2 Charlie 35 70000
Sorted by Salary (descending):
Name Age Salary
2 Charlie 35 70000
1 Bob 30 60000
3 David 28 55000
0 Alice 25 50000
Top 2 highest salaries:
Name Age Salary
2 Charlie 35 70000
1 Bob 30 60000
2 youngest employees:
Name Age Salary
0 Alice 25 50000
3 David 28 55000
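The section title mentions ranking, which the examples above don't cover. A minimal sketch of `rank()` (not from the course notebook), on the same data:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Salary': [50000, 60000, 70000, 55000]
})

# rank() gives 1 to the smallest value by default; ascending=False ranks highest first
df['SalaryRank'] = df['Salary'].rank(ascending=False).astype(int)
print(df.sort_values('SalaryRank'))
```

Unlike sorting, `rank()` leaves rows in place and adds a position column, which is useful when you need the order but also the original layout.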
Apply & Lambda
11.4
Custom Transformations
Python
# Apply and lambda from Session 11
import pandas as pd
df = pd.DataFrame({
'Name': ['alice', 'bob', 'charlie'],
'Price': [100, 200, 150],
'Quantity': [5, 3, 4]
})
print("Original:")
print(df)
# Apply lambda to single column
df['Name'] = df['Name'].apply(lambda x: x.title())
print("\nCapitalized names:")
print(df)
# Apply lambda to create new column
df['Total'] = df.apply(lambda row: row['Price'] * row['Quantity'], axis=1)
print("\nWith Total:")
print(df)
# Apply with custom function
def categorize_price(price):
    if price < 150:
        return 'Budget'
    elif price < 200:
        return 'Standard'
    else:
        return 'Premium'
df['Category'] = df['Price'].apply(categorize_price)
print("\nWith Category:")
print(df)
Output
Original:
Name Price Quantity
0 alice 100 5
1 bob 200 3
2 charlie 150 4
Capitalized names:
Name Price Quantity
0 Alice 100 5
1 Bob 200 3
2 Charlie 150 4
With Total:
Name Price Quantity Total
0 Alice 100 5 500
1 Bob 200 3 600
2 Charlie 150 4 600
With Category:
Name Price Quantity Total Category
0 Alice 100 5 500 Budget
1 Bob 200 3 600 Premium
2 Charlie 150 4 600 Standard
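Worth knowing alongside `apply()`: row-wise lambdas run a Python call per row, while vectorized operations work on whole columns at once. A minimal sketch (not from the course notebook) reproducing the Total and Category columns without `apply()`:

```python
import pandas as pd

df = pd.DataFrame({'Price': [100, 200, 150], 'Quantity': [5, 3, 4]})

# Vectorized arithmetic: same result as apply(..., axis=1), but much faster
df['Total'] = df['Price'] * df['Quantity']

# pd.cut replaces the row-wise categorize function with explicit numeric bins
df['Category'] = pd.cut(df['Price'], bins=[0, 149, 199, float('inf')],
                        labels=['Budget', 'Standard', 'Premium'])
print(df)
```

Reach for `apply()` when no vectorized equivalent exists; prefer column arithmetic and `pd.cut` when one does.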
Handling Missing Data
11.5
Dealing with NaN Values
Python
# Handling missing data from Session 11
import pandas as pd
import numpy as np
df = pd.DataFrame({
'A': [1, 2, np.nan, 4],
'B': [5, np.nan, np.nan, 8],
'C': ['x', 'y', 'z', None]
})
print("DataFrame with missing values:")
print(df)
# Check for missing values
print("\nMissing values:")
print(df.isnull().sum())
# Drop rows with any NaN
print("\nDrop rows with NaN:")
print(df.dropna())
# Fill NaN with a value
print("\nFill NaN with 0:")
print(df.fillna(0))
# Fill with column mean (numeric only)
df_numeric = df[['A', 'B']].fillna(df[['A', 'B']].mean())
print("\nFill with column mean:")
print(df_numeric)
Output
DataFrame with missing values:
A B C
0 1.0 5.0 x
1 2.0 NaN y
2 NaN NaN z
3 4.0 8.0 None
Missing values:
A 1
B 2
C 1
dtype: int64
Drop rows with NaN:
A B C
0 1.0 5.0 x
Fill NaN with 0:
A B C
0 1.0 5.0 x
1 2.0 0.0 y
2 0.0 0.0 z
3 4.0 8.0 0
Fill with column mean:
A B
0 1.000000 5.000000
1 2.000000 6.500000
2 2.333333 6.500000
3 4.000000 8.000000
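Besides a constant or the column mean, pandas can fill gaps from neighboring values. A minimal sketch (not from the course notebook) of directional fills:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, np.nan, 4.0])

# ffill() carries the last valid value forward; bfill() fills from the next valid value
print(s.ffill().tolist())  # [1.0, 1.0, 1.0, 4.0]
print(s.bfill().tolist())  # [1.0, 4.0, 4.0, 4.0]
```

Directional fills suit ordered data such as time series, where a neighboring observation is a better guess than a global mean.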
Value Counts & Unique
11.6
Counting Values
Python
# Value counts from Session 11
import pandas as pd
df = pd.DataFrame({
'Department': ['HR', 'IT', 'IT', 'HR', 'Finance', 'IT', 'HR'],
'Level': ['Junior', 'Senior', 'Junior', 'Senior', 'Senior', 'Mid', 'Mid']
})
# Count occurrences
print("Department counts:")
print(df['Department'].value_counts())
# Get unique values
print("\nUnique departments:", df['Department'].unique())
print("Number of unique:", df['Department'].nunique())
# Value counts as percentage
print("\nLevel distribution (%):")
print(df['Level'].value_counts(normalize=True) * 100)
Output
Department counts:
IT         3
HR         3
Finance    1
Name: Department, dtype: int64

Unique departments: ['HR' 'IT' 'Finance']
Number of unique: 3

Level distribution (%):
Senior    42.857143
Junior    28.571429
Mid       28.571429
Name: Level, dtype: float64
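One detail the example above doesn't show: `value_counts()` ignores missing values by default. A minimal sketch (not from the course notebook):

```python
import pandas as pd

s = pd.Series(['HR', 'IT', None, 'IT'])

# value_counts() skips NaN/None by default; dropna=False counts it as well
counts = s.value_counts(dropna=False)
print(counts)
```

Passing `dropna=False` is an easy way to spot how many missing entries a categorical column has while counting everything else.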
Quick Reference
| Function | Description |
|---|---|
| df[df['col'] > x] | Filter rows |
| df.groupby('col') | Group by column |
| .agg(['mean','sum']) | Multiple aggregations |
| df.sort_values() | Sort data |
| df.nlargest() | Top n values |
| df['col'].apply() | Apply function |
| df.isnull() | Check for NaN |
| df.fillna() | Fill NaN values |
| df.dropna() | Drop NaN rows |
| .value_counts() | Count values |
| .unique() | Get unique values |
Common Mistakes (Pandas 2)
- Dropping too much with dropna(): dropping any row with a NaN can remove most of the data; use thresh, subset, or fillna() when appropriate.
- SettingWithCopyWarning: when you slice then assign, pandas may warn; use .loc[] for the assignment or .copy() to make intent clear.
- Groupby without aggregation: groupby() returns a GroupBy object; you need .sum(), .mean(), etc. to get a result.
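The SettingWithCopyWarning point can be made concrete; a minimal sketch (not from the course notebook) of the safe pattern:

```python
import pandas as pd

df = pd.DataFrame({'Age': [25, 30, 35], 'Dept': ['HR', 'IT', 'IT']})

# Chained indexing like df[df['Dept'] == 'IT']['Age'] = 0 may warn and silently
# modify a copy; a single .loc call makes the assignment unambiguous
df.loc[df['Dept'] == 'IT', 'Age'] = 0
print(df['Age'].tolist())  # [25, 0, 0]
```

The rule of thumb: one `.loc[rows, cols] = value` call per assignment, never a filter followed by a second bracket.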
Short reflection
In one sentence: why should you decide whether to use fillna() or dropna() based on the problem and amount of missing data?
CORE (Must know)
- Filtering, sort_values(), nlargest(); apply() on columns.
- Missing data: isnull(), fillna(), dropna(); choose by context.
- value_counts(), unique(); groupby and aggregation.
NON-CORE (Good to know)
- Pivot tables; merge strategies; chained indexing pitfalls.
Complete code from course notebook: Copy_of_pandas_Part_II (1).ipynb
Every line of code from the course notebook (verbatim).
# --- Code cell 2 ---
from google.colab import drive
drive.mount('/content/drive')
# --- Code cell 3 ---
import pandas as pd
import numpy as np
#Creating empty pandas series
data = pd.Series()
print(data)
# simple array
array = np.array([1,5,7,9,12,45,32])
data = pd.Series(array)
print(data)
# --- Code cell 4 ---
data[3] # access by index
# --- Code cell 6 ---
data.index
# --- Code cell 7 ---
array = np.array([1,5,7,9,12,45,32])
columns1=["value"]
df = pd.DataFrame(array,columns=columns1)
df
# --- Code cell 8 ---
df.index
# --- Code cell 9 ---
df.columns
# --- Code cell 11 ---
df.columns
# --- Code cell 12 ---
#Dictionary to dataframe
import pandas as pd
# Define a dictionary containing employee data
data = {'Name':['Rohit', 'Yogesh', 'Aayush'],
'Age':[27, 24, 22],
'Address':['Delhi', 'Kanpur', 'Pune'],
'Qualification':['Msc', 'MA', 'MCA']}
# Convert the dictionary into DataFrame
df = pd.DataFrame(data)
df.head()
# --- Code cell 13 ---
# EDA (Exploratory Data Analysis) --- preprocessing steps for ML models
# --- Code cell 15 ---
import pandas as pd
data = pd.read_csv("/content/Housing (1).csv") # specify complete path of file if at other location
data
# --- Code cell 16 ---
# how many rows and columns are there
data.shape
# --- Code cell 17 ---
data.head(15) # display the top rows (5 by default)
# --- Code cell 18 ---
data.tail(10) # display the bottom rows (5 by default)
# --- Code cell 19 ---
data.sample(4)
# --- Code cell 20 ---
len(data)
# --- Code cell 21 ---
# dataframe information
data.info()
# --- Code cell 22 ---
data.describe() # gives you descriptive stats
# --- Code cell 23 ---
data.describe(include="all") # gives you descriptive stats
# --- Code cell 26 ---
# .loc - name based filtering
# df.loc[start_row_num:stop_row_num:step_row_num, start_col_name:stop_col_name:step_col_num]
# --- Code cell 27 ---
data.tail(10)
# --- Code cell 28 ---
# 35-45rows,area bedrooms bathrooms stories
# --- Code cell 29 ---
data.columns
# --- Code cell 30 ---
data.loc[45:35:-1,'area': 'airconditioning']
# --- Code cell 31 ---
data.loc[65:30:-1,"prefarea":"basement":-2]
# --- Code cell 32 ---
data.loc[50:75:1,["stories","area"]]
# --- Code cell 33 ---
data.loc[0]
# --- Code cell 34 ---
data.loc[1:4]
# --- Code cell 35 ---
data.loc[100:123,["bathrooms","stories","mainroad"]]
# --- Code cell 36 ---
data.loc[50:76,"guestroom"]
# --- Code cell 38 ---
# retrieving row by loc method
data.loc[5]
# --- Code cell 39 ---
data.loc[4,['area',"price"]]
# --- Code cell 41 ---
# Syntax - df.iloc[start_row_Number:stop_row_number, start_col_number:stop_col_number]
# --- Code cell 42 ---
data.head()
# --- Code cell 43 ---
data.loc[45:35:-1,'area': 'airconditioning']
# --- Code cell 44 ---
data.columns
# --- Code cell 45 ---
list(enumerate(data.columns))
# --- Code cell 46 ---
df3=data.iloc[45:35:-1,1:10]
# --- Code cell 47 ---
df3.to_csv("client.csv")
# --- Code cell 48 ---
data.head(30)
# --- Code cell 49 ---
df=pd.read_csv("/content/Housing.csv")
df
# --- Code cell 50 ---
columns=["area","bedrooms","bathrooms","price"]
# --- Code cell 51 ---
df.columns=columns
# --- Code cell 52 ---
data.iloc[21:30:1,11:7:-1]
# --- Code cell 53 ---
data.iloc[66:78,1:4]
# --- Code cell 54 ---
data.iloc[5]
# --- Code cell 55 ---
data.loc[222]
# --- Code cell 56 ---
data["area"][340]
# --- Code cell 57 ---
#select multiple columns
data[['price', 'area','bedrooms']]
# --- Code cell 58 ---
data['price'][10]
# --- Code cell 59 ---
data['price']
# --- Code cell 60 ---
data.head()
# --- Code cell 61 ---
data["furnishingstatus"].unique()
# --- Code cell 62 ---
data["furnishingstatus"].nunique() # count of unique elem
# --- Code cell 63 ---
data["furnishingstatus"].value_counts()
# --- Code cell 64 ---
data.furnishingstatus.value_counts()
# --- Code cell 65 ---
data.furnishingstatus.value_counts(normalize=True).mul(100).round(3)
# --- Code cell 66 ---
len(data)
# --- Code cell 67 ---
#Filter data frame based on column value
data_furnished = data[data['furnishingstatus'] == 'furnished']
#data_furnished.to_csv("furnished_rec.csv")
# --- Code cell 68 ---
data_furnished
# --- Code cell 69 ---
# Filter dataframe on multiple columns - and condition
data_furnished = data[(data['furnishingstatus'] == 'furnished') & (data['parking'] ==1)]
data_furnished.shape
# --- Code cell 70 ---
# Filter dataframe on multiple columns - or condition
data_furnished = data[(data['furnishingstatus'] == 'furnished') | (data['mainroad'] == 'yes')]
data_furnished.head(10)
# --- Code cell 72 ---
data['furnishingstatus'].value_counts()
# --- Code cell 73 ---
data['parking'].value_counts()
# --- Code cell 75 ---
# correlation analysis
# price vs other feature
data.corr(numeric_only=True)
# --- Code cell 76 ---
data['furnishingstatus'].unique()
# --- Code cell 77 ---
# check length of dataframe
len(data)
# --- Code cell 78 ---
# top 100 by area
data.nlargest(100, "area")[["price","area","bedrooms","bathrooms","stories","parking"]]
# --- Code cell 79 ---
data.sort_values(by="area", ascending=True)[["price","area","bedrooms","bathrooms","stories","parking"]].head(10)
# --- Code cell 80 ---
data.nsmallest(10, "area")[["price","area","bedrooms","bathrooms","stories","parking"]]
# --- Code cell 82 ---
# GroupBy - used to group rows in a dataset and apply aggregate functions
# Syntax - df.groupby(col_to_group_by).agg({"column_name": "function_to_be_performed"})
# agg can take a dictionary
# dict - key:value pair, where key is the column,
# value is the function to apply to that column
# --- Code cell 83 ---
# avg price by no of bedrooms
# --- Code cell 84 ---
data.groupby("bedrooms")["price"].mean()
# --- Code cell 85 ---
# price distribution by stories
data.groupby("stories")["price"].agg(["count","mean","min","max"])
# --- Code cell 86 ---
data.groupby('bedrooms').agg({"price":"mean","area":"mean"})
# --- Code cell 87 ---
data.groupby(by = 'bedrooms').agg({"price":"mean"})
# --- Code cell 88 ---
data
# --- Code cell 89 ---
data.groupby('bedrooms').agg({"price":"mean","area":"median"})
# --- Code cell 90 ---
data.groupby(by = 'bedrooms')['price'].mean()
# --- Code cell 91 ---
data.groupby(by = ['bedrooms'])[['price','area']].mean()
# --- Code cell 92 ---
# top 10 most expensive properties
data.nlargest(10, "price")
# --- Code cell 93 ---
# cheapest properties
data.nsmallest(10, "price")
# --- Code cell 94 ---
# price vs a/c and Basement
data.groupby(["airconditioning","basement"])["price"].mean()
# --- Code cell 95 ---
# price spread by furnished status
data.groupby("furnishingstatus")["price"].agg(["mean","min","max"])
# --- Code cell 97 ---
data.groupby(by = ['bedrooms', 'bathrooms'])['price'].mean()
# --- Code cell 98 ---
data.groupby(by = ['bedrooms', 'bathrooms'])['price'].median()
# --- Code cell 99 ---
data.groupby(by = ['bedrooms', 'bathrooms'])['price'].min()
# --- Code cell 102 ---
# Default sorting is ascending
data = data.sort_values(by = 'price')
data.head(10)
# --- Code cell 103 ---
# Sorting 'price' column in ascending order and 'area' column in descending order
sorted_df = data.sort_values(by=['price', 'area'], ascending=[True, False])
#sorted_df = data.sort_values(by=['price', 'area'])
sorted_df.head(10)
# --- Code cell 104 ---
print(data['guestroom'].value_counts())
# --- Code cell 105 ---
data["price"].map(lambda x: f"{x:,.0f}") # format scientific notation as a readable number
# --- Code cell 106 ---
data['guestroom'].value_counts()
# --- Code cell 107 ---
data['guestroom']
# --- Code cell 108 ---
#lambda function
data['guestroom'] = data['guestroom'].apply(lambda x:0 if x=='no' else 1)
print(data['guestroom'].value_counts())
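The apply-based yes/no encoding in the last cell can also be written without a per-row Python lambda; a minimal sketch (not part of the notebook) using np.where:

```python
import pandas as pd
import numpy as np

s = pd.Series(['no', 'yes', 'no', 'yes'])

# np.where evaluates the whole column at once instead of calling a lambda per row
encoded = pd.Series(np.where(s == 'no', 0, 1))
print(encoded.tolist())  # [0, 1, 0, 1]
```

For large DataFrames the vectorized version is noticeably faster and reads just as clearly.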