Session 11

📈 Pandas Part II

Advanced data analysis and manipulation

📚 8 Topics ⏱️ 55 min read 🎯 Advanced Level

πŸ—ΊοΈ What You'll Learn

πŸ” Filtering Data
πŸ“Š GroupBy Operations
πŸ“ˆ Sorting & Ranking
πŸ”’ Aggregations
⚑ Apply & Lambda
🧹 Handling Missing Data

📘 Same topic in the course notebook

Session_11 Pandas Part II covers the same material as this page: loc, iloc, filtering, describe(), and groupby. Use the notebook to practice.
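loc and iloc are mentioned above but not demonstrated on this page; here is a minimal sketch of the difference, using illustrative data (not from the course notebook):

```python
import pandas as pd

# Illustrative data (not from the course notebook)
df = pd.DataFrame(
    {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]},
    index=['a', 'b', 'c']
)

# loc is label-based: both slice endpoints are included
print(df.loc['a':'b', 'Name'])

# iloc is position-based: the end position is excluded
print(df.iloc[0:2, 0])
```

Both calls select Alice and Bob, but note the different slice semantics: `loc['a':'b']` includes `'b'`, while `iloc[0:2]` stops before position 2.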

πŸ” Filtering Data

11.1

🎯 Boolean Indexing

Python From Source
# Filtering from Session 11
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 28, 32],
    'Salary': [50000, 60000, 70000, 55000, 65000],
    'Department': ['HR', 'IT', 'IT', 'HR', 'Finance']
})
print("Original DataFrame:")
print(df)

# Filter by single condition
print("\nAge > 28:")
print(df[df['Age'] > 28])

# Multiple conditions with & (and)
print("\nAge > 25 AND Salary > 55000:")
print(df[(df['Age'] > 25) & (df['Salary'] > 55000)])

# Multiple conditions with | (or)
print("\nDepartment is HR OR Salary > 65000:")
print(df[(df['Department'] == 'HR') | (df['Salary'] > 65000)])

# Using isin() for multiple values
print("\nDepartment in ['HR', 'Finance']:")
print(df[df['Department'].isin(['HR', 'Finance'])])
Output
Original DataFrame:
      Name  Age  Salary Department
0    Alice   25   50000         HR
1      Bob   30   60000         IT
2  Charlie   35   70000         IT
3    David   28   55000         HR
4      Eve   32   65000    Finance

Age > 28:
      Name  Age  Salary Department
1      Bob   30   60000         IT
2  Charlie   35   70000         IT
4      Eve   32   65000    Finance

Age > 25 AND Salary > 55000:
      Name  Age  Salary Department
1      Bob   30   60000         IT
2  Charlie   35   70000         IT
4      Eve   32   65000    Finance

Department is HR OR Salary > 65000:
      Name  Age  Salary Department
0    Alice   25   50000         HR
2  Charlie   35   70000         IT
3    David   28   55000         HR

Department in ['HR', 'Finance']:
    Name  Age  Salary Department
0  Alice   25   50000         HR
3  David   28   55000         HR
4    Eve   32   65000    Finance
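Besides boolean indexing, pandas also offers `DataFrame.query()`, which expresses the same filters as a readable string (not covered on this page; a brief sketch using the same data):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 28, 32],
    'Salary': [50000, 60000, 70000, 55000, 65000],
    'Department': ['HR', 'IT', 'IT', 'HR', 'Finance']
})

# Same filter as (df['Age'] > 25) & (df['Salary'] > 55000), no parentheses needed
result = df.query("Age > 25 and Salary > 55000")
print(result)
```

`query()` accepts plain `and`/`or` inside the string, which avoids the most common boolean-indexing mistakes.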

📊 GroupBy Operations

11.2

πŸ—‚οΈ Split-Apply-Combine

🔄 GroupBy Process

    Original Data
         │
         ▼
    ┌─────────┐
    │  SPLIT  │  Group by category
    └─────────┘
         │
    ┌────┴────┐
    ▼         ▼
   HR        IT
         │
         ▼
    ┌─────────┐
    │  APPLY  │  Calculate (mean, sum, count...)
    └─────────┘
         │
         ▼
    ┌─────────┐
    │ COMBINE │  Merge results
    └─────────┘
          
Python From Source
# GroupBy from Session 11
import pandas as pd

df = pd.DataFrame({
    'Department': ['HR', 'IT', 'IT', 'HR', 'Finance'],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Salary': [50000, 60000, 70000, 55000, 65000]
})

# Group by Department and calculate mean salary
print("Mean salary by Department:")
print(df.groupby('Department')['Salary'].mean())

# Multiple aggregations
print("\nMultiple aggregations:")
print(df.groupby('Department')['Salary'].agg(['mean', 'sum', 'count', 'min', 'max']))

# Count employees per department
print("\nEmployee count by Department:")
print(df.groupby('Department').size())
Output
Mean salary by Department:
Department
Finance    65000.0
HR         52500.0
IT         65000.0
Name: Salary, dtype: float64

Multiple aggregations:
              mean    sum  count    min    max
Department                                    
Finance    65000.0  65000      1  65000  65000
HR         52500.0 105000      2  50000  55000
IT         65000.0 130000      2  60000  70000

Employee count by Department:
Department
Finance    1
HR         2
IT         2
dtype: int64
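A related option (not shown above) is named aggregation, which lets you choose readable names for the result columns instead of the function names. A short sketch using the same example data:

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['HR', 'IT', 'IT', 'HR', 'Finance'],
    'Salary': [50000, 60000, 70000, 55000, 65000]
})

# Each keyword becomes a result column: name=(source_column, function)
summary = df.groupby('Department').agg(
    avg_salary=('Salary', 'mean'),
    headcount=('Salary', 'count')
)
print(summary)
```

This produces columns `avg_salary` and `headcount` rather than `mean` and `count`, which is clearer when the result feeds into further code.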

📈 Sorting & Ranking

11.3

⬆️⬇️ Ordering Data

Python From Source
# Sorting from Session 11
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 28],
    'Salary': [50000, 60000, 70000, 55000]
})

# Sort by single column
print("Sorted by Age (ascending):")
print(df.sort_values('Age'))

print("\nSorted by Salary (descending):")
print(df.sort_values('Salary', ascending=False))

# nlargest and nsmallest
print("\nTop 2 highest salaries:")
print(df.nlargest(2, 'Salary'))

print("\n2 youngest employees:")
print(df.nsmallest(2, 'Age'))
Output
Sorted by Age (ascending):
      Name  Age  Salary
0    Alice   25   50000
3    David   28   55000
1      Bob   30   60000
2  Charlie   35   70000

Sorted by Salary (descending):
      Name  Age  Salary
2  Charlie   35   70000
1      Bob   30   60000
3    David   28   55000
0    Alice   25   50000

Top 2 highest salaries:
      Name  Age  Salary
2  Charlie   35   70000
1      Bob   30   60000

2 youngest employees:
    Name  Age  Salary
0  Alice   25   50000
3  David   28   55000
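sort_values() also accepts a list of columns with a per-column sort direction, which the single-column examples above do not show. A small sketch with illustrative data:

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['HR', 'IT', 'IT', 'HR'],
    'Salary': [50000, 70000, 60000, 55000]
})

# Department ascending, then Salary descending within each department
ordered = df.sort_values(['Department', 'Salary'], ascending=[True, False])
print(ordered)
```

The `ascending` list must have one entry per sort column; here HR rows come first (55000 before 50000), then IT rows (70000 before 60000).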

⚡ Apply & Lambda

11.4

🔧 Custom Transformations

Python From Source
# Apply and lambda from Session 11
import pandas as pd

df = pd.DataFrame({
    'Name': ['alice', 'bob', 'charlie'],
    'Price': [100, 200, 150],
    'Quantity': [5, 3, 4]
})
print("Original:")
print(df)

# Apply lambda to single column
df['Name'] = df['Name'].apply(lambda x: x.title())
print("\nCapitalized names:")
print(df)

# Apply lambda to create new column
df['Total'] = df.apply(lambda row: row['Price'] * row['Quantity'], axis=1)
print("\nWith Total:")
print(df)

# Apply with custom function
def categorize_price(price):
    if price < 150:
        return 'Budget'
    elif price < 200:
        return 'Standard'
    else:
        return 'Premium'

df['Category'] = df['Price'].apply(categorize_price)
print("\nWith Category:")
print(df)
Output
Original:
      Name  Price  Quantity
0    alice    100         5
1      bob    200         3
2  charlie    150         4

Capitalized names:
      Name  Price  Quantity
0    Alice    100         5
1      Bob    200         3
2  Charlie    150         4

With Total:
      Name  Price  Quantity  Total
0    Alice    100         5    500
1      Bob    200         3    600
2  Charlie    150         4    600

With Category:
      Name  Price  Quantity  Total  Category
0    Alice    100         5    500    Budget
1      Bob    200         3    600   Premium
2  Charlie    150         4    600  Standard
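Worth noting: a row-wise `apply` with `axis=1`, as used for the Total column above, is often avoidable. Plain column arithmetic computes the same result vectorized and is usually much faster on large DataFrames. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'Price': [100, 200, 150], 'Quantity': [5, 3, 4]})

# Vectorized arithmetic: same result as df.apply(..., axis=1), no Python-level loop
df['Total'] = df['Price'] * df['Quantity']
print(df['Total'].tolist())  # [500, 600, 600]
```

Reach for `apply` when the logic genuinely needs Python code per row; prefer column operations when simple arithmetic suffices.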

🧹 Handling Missing Data

11.5

❓ Dealing with NaN Values

Python From Source
# Handling missing data from Session 11
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': ['x', 'y', 'z', None]
})
print("DataFrame with missing values:")
print(df)

# Check for missing values
print("\nMissing values:")
print(df.isnull().sum())

# Drop rows with any NaN
print("\nDrop rows with NaN:")
print(df.dropna())

# Fill NaN with a value
print("\nFill NaN with 0:")
print(df.fillna(0))

# Fill with column mean (numeric only)
df_numeric = df[['A', 'B']].fillna(df[['A', 'B']].mean())
print("\nFill with column mean:")
print(df_numeric)
Output
DataFrame with missing values:
     A    B     C
0  1.0  5.0     x
1  2.0  NaN     y
2  NaN  NaN     z
3  4.0  8.0  None

Missing values:
A    1
B    2
C    1
dtype: int64

Drop rows with NaN:
     A    B  C
0  1.0  5.0  x

Fill NaN with 0:
     A    B     C
0  1.0  5.0     x
1  2.0  0.0     y
2  0.0  0.0     z
3  4.0  8.0     0

Fill with column mean:
          A         B
0  1.000000  5.000000
1  2.000000  6.500000
2  2.333333  6.500000
3  4.000000  8.000000
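Pandas provides two further fill strategies beyond a constant or the column mean (not shown above): forward-fill, and per-column fill values via a dict. A brief sketch:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8]
})

# Forward-fill propagates the last valid value downward
print(df.ffill())

# A dict gives each column its own fill value
print(df.fillna({'A': 0, 'B': df['B'].median()}))
```

Forward-fill suits ordered data such as time series; the dict form is handy when columns need different treatments (e.g. 0 for counts, the median for skewed values).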

📊 Value Counts & Unique

11.6

🔢 Counting Values

Python From Source
# Value counts from Session 11
import pandas as pd

df = pd.DataFrame({
    'Department': ['HR', 'IT', 'IT', 'HR', 'Finance', 'IT', 'HR'],
    'Level': ['Junior', 'Senior', 'Junior', 'Senior', 'Senior', 'Mid', 'Mid']
})

# Count occurrences
print("Department counts:")
print(df['Department'].value_counts())

# Get unique values
print("\nUnique departments:", df['Department'].unique())
print("Number of unique:", df['Department'].nunique())

# Value counts as percentage
print("\nLevel distribution (%):")
print(df['Level'].value_counts(normalize=True) * 100)
Output
Department counts:
IT         3
HR         3
Finance    1
Name: Department, dtype: int64

Unique departments: ['HR' 'IT' 'Finance']
Number of unique: 3

Level distribution (%):
Senior    42.857143
Junior    28.571429
Mid       28.571429
Name: Level, dtype: float64

📋 Quick Reference

Function                 Description
df[df['col'] > x]        Filter rows
df.groupby('col')        Group by column
.agg(['mean', 'sum'])    Multiple aggregations
df.sort_values()         Sort data
df.nlargest()            Top n values
df['col'].apply()        Apply function
df.isnull()              Check for NaN
df.fillna()              Fill NaN values
df.dropna()              Drop NaN rows
.value_counts()          Count values
.unique()                Get unique values

🚫 Common Mistakes (Pandas 2)
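One pitfall worth illustrating here (an illustrative sketch, not from the course materials): using Python's `and`/`or` on Series conditions instead of the element-wise `&`/`|` operators used in the filtering section above.

```python
import pandas as pd

df = pd.DataFrame({'Age': [25, 30, 35]})

# Wrong: (df['Age'] > 26) and (df['Age'] < 34) raises
# "ValueError: The truth value of a Series is ambiguous"
# Right: element-wise & with parentheses around each condition
mask = (df['Age'] > 26) & (df['Age'] < 34)
print(df[mask])
```

The parentheses matter too: `&` binds tighter than `>`, so omitting them changes the meaning.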

💭 Short reflection

In one sentence: why should you decide whether to use fillna() or dropna() based on the problem and amount of missing data?
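As a hint, a tiny sketch of the trade-off (illustrative data): dropna() shrinks the dataset, while fillna() keeps every row but substitutes invented values.

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

# dropna discards 40% of this Series
print(len(s.dropna()))  # 3

# fillna with the mean keeps all 5 rows and leaves the mean unchanged,
# but the filled values are guesses, not observations
print(s.fillna(s.mean()).mean())  # 3.0
```

With little missing data, dropping rows costs almost nothing; with a lot, filling preserves sample size at the price of fabricated values, so the right choice depends on the problem.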


Complete code from course notebook: Copy_of_pandas_Part_II (1).ipynb

Every line of code from the course notebook (verbatim).

# --- Code cell 2 ---
from google.colab import drive
drive.mount('/content/drive')

# --- Code cell 3 ---
import pandas as pd
import numpy as np

#Creating empty pandas series

data = pd.Series()
print(data)

# simple array
array = np.array([1,5,7,9,12,45,32])
data = pd.Series(array)
print(data)

# --- Code cell 4 ---
data[3] # access by index

# --- Code cell 6 ---
data.index

# --- Code cell 7 ---
array = np.array([1,5,7,9,12,45,32])
columns1=["value"]
df = pd.DataFrame(array,columns=columns1)
df

# --- Code cell 8 ---
df.index

# --- Code cell 9 ---
df.columns

# --- Code cell 11 ---
df.columns

# --- Code cell 12 ---
#Dictionary to dataframe

import pandas as pd

# Define a dictionary containing employee data
data = {'Name':['Rohit', 'Yogesh', 'Aayush'],
        'Age':[27, 24, 22],
        'Address':['Delhi', 'Kanpur', 'Pune'],
        'Qualification':['Msc', 'MA', 'MCA']}

# Convert the dictionary into DataFrame
df = pd.DataFrame(data)
df.head()

# --- Code cell 13 ---
# EDA (Exploratory Data Analysis) --- preprocessing steps for ML models

# --- Code cell 15 ---
import pandas as pd
data = pd.read_csv("/content/Housing (1).csv")  # specify complete path of file if at other location
data

# --- Code cell 16 ---
# how many rows and columns are there
data.shape

# --- Code cell 17 ---
data.head(15) # display the top rows (5 by default)

# --- Code cell 18 ---
data.tail(10)  # display the bottom rows (5 by default)

# --- Code cell 19 ---
data.sample(4)

# --- Code cell 20 ---
len(data)

# --- Code cell 21 ---
# dataframe information
data.info()

# --- Code cell 22 ---
data.describe()  # gives descriptive stats

# --- Code cell 23 ---
data.describe(include="all")  # descriptive stats including non-numeric columns

# --- Code cell 26 ---
# .loc - name based filtering
# df.loc[start_row_num:stop_row_num:step_row_num, start_col_name:stop_col_name:step_col_num]

# --- Code cell 27 ---
data.tail(10)

# --- Code cell 28 ---
# rows 35-45, columns: area, bedrooms, bathrooms, stories

# --- Code cell 29 ---
data.columns

# --- Code cell 30 ---
data.loc[45:35:-1,'area': 'airconditioning']

# --- Code cell 31 ---
data.loc[65:30:-1,"prefarea":"basement":-2]

# --- Code cell 32 ---
data.loc[50:75:1,["stories","area"]]

# --- Code cell 33 ---
data.loc[0]

# --- Code cell 34 ---
data.loc[1:4]

# --- Code cell 35 ---
data.loc[100:123,["bathrooms","stories","mainroad"]]

# --- Code cell 36 ---
data.loc[50:76,"guestroom"]

# --- Code cell 38 ---
# retrieving row by loc method
data.loc[5]

# --- Code cell 39 ---
data.loc[4,['area',"price"]]

# --- Code cell 41 ---
# Syntax - df.iloc[start_row_Number:stop_row_number, start_col_number:stop_col_number]

# --- Code cell 42 ---
data.head()

# --- Code cell 43 ---
data.loc[45:35:-1,'area': 'airconditioning']

# --- Code cell 44 ---
data.columns

# --- Code cell 45 ---
list(enumerate(data.columns))

# --- Code cell 46 ---
df3=data.iloc[45:35:-1,1:10]

# --- Code cell 47 ---
df3.to_csv("client.csv")

# --- Code cell 48 ---
data.head(30)

# --- Code cell 49 ---
df=pd.read_csv("/content/Housing.csv")
df

# --- Code cell 50 ---
columns=["area","bedrooms","bathrooms","price"]

# --- Code cell 51 ---
df.columns=columns

# --- Code cell 52 ---
data.iloc[21:30:1,11:7:-1]

# --- Code cell 53 ---
data.iloc[66:78,1:4]

# --- Code cell 54 ---
data.iloc[5]

# --- Code cell 55 ---
data.loc[222]

# --- Code cell 56 ---
data["area"][340]

# --- Code cell 57 ---
#select multiple columns
data[['price', 'area','bedrooms']]

# --- Code cell 58 ---
data['price'][10]

# --- Code cell 59 ---
data['price']

# --- Code cell 60 ---
data.head()

# --- Code cell 61 ---
data["furnishingstatus"].unique()

# --- Code cell 62 ---
data["furnishingstatus"].nunique()  # count of unique elements

# --- Code cell 63 ---
data["furnishingstatus"].value_counts()

# --- Code cell 64 ---
data.furnishingstatus.value_counts()

# --- Code cell 65 ---
data.furnishingstatus.value_counts(normalize=True).mul(100).round(3)

# --- Code cell 66 ---
len(data)

# --- Code cell 67 ---
#Filter data frame based on column value

data_furnished = data[data['furnishingstatus'] == 'furnished']
#data_furnished.to_csv("furnished_rec.csv")

# --- Code cell 68 ---
data_furnished

# --- Code cell 69 ---
# Filter dataframe on multiple columns - and condition
data_furnished = data[(data['furnishingstatus'] == 'furnished') & (data['parking'] ==1)]
data_furnished.shape

# --- Code cell 70 ---
# Filter dataframe on multiple columns - or condition
data_furnished = data[(data['furnishingstatus'] == 'furnished') | (data['mainroad'] == 'yes')]
data_furnished.head(10)

# --- Code cell 72 ---

data['furnishingstatus'].value_counts()

# --- Code cell 73 ---
data['parking'].value_counts()

# --- Code cell 75 ---
# correlation analysis
# price vs other feature
data.corr(numeric_only=True)

# --- Code cell 76 ---

data['furnishingstatus'].unique()

# --- Code cell 77 ---
# check length of dataframe
len(data)

# --- Code cell 78 ---
# top 100 by area
data.nlargest(100, "area")[["price","area","bedrooms","bathrooms","stories","parking"]]

# --- Code cell 79 ---
data.sort_values(by="area", ascending=True)[["price","area","bedrooms","bathrooms","stories","parking"]].head(10)

# --- Code cell 80 ---
data.nsmallest(10, "area")[["price","area","bedrooms","bathrooms","stories","parking"]]

# --- Code cell 82 ---
# GroupBy - groups the rows of a dataset and applies aggregate functions
# Syntax - df.groupby(col_to_group_by).agg({"column_name": "function_to_apply"})
# agg takes a dictionary
# dict - key:value pairs, where the key is the column
# and the value is the function to apply to that column

# --- Code cell 83 ---
# avg price by no of bedrooms

# --- Code cell 84 ---
data.groupby("bedrooms")["price"].mean()

# --- Code cell 85 ---
# price distribution by stories
data.groupby("stories")["price"].agg(["count","mean","min","max"])

# --- Code cell 86 ---
data.groupby('bedrooms').agg({"price":"mean","area":"mean"})

# --- Code cell 87 ---
data.groupby(by = 'bedrooms').agg({"price":"mean"})

# --- Code cell 88 ---
data

# --- Code cell 89 ---
data.groupby('bedrooms').agg({"price":"mean","area":"median"})

# --- Code cell 90 ---
data.groupby(by = 'bedrooms')['price'].mean()

# --- Code cell 91 ---
data.groupby(by = ['bedrooms'])[['price','area']].mean()

# --- Code cell 92 ---
# top 10 most expensive properties
data.nlargest(10, "price")

# --- Code cell 93 ---
# cheapest properties
data.nsmallest(10, "price")

# --- Code cell 94 ---
# price vs a/c and Basement
data.groupby(["airconditioning","basement"])["price"].mean()

# --- Code cell 95 ---
# price spread by furnished status
data.groupby("furnishingstatus")["price"].agg(["mean","min","max"])

# --- Code cell 97 ---
data.groupby(by = ['bedrooms', 'bathrooms'])['price'].mean()

# --- Code cell 98 ---
data.groupby(by = ['bedrooms', 'bathrooms'])['price'].median()

# --- Code cell 99 ---
data.groupby(by = ['bedrooms', 'bathrooms'])['price'].min()

# --- Code cell 102 ---
# Default sorting is ascending
data = data.sort_values(by = 'price')
data.head(10)

# --- Code cell 103 ---
# Sorting 'price' column in ascending order and 'area' column in descending order
sorted_df = data.sort_values(by=['price', 'area'], ascending=[True, False])
#sorted_df = data.sort_values(by=['price', 'area'])
sorted_df.head(10)

# --- Code cell 104 ---
print(data['guestroom'].value_counts())

# --- Code cell 105 ---
data["price"].map(lambda x: f"{x:,.0f}") # format scientific (e) notation as a plain comma-separated number

# --- Code cell 106 ---
data['guestroom'].value_counts()

# --- Code cell 107 ---
data['guestroom']

# --- Code cell 108 ---
#lambda function

data['guestroom'] = data['guestroom'].apply(lambda x:0 if x=='no' else 1)
print(data['guestroom'].value_counts())