Session 10

🐼 Pandas Part I

Data manipulation with Series and DataFrames

📚 8 Topics ⏱️ 55 min read 🎯 Intermediate Level

🗺️ What You'll Learn

📊 What is Pandas?

📈 Series (1D data)

📋 DataFrames (2D data)

📁 Reading CSV Files

🔍 Data Exploration

🔗 Merging Data

📘 Same topic in the course notebook

The Session_10 Pandas Part I notebook covers the same ideas: Series, DataFrame, read_csv, indexing, and merge. Run it alongside this page.

📊

Pandas = Excel on Steroids!

    Excel Spreadsheet              Pandas DataFrame
    ─────────────────              ─────────────────
    
    │  A  │  B  │  C  │            │ Name │ Age │ City │
    │─────│─────│─────│            │──────│─────│──────│
    │ Tom │  25 │ NYC │     →      │ Tom  │  25 │ NYC  │
    │ Ann │  30 │  LA │            │ Ann  │  30 │  LA  │
    
    Same concept, but with Python superpowers! 🦸

Pandas lets you work with tabular data using Python code!

🐼 What is Pandas?

10.1

📊 The Data Analysis Library

Pandas is one of Python's most widely used libraries for data manipulation and analysis. It provides two main data structures:

Series: 1-dimensional labeled array (like a column)
DataFrame: 2-dimensional labeled table (like a spreadsheet)

Python From Source

# Import pandas from Session 10
import pandas as pd
import numpy as np

print("Pandas version:", pd.__version__)
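To make the Series/DataFrame relationship concrete, here is a small sketch that reuses the Tom/Ann table from the diagram above. The variable names are illustrative only. It shows that pulling one column out of a DataFrame gives you a Series that shares the DataFrame's row labels.

```python
import pandas as pd

# Two-row table taken from the Excel vs. DataFrame picture above
df = pd.DataFrame({"Name": ["Tom", "Ann"], "Age": [25, 30]})

# Selecting a single column returns a Series (1D)
age = df["Age"]

print(type(df).__name__)            # DataFrame
print(type(age).__name__)           # Series
print(age.index.equals(df.index))   # True: same row labels
```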

📈 Pandas Series

10.2

📊 1-Dimensional Data

A Series is like a single column of data with labels (index) for each value.

Python From Source

# Creating Series from Session 10
import pandas as pd

# From a list
s1 = pd.Series([10, 20, 30, 40, 50])
print("Series from list:")
print(s1)
print()

# With custom index
s2 = pd.Series([100, 200, 300], index=['a', 'b', 'c'])
print("Series with custom index:")
print(s2)
print()

# From a dictionary
ages = {'Alice': 25, 'Bob': 30, 'Charlie': 35}
s3 = pd.Series(ages)
print("Series from dict:")
print(s3)

# Accessing elements
print("\nAlice's age:", s3['Alice'])
print("First two:")
print(s3[:2])

Output

Series from list:
0    10
1    20
2    30
3    40
4    50
dtype: int64

Series with custom index:
a    100
b    200
c    300
dtype: int64

Series from dict:
Alice      25
Bob        30
Charlie    35
dtype: int64

Alice's age: 25
First two:
Alice    25
Bob      30
dtype: int64
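Accessing elements is where labels and positions start to matter. The sketch below is not from the notebook; it assumes the same ages data and shows explicit label access (.loc), explicit position access (.iloc), and a boolean filter.

```python
import pandas as pd

ages = pd.Series({"Alice": 25, "Bob": 30, "Charlie": 35})

# Label-based access
by_label = ages.loc["Bob"]

# Position-based access (0 is the first element)
by_position = ages.iloc[0]

# Label slices include BOTH endpoints; position slices exclude the end
label_slice = ages.loc["Alice":"Bob"]
position_slice = ages.iloc[0:2]

# Boolean filter: keep only values greater than 28
older = ages[ages > 28]

print(by_label, by_position)
print(older)
```

Both slices return Alice and Bob here, but for different reasons: the label slice stops at "Bob" inclusively, while the position slice stops before position 2.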

📋 Pandas DataFrames

10.3

📊 2-Dimensional Tables

A DataFrame is a 2D table with rows and columns - like a spreadsheet or SQL table!

Python From Source

# Creating DataFrames from Session 10
import pandas as pd

# From a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 28],
    'City': ['NYC', 'LA', 'Chicago', 'Houston'],
    'Salary': [50000, 60000, 70000, 55000]
}
df = pd.DataFrame(data)
print("DataFrame:")
print(df)

# DataFrame info
print("\nShape:", df.shape)  # (rows, columns)
print("Columns:", list(df.columns))
print("Index:", list(df.index))
print("Data types:")
print(df.dtypes)

Output

DataFrame:
      Name  Age     City  Salary
0    Alice   25      NYC   50000
1      Bob   30       LA   60000
2  Charlie   35  Chicago   70000
3    David   28  Houston   55000

Shape: (4, 4)
Columns: ['Name', 'Age', 'City', 'Salary']
Index: [0, 1, 2, 3]
Data types:
Name      object
Age        int64
City      object
Salary     int64
dtype: object
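A dictionary of columns is only one way to build a DataFrame. As a rough sketch (the sample records and the index labels "a" and "b" are made up for illustration), here are two other common inputs: a list of dictionaries, one per row, and a list of rows with explicit column names.

```python
import pandas as pd

# One dict per row; a key that is absent becomes a missing value
records = [
    {"Name": "Alice", "Age": 25, "City": "NYC"},
    {"Name": "Bob", "Age": 30},  # no City given
]
df_records = pd.DataFrame(records)

# One list per row, with column names and a custom row index
rows = [["Alice", 25], ["Bob", 30]]
df_rows = pd.DataFrame(rows, columns=["Name", "Age"], index=["a", "b"])

print(df_records)
print(df_rows)
```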

📁 Reading Data Files

10.4

📂 Loading CSV Files

Python From Source

# Reading CSV from Session 10
import pandas as pd

# Read CSV file
# df = pd.read_csv('data.csv')

# Common parameters
# df = pd.read_csv('data.csv', sep=',')         # Specify delimiter
# df = pd.read_csv('data.csv', header=0)        # Row to use as header
# df = pd.read_csv('data.csv', index_col='id')  # Column to use as index
# df = pd.read_csv('data.csv', usecols=['name', 'age'])  # Select columns

# Create sample data for demonstration
df = pd.DataFrame({
    'product': ['Apple', 'Banana', 'Orange', 'Mango'],
    'price': [1.50, 0.50, 0.80, 2.00],
    'quantity': [100, 150, 200, 80]
})

print("Sample DataFrame:")
print(df)

Output

Sample DataFrame:
  product  price  quantity
0   Apple   1.50       100
1  Banana   0.50       150
2  Orange   0.80       200
3   Mango   2.00        80
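The read_csv calls above are commented out because there is no data.csv on this page. To try the parameters without a file, one option is to feed read_csv an in-memory text buffer. This is a sketch; the CSV content below is invented for the demonstration.

```python
import io
import pandas as pd

# Stand-in for the contents of a CSV file (hypothetical data)
csv_text = """id,name,age,city
1,Alice,25,NYC
2,Bob,30,LA
3,Charlie,35,Chicago
"""

# Default: first row is the header, a 0..n-1 index is generated
df_all = pd.read_csv(io.StringIO(csv_text))

# index_col: use the 'id' column as the row index
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col="id")

# usecols: load only the columns you need
df_subset = pd.read_csv(io.StringIO(csv_text), usecols=["name", "age"])

print(df_all.shape)
print(df_indexed.loc[2, "name"])
print(list(df_subset.columns))
```

With a real file, replace io.StringIO(csv_text) with the file path.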

🔍 Data Exploration

10.5

👀 First Look at Your Data

Python From Source

# Data exploration from Session 10
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 28, 32],
    'Salary': [50000, 60000, 70000, 55000, 65000],
    'Department': ['HR', 'IT', 'IT', 'HR', 'Finance']
})

# head() - First n rows
print("First 3 rows:")
print(df.head(3))

# tail() - Last n rows
print("\nLast 2 rows:")
print(df.tail(2))

# info() - Summary info
print("\nDataFrame Info:")
df.info()

# describe() - Statistics for numeric columns
print("\nStatistics:")
print(df.describe())

Output

First 3 rows:
      Name  Age  Salary Department
0    Alice   25   50000         HR
1      Bob   30   60000         IT
2  Charlie   35   70000         IT

Last 2 rows:
    Name  Age  Salary Department
3  David   28   55000         HR
4    Eve   32   65000    Finance

DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Name        5 non-null      object
 1   Age         5 non-null      int64 
 2   Salary      5 non-null      int64 
 3   Department  5 non-null      object
dtypes: int64(2), object(2)

Statistics:
             Age        Salary
count   5.000000      5.000000
mean   30.000000  60000.000000
std     3.807887   7905.694150
min    25.000000  50000.000000
25%    28.000000  55000.000000
50%    30.000000  60000.000000
75%    32.000000  65000.000000
max    35.000000  70000.000000
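describe() only summarizes the numeric columns here, so a text column such as Department needs a different first look. The following sketch, which goes slightly beyond the notebook, uses the same employee data to count categories and check for missing values.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "Age": [25, 30, 35, 28, 32],
    "Salary": [50000, 60000, 70000, 55000, 65000],
    "Department": ["HR", "IT", "IT", "HR", "Finance"],
})

# How many rows fall into each department?
dept_counts = df["Department"].value_counts()

# How many missing values does each column have?
missing = df.isna().sum()

# A single statistic for a single column
mean_salary = df["Salary"].mean()

print(dept_counts)
print(missing)
print(mean_salary)
```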

🎯 Selecting Data

10.6

📍 Accessing Rows & Columns

Python From Source

# Selecting data from Session 10/11
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['NYC', 'LA', 'Chicago']
})
print("DataFrame:")
print(df)

# Select single column
print("\nName column:")
print(df['Name'])

# Select multiple columns
print("\nName and Age:")
print(df[['Name', 'Age']])

# loc - Select by label
print("\nRow 0 (loc):")
print(df.loc[0])

# iloc - Select by position
print("\nFirst 2 rows, first 2 columns (iloc):")
print(df.iloc[:2, :2])

Output

DataFrame:
      Name  Age     City
0    Alice   25      NYC
1      Bob   30       LA
2  Charlie   35  Chicago

Name column:
0      Alice
1        Bob
2    Charlie
Name: Name, dtype: object

Name and Age:
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35

Row 0 (loc):
Name    Alice
Age        25
City      NYC
Name: 0, dtype: object

First 2 rows, first 2 columns (iloc):
    Name  Age
0  Alice   25
1    Bob   30
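With the default 0, 1, 2 index, loc and iloc look interchangeable. The difference is easier to see once the row labels are not integers. This sketch assumes the same people, with their names used as the index (a choice made for illustration only).

```python
import pandas as pd

df = pd.DataFrame(
    {"Age": [25, 30, 35], "City": ["NYC", "LA", "Chicago"]},
    index=["Alice", "Bob", "Charlie"],
)

# loc looks up the row LABEL; iloc looks up the row POSITION
row_by_label = df.loc["Bob"]
row_by_position = df.iloc[1]

# loc with a row label and a column label returns a single cell
cell = df.loc["Charlie", "City"]

# loc also accepts a boolean mask for rows plus a list of columns
over_28 = df.loc[df["Age"] > 28, ["City"]]

print(row_by_label)
print(cell)
print(over_28)
```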

➕ Adding & Modifying Columns

10.7

✏️ Creating New Columns

Python From Source

# Adding columns from Session 10
import pandas as pd

df = pd.DataFrame({
    'product': ['Apple', 'Banana', 'Orange'],
    'price': [1.50, 0.50, 0.80],
    'quantity': [100, 150, 200]
})
print("Original:")
print(df)

# Add new column from calculation
df['total_value'] = df['price'] * df['quantity']
print("\nWith total_value:")
print(df)

# Add column with fixed value
df['currency'] = 'USD'
print("\nWith currency:")
print(df)

# Add column using apply() with lambda
df['discounted'] = df['price'].apply(lambda x: x * 0.9)
print("\nWith 10% discount:")
print(df)

Output

Original:
  product  price  quantity
0   Apple   1.50       100
1  Banana   0.50       150
2  Orange   0.80       200

With total_value:
  product  price  quantity  total_value
0   Apple   1.50       100        150.0
1  Banana   0.50       150         75.0
2  Orange   0.80       200        160.0

With currency:
  product  price  quantity  total_value currency
0   Apple   1.50       100        150.0      USD
1  Banana   0.50       150         75.0      USD
2  Orange   0.80       200        160.0      USD

With 10% discount:
  product  price  quantity  total_value currency  discounted
0   Apple   1.50       100        150.0      USD       1.35
1  Banana   0.50       150         75.0      USD       0.45
2  Orange   0.80       200        160.0      USD       0.72
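For simple arithmetic like the 10% discount, apply() with a lambda is not required: multiplying the column directly gives the same numbers. The sketch below compares the two and also shows assign(), which returns a new DataFrame instead of modifying the original. The names via_apply, vectorized, and with_discount are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["Apple", "Banana", "Orange"],
    "price": [1.50, 0.50, 0.80],
    "quantity": [100, 150, 200],
})

# Element-by-element with apply()
via_apply = df["price"].apply(lambda x: x * 0.9)

# Whole-column arithmetic (same result, no lambda needed)
vectorized = df["price"] * 0.9

# assign() builds a new DataFrame and leaves df untouched
with_discount = df.assign(discounted=vectorized)

print(with_discount)
print("discounted" in df.columns)  # False: the original was not modified
```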

🔗 Merging & Concatenating

10.8

📎 Combining DataFrames

Python From Source

# Merging DataFrames from Session 10
import pandas as pd

# Two DataFrames to merge
employees = pd.DataFrame({
    'emp_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie']
})

salaries = pd.DataFrame({
    'emp_id': [1, 2, 3],
    'salary': [50000, 60000, 70000]
})

print("Employees:")
print(employees)
print("\nSalaries:")
print(salaries)

# Merge on common column
merged = pd.merge(employees, salaries, on='emp_id')
print("\nMerged:")
print(merged)

# Concatenate DataFrames vertically
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

concatenated = pd.concat([df1, df2], ignore_index=True)
print("\nConcatenated:")
print(concatenated)

Output

Employees:
   emp_id     name
0       1    Alice
1       2      Bob
2       3  Charlie

Salaries:
   emp_id  salary
0       1   50000
1       2   60000
2       3   70000

Merged:
   emp_id     name  salary
0       1    Alice   50000
1       2      Bob   60000
2       3  Charlie   70000

Concatenated:
   A  B
0  1  3
1  2  4
2  5  7
3  6  8
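In the merge above every emp_id appears in both tables, so the join type does not matter. The sketch below adds one unmatched row to each side (employee 4, "Dana", and a salary for emp_id 5 are invented for illustration) to show how the how parameter changes the result.

```python
import pandas as pd

employees = pd.DataFrame({
    "emp_id": [1, 2, 3, 4],
    "name": ["Alice", "Bob", "Charlie", "Dana"],
})
salaries = pd.DataFrame({
    "emp_id": [1, 2, 3, 5],
    "salary": [50000, 60000, 70000, 80000],
})

# Default how='inner': only keys present in BOTH tables
inner = pd.merge(employees, salaries, on="emp_id")

# how='left': every employee is kept; a missing salary becomes NaN
left = pd.merge(employees, salaries, on="emp_id", how="left")

# how='outer': all keys from both sides; indicator=True adds a _merge
# column saying where each row came from
outer = pd.merge(employees, salaries, on="emp_id", how="outer", indicator=True)

print(len(inner), len(left), len(outer))  # 3 4 5
print(outer)
```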

📋 Quick Reference

Function	Description
pd.DataFrame()	Create DataFrame
pd.read_csv()	Read CSV file
df.head()	First n rows
df.tail()	Last n rows
df.info()	Summary info
df.describe()	Statistics
df.shape	Dimensions
df['col']	Select column
df.loc[]	Select by label
df.iloc[]	Select by position
pd.merge()	Join DataFrames
pd.concat()	Stack DataFrames

🚫 Common Mistakes (Pandas 1)

Chained indexing — df['a'][0] = 1 may not change the DataFrame; use df.loc[0, 'a'] = 1 to avoid a copy and ensure the change sticks.
loc vs iloc — loc uses labels (index/column names); iloc uses integer positions; mixing them causes KeyError or wrong rows.
Assuming default index — After filtering, index can have gaps; use .reset_index(drop=True) if you need 0,1,2,... for slicing by position.
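The three mistakes above can be reproduced on a tiny DataFrame. This is a minimal sketch with made-up values: it assigns through a single loc call, filters so that the index has a gap, and then resets the index.

```python
import pandas as pd

df = pd.DataFrame({"a": [10, 20, 30, 40]})

# One-step assignment with loc: row label 0, column 'a'
df.loc[0, "a"] = 1

# Filtering keeps the original row labels, so label 0 is now gone
filtered = df[df["a"] > 15]
print(list(filtered.index))        # [1, 2, 3]

# filtered.loc[0] would raise KeyError; position 0 still works
first_value = filtered["a"].iloc[0]

# reset_index(drop=True) renumbers the rows from 0 again
clean = filtered.reset_index(drop=True)
print(list(clean.index))           # [0, 1, 2]
```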

💭 Short reflection

In one sentence: when would you use loc vs iloc to select rows from a DataFrame?

✅ CORE (Must know)

DataFrame: pd.read_csv(), df.head(), df.info(), df.describe(), df.shape.
Selection: df['col'], df.loc[] (labels), df.iloc[] (integer position).
Combine: pd.merge() (joins), pd.concat() (stack).

📚 NON-CORE (Good to know)

MultiIndex; groupby basics; dtypes and conversion.
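As a small taste of two of these topics, the sketch below converts a column that arrived as text into numbers and then averages it per group. It is an illustration only; the data mirrors the employee table used earlier, with Salary deliberately stored as strings.

```python
import pandas as pd

df = pd.DataFrame({
    "Department": ["HR", "IT", "IT", "HR", "Finance"],
    "Salary": ["50000", "60000", "70000", "55000", "65000"],  # text, not numbers
})

# dtype conversion: turn the text column into integers
df["Salary"] = pd.to_numeric(df["Salary"])

# groupby basics: average salary per department
mean_by_dept = df.groupby("Department")["Salary"].mean()

print(df.dtypes)
print(mean_by_dept)
```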

Complete code from course notebook: Pandas_partt_I (1).ipynb

The code from the course notebook, cell by cell.

# --- Code cell 1 ---
# NumPy: arrays, 1D and 2D
# NumPy functions: transpose, random, reshape
# List comprehension: write concise code
# OOP: args, kwargs; real-world example (library management system)

# --- Code cell 2 ---
# NumPy: scientific calculations
# pandas: data analysis
# matplotlib: data visualization
# seaborn: built on top of matplotlib

# --- Code cell 3 ---
# pandas is a fast, powerful, flexible, easy-to-use, open-source library for data analysis
# pandas lets you read, edit, and manipulate data
# It is built on top of the Python programming language

# --- Code cell 4 ---
! pip install pandas

# --- Code cell 5 ---
import pandas as pd

# --- Code cell 6 ---
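# The name "pandas" comes from "panel data"
# Third-party library: not part of Python's standard library, so it must be installed
# 1D: Series
# 2D: DataFrame
# 3D: Panel (removed in pandas 0.25; use a MultiIndex DataFrame instead)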

# --- Code cell 7 ---
# Data is broadly divided into two types:
# 1. Structured data: spreadsheets, CSV, databases (SQL), XLS worksheets, tabular data
# 2. Unstructured data: music, video, images, and text corpora/documents

# --- Code cell 8 ---
# How many ways can you create a DataFrame? (common interview question)

# --- Code cell 9 ---
# series--A Series is a one-dimensional array-like object containing a
# sequence of values (of similar types to NumPy types) and
# an associated array of data labels, called its index.

# --- Code cell 10 ---
# creating a series through a list
data=[3,4,5,6,7]
series1=pd.Series(data)
print(series1)

# --- Code cell 11 ---
type(series1)

# --- Code cell 12 ---
# creating a series through dict
dat={"a":1,"b":2,"c":3}
series2=pd.Series(dat)
print(series2)

# --- Code cell 13 ---
# Creating a Series with a custom index
values = [100, 200, 300,500]
index1 = ['x', 'y', 'z',"g"]
series = pd.Series(values, index=index1)
print(series)

# --- Code cell 14 ---
# Creating a Series from a scalar value
scalar = 5
series = pd.Series(scalar, index=['a','b',"d"])

print(series)

# --- Code cell 15 ---
# Creating a Series with mixed data types
data = [1, 'apple', 3.5, True,"aravind"]
series = pd.Series(data)
print(series[3])

# --- Code cell 16 ---
# Creating a Series with DateTime index
data = [100, 200, 300,600]
dates = pd.date_range('2025-10-18',periods=len(data))

series = pd.Series(data, index=dates)

series

# --- Code cell 17 ---
# dataframes--2D--rows and columns

# --- Code cell 18 ---
# Creating a DataFrame from a dict
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}

df=pd.DataFrame(data)
df

# --- Code cell 19 ---
# Create a DataFrame from a list of lists
data =[
    ['Alice', 25, 'New York'],
    ['Bob', 30, 'Los Angeles'],
    ['Charlie', 35, 'Chicago'],
    ["name",22,"newyork"],
    ['Charlie', 35]  # short row: City is left as a missing value
]
columns1 = ['Name', 'Age', 'City']
df=pd.DataFrame(data,columns=columns1)
df

# --- Code cell 20 ---
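# Creating a DataFrame from a NumPy array
import numpy as np
# Note: an array that mixes numbers and text stores every value as a string
data=np.array([
    [1,"aravind",30],
    [2,"rohan",65],
    [3,"ravi",34]
])

# Column names chosen to match the data (ID, name, age)
df=pd.DataFrame(data,columns=['ID', 'Name', 'Age'])
df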

# --- Code cell 21 ---
type(df)

# --- Code cell 22 ---
# Create Series
s1 = pd.Series([1, 2, 3], name='ID')
s2 = pd.Series(['Alice', 'Bob', 'Charlie'], name='Name')
s3 = pd.Series([25, 30, 35], name='Age')
# Combine Series into a DataFrame
df = pd.concat([s1, s2, s3], axis=1)

df

# --- Code cell 23 ---
import numpy as np
# 3x3 array of random integers in [310, 550)
data=np.random.randint(310,550,size=(3,3))
data

# --- Code cell 24 ---
# Same kind of random integers, wrapped in a DataFrame with column labels
data=np.random.randint(310,550,size=(3,3))
flow=["ID","name","age"]
df=pd.DataFrame(data,columns=flow)
df

# --- Code cell 25 ---
# Random floats drawn uniformly from [310, 550)
data=np.random.uniform(310,550,size=(3,3))
flow=["ID","name","age"]
df=pd.DataFrame(data,columns=flow)
df

# --- Code cell 26 ---
dates=pd.date_range("2024-02-01",periods=6)
data={"jan":[10,20,30,11,13,14],"feb":[10,20,30,12,55,66]}
df=pd.DataFrame(data,index=dates)
df

# --- Code cell 27 ---
#  Creating a DataFrame with MultiIndex/Panel Dataframes
# Define a MultiIndex
index = pd.MultiIndex.from_tuples([
    ('A', 'first'),
    ('B', 'second'),
    ('A', 'first'),
    ('B', 'second'),
    ('C', 'Third')
], names=['Category', 'Subcategory'])
# Create DataFrame with MultiIndex
data = {
    'jan': [1, 2, 3, 4,5],'feb': [1, 2, 3, 4,5]
}
df = pd.DataFrame(data, index=index)

df

# --- Code cell 29 ---
df1 = pd.DataFrame({'name': ['rohit',  'akshay', 'aayush','yogesh'],
                    'weight': [67, 82, 53, 75]})

# --- Code cell 30 ---
df1

# --- Code cell 31 ---
df2 = pd.DataFrame({'name': ['aayush','yogesh', 'rohit',  'akshay'],
                    'height': [168, 186, 167, 178]})
df2

# --- Code cell 32 ---
#merge - join two tables

df1 = pd.DataFrame({'name': ['rohit',  'akshay', 'aayush','yogesh'],
                    'weight': [67, 82, 53, 75]})
print(df1.head())
print("\n")
df2 = pd.DataFrame({'name': ['aayush','yogesh', 'rohit',  'akshay'],
                    'height': [168, 186, 167, 178]})
print(df2.head())

# --- Code cell 34 ---
df1 = df1.merge(df2, on='name', how='inner')  # try again after adding extra rows to df1 and using how='outer'
df1.head()

# --- Code cell 35 ---
df1 = pd.DataFrame({'name': ['rohit',  'akshay', 'aayush','yogesh'],
                    'weight': [67, 82, 53, 75]})
print(df1.head())

# --- Code cell 36 ---

df2 = pd.DataFrame({'user_name': ['aayush','yogesh', 'rohit',  'akshay'],
                    'height': [168, 186, 167, 178]})
print(df2.head())

# --- Code cell 37 ---
df1 = df2.merge(df1, left_on='user_name', right_on='name')
# by default it's an inner join, how='inner'
df1.head()

# --- Code cell 38 ---
#concat
# join two dataframes along particular axis


df1 = pd.DataFrame({'name': ['rohit',  'akshay', 'aayush','yogesh'],
                    'weight': [67, 82, 53, 75]})
print(df1.head())
print("\n")
df2 = pd.DataFrame({'name': ['aayush','yogesh', 'rohit',  'akshay'],
                    'height': [168, 186, 167, 178]})
print(df2.head())

# --- Code cell 39 ---
#ignore_index=True creates new index range, default is False
# axis = 0 is default value, it joins two dataframes vertically
pd.concat([df1, df2 ], axis=0,ignore_index=True)

# --- Code cell 40 ---

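df1 = pd.DataFrame({'name': ['rohit',  'akshay', 'aayush','yogesh'],
                    'weight': [67, 82, 53, 75]})
print(df1.head())
print("\n")
df2 = pd.DataFrame({'user_name': ['rohit',  'akshay', 'aayush','yogesh'],
                    'height': [168, 186, 167, 178]})
print(df2.head())
# axis=1 joins side by side; ignore_index=True then relabels the columns 0..3,
# so assign descriptive names afterwards (in the actual column order)
combined = pd.concat([df1, df2 ], axis=1,ignore_index=True)
combined.columns = ["name","weight","user_name","height"]
combined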

# --- Code cell 41 ---
# Loading external data from Kaggle and working through some real use cases