Session 10
Pandas Part I
Data manipulation with Series and DataFrames
8 Topics
55 min read
Intermediate Level
What You'll Learn
What is Pandas?
Series (1D data)
DataFrames (2D data)
Reading CSV Files
Data Exploration
Merging Data
Same topic in the course notebook
The Session_10 Pandas Part I notebook covers Series, DataFrame, read_csv, indexing, and merge, the same ideas as this page. Run the notebook alongside.
Pandas = Excel on Steroids!
Excel Spreadsheet
| A | B | C |
|---|---|---|
| Tom | 25 | NYC |
| Ann | 30 | LA |
Pandas DataFrame
| Name | Age | City |
|---|---|---|
| Tom | 25 | NYC |
| Ann | 30 | LA |
Same concept, but with Python superpowers!
Pandas lets you work with tabular data using Python code!
What is Pandas?
10.1
The Data Analysis Library
Pandas is Python's most popular library for data manipulation and analysis. It provides two main data structures:
- Series: 1-dimensional labeled array (like a column)
- DataFrame: 2-dimensional labeled table (like a spreadsheet)
Python
From Source
# Import pandas from Session 10
import pandas as pd
import numpy as np
print("Pandas version:", pd.__version__)
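The relationship between the two structures can be sketched as follows. This is a minimal illustration on a made-up two-row table (the names and values are assumptions, not course data): selecting one column of a DataFrame gives back a Series that shares the table's row labels.

```python
import pandas as pd

# Hypothetical two-row table, used only for illustration.
df = pd.DataFrame({"Name": ["Tom", "Ann"], "Age": [25, 30]})

# Selecting a single column returns a Series.
age = df["Age"]

print(type(df).__name__)
print(type(age).__name__)

# The column keeps the same row labels (index) as the table it came from.
print(age.index.equals(df.index))
```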
Pandas Series
10.2
1-Dimensional Data
A Series is like a single column of data with labels (index) for each value.
Python
From Source
# Creating Series from Session 10
import pandas as pd
# From a list
s1 = pd.Series([10, 20, 30, 40, 50])
print("Series from list:")
print(s1)
print()
# With custom index
s2 = pd.Series([100, 200, 300], index=['a', 'b', 'c'])
print("Series with custom index:")
print(s2)
print()
# From a dictionary
ages = {'Alice': 25, 'Bob': 30, 'Charlie': 35}
s3 = pd.Series(ages)
print("Series from dict:")
print(s3)
# Accessing elements
print("\nAlice's age:", s3['Alice'])
print("First two:")
print(s3[:2])
Output
Series from list:
0 10
1 20
2 30
3 40
4 50
dtype: int64
Series with custom index:
a 100
b 200
c 300
dtype: int64
Series from dict:
Alice 25
Bob 30
Charlie 35
dtype: int64
Alice's age: 25
First two:
Alice 25
Bob 30
dtype: int64
Pandas DataFrames
10.3
2-Dimensional Tables
A DataFrame is a 2D table with rows and columns - like a spreadsheet or SQL table!
Python
From Source
# Creating DataFrames from Session 10
import pandas as pd
# From a dictionary
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 35, 28],
'City': ['NYC', 'LA', 'Chicago', 'Houston'],
'Salary': [50000, 60000, 70000, 55000]
}
df = pd.DataFrame(data)
print("DataFrame:")
print(df)
# DataFrame info
print("\nShape:", df.shape) # (rows, columns)
print("Columns:", list(df.columns))
print("Index:", list(df.index))
print("Data types:")
print(df.dtypes)
Output
DataFrame:
Name Age City Salary
0 Alice 25 NYC 50000
1 Bob 30 LA 60000
2 Charlie 35 Chicago 70000
3 David 28 Houston 55000
Shape: (4, 4)
Columns: ['Name', 'Age', 'City', 'Salary']
Index: [0, 1, 2, 3]
Data types:
Name object
Age int64
City object
Salary int64
dtype: object
Reading Data Files
10.4
Loading CSV Files
Python
From Source
# Reading CSV from Session 10
import pandas as pd
# Read CSV file
# df = pd.read_csv('data.csv')
# Common parameters
# df = pd.read_csv('data.csv', sep=',') # Specify delimiter
# df = pd.read_csv('data.csv', header=0) # Row to use as header
# df = pd.read_csv('data.csv', index_col='id') # Column to use as index
# df = pd.read_csv('data.csv', usecols=['name', 'age']) # Select columns
# Create sample data for demonstration
df = pd.DataFrame({
'product': ['Apple', 'Banana', 'Orange', 'Mango'],
'price': [1.50, 0.50, 0.80, 2.00],
'quantity': [100, 150, 200, 80]
})
print("Sample DataFrame:")
print(df)
Output
Sample DataFrame:
product price quantity
0 Apple 1.50 100
1 Banana 0.50 150
2 Orange 0.80 200
3 Mango 2.00 80
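The commented-out read_csv calls above need a real file to run. As a sketch that works without one, the example below feeds read_csv an in-memory string through io.StringIO; the CSV content and its column names (id, name, age, city) are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file such as 'data.csv'.
csv_text = """id,name,age,city
1,Alice,25,NYC
2,Bob,30,LA
3,Charlie,35,Chicago
"""

# read_csv accepts a file path or any file-like object, so StringIO works here.
# usecols keeps only the listed columns; index_col turns 'id' into the row labels.
df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["id", "name", "age"],
    index_col="id",
)

print(df)
print(df.dtypes)
```

With a real file, the StringIO object would simply be replaced by the file path.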
Data Exploration
10.5
First Look at Your Data
Python
From Source
# Data exploration from Session 10
import pandas as pd
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Age': [25, 30, 35, 28, 32],
'Salary': [50000, 60000, 70000, 55000, 65000],
'Department': ['HR', 'IT', 'IT', 'HR', 'Finance']
})
# head() - First n rows
print("First 3 rows:")
print(df.head(3))
# tail() - Last n rows
print("\nLast 2 rows:")
print(df.tail(2))
# info() - Summary info
print("\nDataFrame Info:")
df.info()
# describe() - Statistics for numeric columns
print("\nStatistics:")
print(df.describe())
Output
First 3 rows:
Name Age Salary Department
0 Alice 25 50000 HR
1 Bob 30 60000 IT
2 Charlie 35 70000 IT
Last 2 rows:
Name Age Salary Department
3 David 28 55000 HR
4 Eve 32 65000 Finance
DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 5 non-null object
1 Age 5 non-null int64
2 Salary 5 non-null int64
3 Department 5 non-null object
dtypes: int64(2), object(2)
Statistics:
Age Salary
count 5.000000 5.000000
mean 30.000000 60000.000000
std 3.807887 7905.694150
min 25.000000 50000.000000
25% 28.000000 55000.000000
50% 30.000000 60000.000000
75% 32.000000 65000.000000
max 35.000000 70000.000000
Selecting Data
10.6
Accessing Rows & Columns
Python
From Source
# Selecting data from Session 10/11
import pandas as pd
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['NYC', 'LA', 'Chicago']
})
print("DataFrame:")
print(df)
# Select single column
print("\nName column:")
print(df['Name'])
# Select multiple columns
print("\nName and Age:")
print(df[['Name', 'Age']])
# loc - Select by label
print("\nRow 0 (loc):")
print(df.loc[0])
# iloc - Select by position
print("\nFirst 2 rows, first 2 columns (iloc):")
print(df.iloc[:2, :2])
Output
DataFrame:
Name Age City
0 Alice 25 NYC
1 Bob 30 LA
2 Charlie 35 Chicago
Name column:
0 Alice
1 Bob
2 Charlie
Name: Name, dtype: object
Name and Age:
Name Age
0 Alice 25
1 Bob 30
2 Charlie 35
Row 0 (loc):
Name Alice
Age 25
City NYC
Name: 0, dtype: object
First 2 rows, first 2 columns (iloc):
Name Age
0 Alice 25
1 Bob 30
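With the default 0, 1, 2 index, labels and positions happen to coincide, which hides the difference between loc and iloc. The sketch below uses made-up string labels ('a', 'b', 'c' are an assumption for illustration) so the two can be told apart, and shows that label slices include the end label while position slices do not.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35]},
    index=["a", "b", "c"],  # hypothetical string labels
)

# The same row, reached by label and by integer position.
by_label = df.loc["b"]
by_position = df.iloc[1]
print(by_label["Name"], by_position["Name"])

# Label slices include the end label; position slices exclude the end.
label_slice = df.loc["a":"b"]
position_slice = df.iloc[0:1]
print(len(label_slice), len(position_slice))  # 2 1
```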
Adding & Modifying Columns
10.7
Creating New Columns
Python
From Source
# Adding columns from Session 10
import pandas as pd
df = pd.DataFrame({
'product': ['Apple', 'Banana', 'Orange'],
'price': [1.50, 0.50, 0.80],
'quantity': [100, 150, 200]
})
print("Original:")
print(df)
# Add new column from calculation
df['total_value'] = df['price'] * df['quantity']
print("\nWith total_value:")
print(df)
# Add column with fixed value
df['currency'] = 'USD'
print("\nWith currency:")
print(df)
# Add column using apply() with lambda
df['discounted'] = df['price'].apply(lambda x: x * 0.9)
print("\nWith 10% discount:")
print(df)
Output
Original:
product price quantity
0 Apple 1.50 100
1 Banana 0.50 150
2 Orange 0.80 200
With total_value:
product price quantity total_value
0 Apple 1.50 100 150.0
1 Banana 0.50 150 75.0
2 Orange 0.80 200 160.0
With currency:
product price quantity total_value currency
0 Apple 1.50 100 150.0 USD
1 Banana 0.50 150 75.0 USD
2 Orange 0.80 200 160.0 USD
With 10% discount:
product price quantity total_value currency discounted
0 Apple 1.50 100 150.0 USD 1.35
1 Banana 0.50 150 75.0 USD 0.45
2 Orange 0.80 200 160.0 USD 0.72
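For simple arithmetic like the discount above, apply() with a lambda is not strictly needed: the same column can be computed with a single vectorized expression, just as total_value was. A small sketch comparing the two (the column names discounted_apply and discounted_vectorized are made up for this example):

```python
import pandas as pd

df = pd.DataFrame({"price": [1.50, 0.50, 0.80]})

# Element by element, through a Python function.
df["discounted_apply"] = df["price"].apply(lambda x: x * 0.9)

# Whole column at once, the same style used for total_value.
df["discounted_vectorized"] = df["price"] * 0.9

print(df)

# Largest difference between the two columns.
max_diff = (df["discounted_apply"] - df["discounted_vectorized"]).abs().max()
print(max_diff)
```

The vectorized form is usually shorter and faster on large tables; apply() remains useful when the per-value logic cannot be written as column arithmetic.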
Merging & Concatenating
10.8
Combining DataFrames
Python
From Source
# Merging DataFrames from Session 10
import pandas as pd
# Two DataFrames to merge
employees = pd.DataFrame({
'emp_id': [1, 2, 3],
'name': ['Alice', 'Bob', 'Charlie']
})
salaries = pd.DataFrame({
'emp_id': [1, 2, 3],
'salary': [50000, 60000, 70000]
})
print("Employees:")
print(employees)
print("\nSalaries:")
print(salaries)
# Merge on common column
merged = pd.merge(employees, salaries, on='emp_id')
print("\nMerged:")
print(merged)
# Concatenate DataFrames vertically
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
concatenated = pd.concat([df1, df2], ignore_index=True)
print("\nConcatenated:")
print(concatenated)
Output
Employees:
emp_id name
0 1 Alice
1 2 Bob
2 3 Charlie
Salaries:
emp_id salary
0 1 50000
1 2 60000
2 3 70000
Merged:
emp_id name salary
0 1 Alice 50000
1 2 Bob 60000
2 3 Charlie 70000
Concatenated:
A B
0 1 3
1 2 4
2 5 7
3 6 8
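In the merge above every emp_id appears in both tables, so the join type makes no difference. The sketch below changes the salary table (emp_id 4 and its salary are made-up values) so that the keys only partly overlap, then compares the default inner join with how='outer'. The indicator=True option adds a _merge column showing where each row came from.

```python
import pandas as pd

employees = pd.DataFrame({"emp_id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"]})
# Hypothetical salary table: emp_id 1 is missing and emp_id 4 has no employee row.
salaries = pd.DataFrame({"emp_id": [2, 3, 4], "salary": [60000, 70000, 80000]})

# Default how="inner": only keys present in both tables survive.
inner = pd.merge(employees, salaries, on="emp_id")

# how="outer": keep every key from either side; gaps are filled with NaN.
outer = pd.merge(employees, salaries, on="emp_id", how="outer", indicator=True)

print(inner)
print(outer)
```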
Quick Reference
| Function | Description |
|---|---|
| pd.DataFrame() | Create DataFrame |
| pd.read_csv() | Read CSV file |
| df.head() | First n rows |
| df.tail() | Last n rows |
| df.info() | Summary info |
| df.describe() | Statistics |
| df.shape | Dimensions |
| df['col'] | Select column |
| df.loc[] | Select by label |
| df.iloc[] | Select by position |
| pd.merge() | Join DataFrames |
| pd.concat() | Stack DataFrames |
Common Mistakes (Pandas 1)
- Chained indexing: df['a'][0] = 1 may not change the DataFrame; use df.loc[0, 'a'] = 1 to avoid writing to a copy and ensure the change sticks.
- loc vs iloc: loc uses labels (index/column names); iloc uses integer positions; mixing them up causes a KeyError or selects the wrong rows.
- Assuming a default index: after filtering, the index can have gaps; use .reset_index(drop=True) if you need the labels to run 0, 1, 2, ... again.
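The first and third points can be sketched on a tiny made-up table (the column name a and its values are assumptions for illustration): assign through one .loc call, then see how filtering leaves gaps in the index until reset_index(drop=True) renumbers it.

```python
import pandas as pd

df = pd.DataFrame({"a": [10, 20, 30, 40]})

# One .loc call with row label and column name, instead of df['a'][0] = 1.
df.loc[0, "a"] = 1

# Filtering keeps the original row labels, so label 0 is now missing.
big = df[df["a"] > 15]
labels_before = list(big.index)
print(labels_before)

# reset_index(drop=True) renumbers from 0 and discards the old labels.
big = big.reset_index(drop=True)
labels_after = list(big.index)
print(labels_after)
```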
Short reflection
In one sentence: when would you use loc vs iloc to select rows from a DataFrame?
CORE (Must know)
- DataFrame: pd.read_csv(), df.head(), df.info(), df.describe(), df.shape.
- Selection: df['col'], df.loc[] (labels), df.iloc[] (integer position).
- Combine: pd.merge() (joins), pd.concat() (stack).
NON-CORE (Good to know)
- MultiIndex; groupby basics; dtypes and conversion.
Complete code from course notebook: Pandas_partt_I (1).ipynb
The code cells from the course notebook, in order.
# --- Code cell 1 ---
# NumPy recap: 1D and 2D arrays
# NumPy functions: transpose, random, reshape
# List comprehensions: write concise code
# OOP: *args, **kwargs, real-world example (library management system)
# --- Code cell 2 ---
# NumPy: scientific calculations
# pandas: data analysis
# matplotlib: data visualization
# seaborn: built on top of matplotlib
# --- Code cell 3 ---
# pandas is a fast, powerful, flexible, easy-to-use, open-source library for data analysis
# pandas lets you read, edit, and manipulate data
# It is built on top of the Python programming language
# --- Code cell 4 ---
! pip install pandas
# --- Code cell 5 ---
import pandas as pd
# --- Code cell 6 ---
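# The name pandas comes from "panel data" (and "Python data analysis")
# Third-party library: not part of the Python standard library
# 1D: Series
# 2D: DataFrame
# 3D: Panel (removed from current pandas; use a MultiIndex DataFrame instead)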
# --- Code cell 7 ---
# Data is broadly divided into two types:
# 1. Structured data: spreadsheets, CSV, databases (SQL), XLS worksheets, tabular data
# 2. Unstructured data: music, video, images, and text corpora/documents
# --- Code cell 8 ---
# How many ways can you create a DataFrame? (common interview question)
# --- Code cell 9 ---
# series--A Series is a one-dimensional array-like object containing a
# sequence of values (of similar types to NumPy types) and
# an associated array of data labels, called its index.
# --- Code cell 10 ---
# creating a series through a list
data=[3,4,5,6,7]
series1=pd.Series(data)
print(series1)
# --- Code cell 11 ---
type(series1)
# --- Code cell 12 ---
# creating a series through dict
dat={"a":1,"b":2,"c":3}
series2=pd.Series(dat)
print(series2)
# --- Code cell 13 ---
# Creating a Series with a custom index
values = [100, 200, 300,500]
index1 = ['x', 'y', 'z',"g"]
series = pd.Series(values, index=index1)
print(series)
# --- Code cell 14 ---
# Creating a Series from a scalar value
scalar = 5
series = pd.Series(scalar, index=['a','b',"d"])
print(series)
# --- Code cell 15 ---
# Creating a Series with mixed data types
data = [1, 'apple', 3.5, True,"aravind"]
series = pd.Series(data)
print(series[3])
# --- Code cell 16 ---
# Creating a Series with DateTime index
data = [100, 200, 300,600]
dates = pd.date_range('2025-10-18',periods=len(data))
series = pd.Series(data, index=dates)
series
# --- Code cell 17 ---
# dataframes--2D--rows and columns
# --- Code cell 18 ---
# creating a dict into DF
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']
}
df=pd.DataFrame(data)
df
# --- Code cell 19 ---
# Create a DataFrame from a list of lists
data =[
['Alice', 25, 'New York'],
['Bob', 30, 'Los Angeles'],
['Charlie', 35, 'Chicago'],
["name",22,"newyork"],
['Charlie', 35]
]
columns1 = ['Name', 'Age', 'City']
df=pd.DataFrame(data,columns=columns1)
df
# --- Code cell 20 ---
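# Creating a DataFrame from a NumPy array
import numpy as np
# A NumPy array holds one dtype, so mixing numbers and text stores every value as a string
data=np.array([
[1,"aravind",30],
[2,"rohan",65],
[3,"ravi",34]
])
# Column names chosen to match the data (columns1 from the previous cell was Name/Age/City)
df=pd.DataFrame(data,columns=['ID','Name','Age'])
df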
# --- Code cell 21 ---
type(df)
# --- Code cell 22 ---
# Create Series
s1 = pd.Series([1, 2, 3], name='ID')
s2 = pd.Series(['Alice', 'Bob', 'Charlie'], name='Name')
s3 = pd.Series([25, 30, 35], name='Age')
# Combine Series into a DataFrame
df = pd.concat([s1, s2, s3], axis=1)
df
# --- Code cell 23 ---
from numpy import random
data=np.random.randint(310,550,size=(3,3))
data
# --- Code cell 24 ---
from numpy import random
data=np.random.randint(310,550,size=(3,3))
flow=["ID","name","age"]
df=pd.DataFrame(data,columns=flow)
df
# --- Code cell 25 ---
from numpy import random
data=np.random.uniform(310,550,size=(3,3))
flow=["ID","name","age"]
df=pd.DataFrame(data,columns=flow)
df
# --- Code cell 26 ---
dates=pd.date_range("2024-02-01",periods=6)
data={"jan":[10,20,30,11,13,14],"feb":[10,20,30,12,55,66]}
df=pd.DataFrame(data,index=dates)
df
# --- Code cell 27 ---
# Creating a DataFrame with MultiIndex/Panel Dataframes
# Define a MultiIndex
index = pd.MultiIndex.from_tuples([
('A', 'first'),
('B', 'second'),
('A', 'first'),
('B', 'second'),
('C', 'Third')
], names=['Category', 'Subcategory'])
# Create DataFrame with MultiIndex
data = {
'jan': [1, 2, 3, 4,5],'feb': [1, 2, 3, 4,5]
}
df = pd.DataFrame(data, index=index)
df
# --- Code cell 29 ---
df1 = pd.DataFrame({'name': ['rohit', 'akshay', 'aayush','yogesh'],
'weight': [67, 82, 53, 75]})
# --- Code cell 30 ---
df1
# --- Code cell 31 ---
df2 = pd.DataFrame({'name': ['aayush','yogesh', 'rohit', 'akshay'],
'height': [168, 186, 167, 178]})
df2
# --- Code cell 32 ---
#merge - join two tables
df1 = pd.DataFrame({'name': ['rohit', 'akshay', 'aayush','yogesh'],
'weight': [67, 82, 53, 75]})
print(df1.head())
print("\n")
df2 = pd.DataFrame({'name': ['aayush','yogesh', 'rohit', 'akshay'],
'height': [168, 186, 167, 178]})
print(df2.head())
# --- Code cell 34 ---
df1 = df1.merge(df2, on='name', how ='inner') # try this again by adding more values in df 1 and how outer
df1.head()
# --- Code cell 35 ---
df1 = pd.DataFrame({'name': ['rohit', 'akshay', 'aayush','yogesh'],
'weight': [67, 82, 53, 75]})
print(df1.head())
# --- Code cell 36 ---
df2 = pd.DataFrame({'user_name': ['aayush','yogesh', 'rohit', 'akshay'],
'height': [168, 186, 167, 178]})
print(df2.head())
# --- Code cell 37 ---
df1 = df2.merge(df1, left_on='user_name', right_on='name')
# by default its an inner join, how='inner'
df1.head()
# --- Code cell 38 ---
#concat
# join two dataframes along particular axis
df1 = pd.DataFrame({'name': ['rohit', 'akshay', 'aayush','yogesh'],
'weight': [67, 82, 53, 75]})
print(df1.head())
print("\n")
df2 = pd.DataFrame({'name': ['aayush','yogesh', 'rohit', 'akshay'],
'height': [168, 186, 167, 178]})
print(df2.head())
# --- Code cell 39 ---
#ignore_index=True creates new index range, default is False
# axis = 0 is default value, it joins two dataframes vertically
pd.concat([df1, df2 ], axis=0,ignore_index=True)
# --- Code cell 40 ---
df1 = pd.DataFrame({'name': ['rohit', 'akshay', 'aayush','yogesh'],
'weight': [67, 82, 53, 75]})
print(df1.head())
print("\n")
df2 = pd.DataFrame({'user_name': ['rohit', 'akshay', 'aayush','yogesh'],
'height': [168, 186, 167, 178]})
print(df2.head())
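# With axis=1, ignore_index=True replaces the column labels with 0, 1, 2, 3
result = pd.concat([df1, df2 ], axis=1,ignore_index=True)
# Restore readable labels, in the order the columns were concatenated
result.columns = ["name","weight","user_name","height"]
result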
# --- Code cell 41 ---
# loading external data from Kaggle and working with some real use cases