Session 10

🐼 Pandas Part I

Data manipulation with Series and DataFrames

📚 8 Topics ⏱️ 55 min read 🎯 Intermediate Level

🗺️ What You'll Learn

📊 What is Pandas?
📈 Series (1D data)
📋 DataFrames (2D data)
📁 Reading CSV Files
🔍 Data Exploration
🔗 Merging Data

📘 Same topic in the course notebook

The Session_10 Pandas Part I notebook covers the same ideas: Series, DataFrame, read_csv, indexing, and merge. Run the notebook alongside this page.

📊

Pandas = Excel on Steroids!

    Excel Spreadsheet              Pandas DataFrame
    ─────────────────              ─────────────────
    
    │  A  │  B  │  C  │            │ Name │ Age │ City │
    │─────│─────│─────│            │──────│─────│──────│
    │ Tom │  25 │ NYC │     →      │ Tom  │  25 │ NYC  │
    │ Ann │  30 │  LA │            │ Ann  │  30 │  LA  │
    
    Same concept, but with Python superpowers! 🦸
          

Pandas lets you work with tabular data using Python code!

🐼 What is Pandas?

10.1

📊 The Data Analysis Library

Pandas is one of Python's most widely used libraries for data manipulation and analysis. It provides two main data structures:

  • Series: 1-dimensional labeled array (like a column)
  • DataFrame: 2-dimensional labeled table (like a spreadsheet)
Python From Source
# Import pandas from Session 10
import pandas as pd
import numpy as np

print("Pandas version:", pd.__version__)
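
To make the relationship between the two structures concrete, here is a minimal sketch showing that each column of a DataFrame is itself a Series sharing the DataFrame's row labels. The table and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical two-column table, used only for illustration
people = pd.DataFrame({"name": ["Tom", "Ann"], "age": [25, 30]})

age_column = people["age"]          # selecting one column returns a Series
print(type(people).__name__)        # DataFrame
print(type(age_column).__name__)    # Series

# The column carries the same row labels as the table it came from
print(age_column.index.equals(people.index))
```

Thinking of a DataFrame as a set of Series that share one index helps explain why column arithmetic lines up row by row.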

📈 Pandas Series

10.2

📊 1-Dimensional Data

A Series is like a single column of data with labels (index) for each value.

Python From Source
# Creating Series from Session 10
import pandas as pd

# From a list
s1 = pd.Series([10, 20, 30, 40, 50])
print("Series from list:")
print(s1)
print()

# With custom index
s2 = pd.Series([100, 200, 300], index=['a', 'b', 'c'])
print("Series with custom index:")
print(s2)
print()

# From a dictionary
ages = {'Alice': 25, 'Bob': 30, 'Charlie': 35}
s3 = pd.Series(ages)
print("Series from dict:")
print(s3)

# Accessing elements
print("\nAlice's age:", s3['Alice'])
print("First two:")
print(s3.iloc[:2])  # iloc makes the positional slice explicit
Output
Series from list:
0    10
1    20
2    30
3    40
4    50
dtype: int64

Series with custom index:
a    100
b    200
c    300
dtype: int64

Series from dict:
Alice      25
Bob        30
Charlie    35
dtype: int64

Alice's age: 25
First two:
Alice    25
Bob      30
dtype: int64
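
A Series behaves like a labeled array, so arithmetic lines values up by index label rather than by position. The sketch below uses made-up names and numbers (not taken from the course notebook) to show label-based access with .loc, position-based access with .iloc, and what happens when two Series do not share all labels.

```python
import pandas as pd

# Illustrative data: scores for two rounds, with partly different labels
round1 = pd.Series({"Alice": 10, "Bob": 20, "Charlie": 30})
round2 = pd.Series({"Bob": 1, "Charlie": 2, "Dana": 3})

print(round1.loc["Bob"])   # access by label
print(round1.iloc[0])      # access by position

# Addition aligns on labels; labels present in only one Series give NaN
total = round1 + round2
print(total)

# fill_value treats a missing label as 0 instead of producing NaN
total_filled = round1.add(round2, fill_value=0)
print(total_filled)
```

Note that the result covers the union of both indexes, which is why "Alice" and "Dana" appear even though each exists in only one input.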

📋 Pandas DataFrames

10.3

📊 2-Dimensional Tables

A DataFrame is a 2D table with rows and columns - like a spreadsheet or SQL table!

Python From Source
# Creating DataFrames from Session 10
import pandas as pd

# From a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 28],
    'City': ['NYC', 'LA', 'Chicago', 'Houston'],
    'Salary': [50000, 60000, 70000, 55000]
}
df = pd.DataFrame(data)
print("DataFrame:")
print(df)

# DataFrame info
print("\nShape:", df.shape)  # (rows, columns)
print("Columns:", list(df.columns))
print("Index:", list(df.index))
print("Data types:")
print(df.dtypes)
Output
DataFrame:
      Name  Age     City  Salary
0    Alice   25      NYC   50000
1      Bob   30       LA   60000
2  Charlie   35  Chicago   70000
3    David   28  Houston   55000

Shape: (4, 4)
Columns: ['Name', 'Age', 'City', 'Salary']
Index: [0, 1, 2, 3]
Data types:
Name      object
Age        int64
City      object
Salary     int64
dtype: object
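
A dictionary of columns is only one way to build a DataFrame. As a hedged sketch with illustrative values, the same kind of table can be built from a list of row dictionaries, and one column can then be promoted to the row index with set_index().

```python
import pandas as pd

# Each dict is one row; the keys become column names (illustrative values)
rows = [
    {"Name": "Alice", "Age": 25, "City": "NYC"},
    {"Name": "Bob", "Age": 30, "City": "LA"},
]
df_rows = pd.DataFrame(rows)
print(df_rows.shape)

# Use the Name column as the row labels instead of 0, 1, ...
by_name = df_rows.set_index("Name")
print(by_name)

# A single cell can now be looked up by row label and column name
print(by_name.loc["Bob", "Age"])
```

set_index() returns a new DataFrame here; df_rows keeps its default 0, 1 index.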

📁 Reading Data Files

10.4

📂 Loading CSV Files

Python From Source
# Reading CSV from Session 10
import pandas as pd

# Read CSV file
# df = pd.read_csv('data.csv')

# Common parameters
# df = pd.read_csv('data.csv', sep=',')         # Specify delimiter
# df = pd.read_csv('data.csv', header=0)        # Row to use as header
# df = pd.read_csv('data.csv', index_col='id')  # Column to use as index
# df = pd.read_csv('data.csv', usecols=['name', 'age'])  # Select columns

# Create sample data for demonstration
df = pd.DataFrame({
    'product': ['Apple', 'Banana', 'Orange', 'Mango'],
    'price': [1.50, 0.50, 0.80, 2.00],
    'quantity': [100, 150, 200, 80]
})

print("Sample DataFrame:")
print(df)
Output
Sample DataFrame:
  product  price  quantity
0   Apple   1.50       100
1  Banana   0.50       150
2  Orange   0.80       200
3   Mango   2.00        80
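
Because the read_csv calls above are commented out (they need a real file), here is a minimal runnable sketch that reads CSV text from memory with io.StringIO instead of from disk. The CSV contents and the column names id, name, and age are assumptions made up for this example.

```python
import io
import pandas as pd

# In-memory stand-in for a file such as 'data.csv' (contents are made up)
csv_text = """id,name,age
1,Alice,25
2,Bob,30
3,Charlie,35
"""

# read_csv accepts a file path or any file-like object
people = pd.read_csv(
    io.StringIO(csv_text),
    index_col="id",                  # use the id column as the row index
    usecols=["id", "name", "age"],   # load only these columns
)
print(people)

# to_csv() without a path returns the CSV text as a string
print(people.to_csv())
```

Swapping io.StringIO(csv_text) for a path string is all that should be needed to read from an actual file.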

🔍 Data Exploration

10.5

👀 First Look at Your Data

Python From Source
# Data exploration from Session 10
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 28, 32],
    'Salary': [50000, 60000, 70000, 55000, 65000],
    'Department': ['HR', 'IT', 'IT', 'HR', 'Finance']
})

# head() - First n rows
print("First 3 rows:")
print(df.head(3))

# tail() - Last n rows
print("\nLast 2 rows:")
print(df.tail(2))

# info() - Summary info
print("\nDataFrame Info:")
df.info()

# describe() - Statistics for numeric columns
print("\nStatistics:")
print(df.describe())
Output
First 3 rows:
      Name  Age  Salary Department
0    Alice   25   50000         HR
1      Bob   30   60000         IT
2  Charlie   35   70000         IT

Last 2 rows:
    Name  Age  Salary Department
3  David   28   55000         HR
4    Eve   32   65000    Finance

DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Name        5 non-null      object
 1   Age         5 non-null      int64 
 2   Salary      5 non-null      int64 
 3   Department  5 non-null      object
dtypes: int64(2), object(2)

Statistics:
             Age        Salary
count   5.000000      5.000000
mean   30.000000  60000.000000
std     3.807887   7905.694150
min    25.000000  50000.000000
25%    28.000000  55000.000000
50%    30.000000  60000.000000
75%    32.000000  65000.000000
max    35.000000  70000.000000
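
head(), info(), and describe() are usually followed by a few column-level checks. The sketch below is one possible next step, run on a small made-up table with a deliberately missing salary: it counts missing values per column, counts how often each category appears, and averages a numeric column.

```python
import numpy as np
import pandas as pd

# Small illustrative table; one Salary is deliberately missing
staff = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Salary": [50000, np.nan, 70000, 55000],
    "Department": ["HR", "IT", "IT", "HR"],
})

# Missing values per column
missing = staff.isna().sum()
print(missing)

# How often each department appears
dept_counts = staff["Department"].value_counts()
print(dept_counts)

# mean() skips missing values by default
mean_salary = staff["Salary"].mean()
print(mean_salary)
```

Checking for missing values early matters because a single NaN turns an integer column into floats, as happens to Salary here.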

🎯 Selecting Data

10.6

📍 Accessing Rows & Columns

Python From Source
# Selecting data from Session 10/11
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['NYC', 'LA', 'Chicago']
})
print("DataFrame:")
print(df)

# Select single column
print("\nName column:")
print(df['Name'])

# Select multiple columns
print("\nName and Age:")
print(df[['Name', 'Age']])

# loc - Select by label
print("\nRow 0 (loc):")
print(df.loc[0])

# iloc - Select by position
print("\nFirst 2 rows, first 2 columns (iloc):")
print(df.iloc[:2, :2])
Output
DataFrame:
      Name  Age     City
0    Alice   25      NYC
1      Bob   30       LA
2  Charlie   35  Chicago

Name column:
0      Alice
1        Bob
2    Charlie
Name: Name, dtype: object

Name and Age:
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35

Row 0 (loc):
Name    Alice
Age        25
City      NYC
Name: 0, dtype: object

First 2 rows, first 2 columns (iloc):
    Name  Age
0  Alice   25
1    Bob   30

➕ Adding & Modifying Columns

10.7

✏️ Creating New Columns

Python From Source
# Adding columns from Session 10
import pandas as pd

df = pd.DataFrame({
    'product': ['Apple', 'Banana', 'Orange'],
    'price': [1.50, 0.50, 0.80],
    'quantity': [100, 150, 200]
})
print("Original:")
print(df)

# Add new column from calculation
df['total_value'] = df['price'] * df['quantity']
print("\nWith total_value:")
print(df)

# Add column with fixed value
df['currency'] = 'USD'
print("\nWith currency:")
print(df)

# Add column using apply() with lambda
df['discounted'] = df['price'].apply(lambda x: x * 0.9)
print("\nWith 10% discount:")
print(df)
Output
Original:
  product  price  quantity
0   Apple   1.50       100
1  Banana   0.50       150
2  Orange   0.80       200

With total_value:
  product  price  quantity  total_value
0   Apple   1.50       100        150.0
1  Banana   0.50       150         75.0
2  Orange   0.80       200        160.0

With currency:
  product  price  quantity  total_value currency
0   Apple   1.50       100        150.0      USD
1  Banana   0.50       150         75.0      USD
2  Orange   0.80       200        160.0      USD

With 10% discount:
  product  price  quantity  total_value currency  discounted
0   Apple   1.50       100        150.0      USD       1.35
1  Banana   0.50       150         75.0      USD       0.45
2  Orange   0.80       200        160.0      USD       0.72
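
The apply() call above runs a Python function once per value. For simple arithmetic, the same result can be computed on the whole column at once. This sketch reuses the column names from the example above (the in_stock threshold of 120 is an arbitrary assumption) to check that both approaches agree, and shows assign(), which returns a new DataFrame instead of modifying the original.

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["Apple", "Banana", "Orange"],
    "price": [1.50, 0.50, 0.80],
    "quantity": [100, 150, 200],
})

# Same calculation two ways: element by element, and on the whole column
with_apply = df["price"].apply(lambda x: x * 0.9)
vectorized = df["price"] * 0.9
print(with_apply.equals(vectorized))

# assign() returns a copy with the new columns; df itself is unchanged
extended = df.assign(
    total_value=df["price"] * df["quantity"],
    in_stock=df["quantity"] > 120,   # arbitrary threshold for illustration
)
print(extended)
print("total_value" in df.columns)
```

Whole-column expressions are generally the more idiomatic choice for arithmetic; apply() is better kept for logic that cannot be written that way.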

🔗 Merging & Concatenating

10.8

📎 Combining DataFrames

Python From Source
# Merging DataFrames from Session 10
import pandas as pd

# Two DataFrames to merge
employees = pd.DataFrame({
    'emp_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie']
})

salaries = pd.DataFrame({
    'emp_id': [1, 2, 3],
    'salary': [50000, 60000, 70000]
})

print("Employees:")
print(employees)
print("\nSalaries:")
print(salaries)

# Merge on common column
merged = pd.merge(employees, salaries, on='emp_id')
print("\nMerged:")
print(merged)

# Concatenate DataFrames vertically
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

concatenated = pd.concat([df1, df2], ignore_index=True)
print("\nConcatenated:")
print(concatenated)
Output
Employees:
   emp_id     name
0       1    Alice
1       2      Bob
2       3  Charlie

Salaries:
   emp_id  salary
0       1   50000
1       2   60000
2       3   70000

Merged:
   emp_id     name  salary
0       1    Alice   50000
1       2      Bob   60000
2       3  Charlie   70000

Concatenated:
   A  B
0  1  3
1  2  4
2  5  7
3  6  8
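
The merge above works cleanly because every emp_id appears in both tables. When keys do not match, the how parameter decides which rows survive. The sketch below uses made-up tables in which one employee has no salary and one salary has no employee, and compares the default inner join with an outer join; indicator=True adds a _merge column showing where each row came from.

```python
import pandas as pd

employees = pd.DataFrame({"emp_id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"]})
salaries = pd.DataFrame({"emp_id": [2, 3, 4], "salary": [60000, 70000, 80000]})

# Inner join (the default): only emp_id values present in both tables
inner = pd.merge(employees, salaries, on="emp_id", how="inner")
print(inner)

# Outer join: keep every emp_id; values missing on one side become NaN
outer = pd.merge(employees, salaries, on="emp_id", how="outer", indicator=True)
print(outer)
```

The _merge column takes the values left_only, right_only, or both, which makes it a quick way to audit unmatched keys before deciding which join type to keep.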

📋 Quick Reference

Function          Description
pd.DataFrame()    Create DataFrame
pd.read_csv()     Read CSV file
df.head()         First n rows
df.tail()         Last n rows
df.info()         Summary info
df.describe()     Statistics
df.shape          Dimensions
df['col']         Select column
df.loc[]          Select by label
df.iloc[]         Select by position
pd.merge()        Join DataFrames
pd.concat()       Stack DataFrames

🚫 Common Mistakes (Pandas 1)

💭 Short reflection

In one sentence: when would you use loc vs iloc to select rows from a DataFrame?

✅ CORE (Must know)

📚 NON-CORE (Good to know)

Complete code from course notebook: Pandas_partt_I (1).ipynb

Code from the course notebook, cell by cell.

# --- Code cell 1 ---
# NumPy: 1D and 2D arrays
# NumPy functions: transpose, random, reshape
# List comprehensions: write them precisely
# OOP: *args, **kwargs, real-world example (library management system)

# --- Code cell 2 ---
# NumPy: scientific calculations
# pandas: data analysis
# matplotlib: data visualization
# seaborn: built on top of matplotlib

# --- Code cell 3 ---
# pandas is a fast, powerful, flexible, easy-to-use, open-source library for data analysis
# pandas reads, edits, and manipulates data
# It is built on top of the Python programming language

# --- Code cell 4 ---
! pip install pandas

# --- Code cell 5 ---
import pandas as pd

# --- Code cell 6 ---
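# pandas: the name derives from "panel data"
# Third-party library: not part of the Python standard library
# 1D: Series
# 2D: DataFrame
# 3D: the old Panel class has been removed from pandas; use a DataFrame with a MultiIndex instead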

# --- Code cell 7 ---
# Data is broadly divided into two types:
# 1. Structured data: spreadsheets, CSV files, SQL databases, Excel (xls) worksheets, tabular data
# 2. Unstructured data: music, video, images, and text corpora/documents

# --- Code cell 8 ---
# How many ways can you create a DataFrame? (common interview question)

# --- Code cell 9 ---
# series--A Series is a one-dimensional array-like object containing a
# sequence of values (of similar types to NumPy types) and
# an associated array of data labels, called its index.

# --- Code cell 10 ---
# creating a series through a list
data=[3,4,5,6,7]
series1=pd.Series(data)
print(series1)

# --- Code cell 11 ---
type(series1)

# --- Code cell 12 ---
# creating a series through dict
dat={"a":1,"b":2,"c":3}
series2=pd.Series(dat)
print(series2)

# --- Code cell 13 ---
# Creating a Series with a custom index
values = [100, 200, 300,500]
index1 = ['x', 'y', 'z',"g"]
series = pd.Series(values, index=index1)
print(series)

# --- Code cell 14 ---
# Creating a Series from a scalar value
scalar = 5
series = pd.Series(scalar, index=['a','b',"d"])

print(series)

# --- Code cell 15 ---
# Creating a Series with mixed data types
data = [1, 'apple', 3.5, True,"aravind"]
series = pd.Series(data)
print(series[3])

# --- Code cell 16 ---
# Creating a Series with DateTime index
data = [100, 200, 300,600]
dates = pd.date_range('2025-10-18',periods=len(data))

series = pd.Series(data, index=dates)

series

# --- Code cell 17 ---
# dataframes--2D--rows and columns

# --- Code cell 18 ---
# Creating a DataFrame from a dict
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}

df=pd.DataFrame(data)
df

# --- Code cell 19 ---
# Create a DataFrame from a list of lists
data =[
    ['Alice', 25, 'New York'],
    ['Bob', 30, 'Los Angeles'],
    ['Charlie', 35, 'Chicago'],
    ["name",22,"newyork"],
    ['Charlie', 35]  # one value short: City is left missing for this row
]
columns1 = ['Name', 'Age', 'City']
df=pd.DataFrame(data,columns=columns1)
df

# --- Code cell 20 ---
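# Creating a DataFrame from a NumPy array
import numpy as np
# A NumPy array holds a single dtype, so the numbers below are stored as strings
data=np.array([
    [1,"aravind",30],
    [2,"rohan",65],
    [3,"ravi",34]
])

# Column names chosen to match the data (ID, name, age)
df=pd.DataFrame(data,columns=["ID","Name","Age"])
df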

# --- Code cell 21 ---
type(df)

# --- Code cell 22 ---
# Create Series
s1 = pd.Series([1, 2, 3], name='ID')
s2 = pd.Series(['Alice', 'Bob', 'Charlie'], name='Name')
s3 = pd.Series([25, 30, 35], name='Age')
# Combine Series into a DataFrame
df = pd.concat([s1, s2, s3], axis=1)

df

# --- Code cell 23 ---
from numpy import random
data=np.random.randint(310,550,size=(3,3))
data

# --- Code cell 24 ---
from numpy import random
data=np.random.randint(310,550,size=(3,3))
flow=["ID","name","age"]
df=pd.DataFrame(data,columns=flow)
df

# --- Code cell 25 ---
from numpy import random
data=np.random.uniform(310,550,size=(3,3))
flow=["ID","name","age"]
df=pd.DataFrame(data,columns=flow)
df

# --- Code cell 26 ---
dates=pd.date_range("2024-02-01",periods=6)
data={"jan":[10,20,30,11,13,14],"feb":[10,20,30,12,55,66]}
df=pd.DataFrame(data,index=dates)
df

# --- Code cell 27 ---
#  Creating a DataFrame with MultiIndex/Panel Dataframes
# Define a MultiIndex
index = pd.MultiIndex.from_tuples([
    ('A', 'first'),
    ('B', 'second'),
    ('A', 'first'),
    ('B', 'second'),
    ('C', 'Third')
], names=['Category', 'Subcategory'])
# Create DataFrame with MultiIndex
data = {
    'jan': [1, 2, 3, 4,5],'feb': [1, 2, 3, 4,5]
}
df = pd.DataFrame(data, index=index)

df

# --- Code cell 29 ---
df1 = pd.DataFrame({'name': ['rohit',  'akshay', 'aayush','yogesh'],
                    'weight': [67, 82, 53, 75]})

# --- Code cell 30 ---
df1

# --- Code cell 31 ---
df2 = pd.DataFrame({'name': ['aayush','yogesh', 'rohit',  'akshay'],
                    'height': [168, 186, 167, 178]})
df2

# --- Code cell 32 ---
#merge - join two tables

df1 = pd.DataFrame({'name': ['rohit',  'akshay', 'aayush','yogesh'],
                    'weight': [67, 82, 53, 75]})
print(df1.head())
print("\n")
df2 = pd.DataFrame({'name': ['aayush','yogesh', 'rohit',  'akshay'],
                    'height': [168, 186, 167, 178]})
print(df2.head())

# --- Code cell 34 ---
# Try this again with extra names in df1 and how='outer'
df1 = df1.merge(df2, on='name', how='inner')
df1.head()

# --- Code cell 35 ---
df1 = pd.DataFrame({'name': ['rohit',  'akshay', 'aayush','yogesh'],
                    'weight': [67, 82, 53, 75]})
print(df1.head())

# --- Code cell 36 ---

df2 = pd.DataFrame({'user_name': ['aayush','yogesh', 'rohit',  'akshay'],
                    'height': [168, 186, 167, 178]})
print(df2.head())

# --- Code cell 37 ---
df1 = df2.merge(df1, left_on='user_name', right_on='name')
# By default it's an inner join (how='inner')
df1.head()

# --- Code cell 38 ---
#concat
# join two dataframes along particular axis


df1 = pd.DataFrame({'name': ['rohit',  'akshay', 'aayush','yogesh'],
                    'weight': [67, 82, 53, 75]})
print(df1.head())
print("\n")
df2 = pd.DataFrame({'name': ['aayush','yogesh', 'rohit',  'akshay'],
                    'height': [168, 186, 167, 178]})
print(df2.head())

# --- Code cell 39 ---
# ignore_index=True creates a new index range; the default is False
# axis=0 is the default; it stacks the two DataFrames vertically
pd.concat([df1, df2], axis=0, ignore_index=True)

# --- Code cell 40 ---

df1 = pd.DataFrame({'name': ['rohit',  'akshay', 'aayush','yogesh'],
                    'weight': [67, 82, 53, 75]})
print(df1.head())
print("\n")
df2 = pd.DataFrame({'user_name': ['rohit',  'akshay', 'aayush','yogesh'],
                    'height': [168, 186, 167, 178]})
print(df2.head())
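columns1=["name","height","weight1","weight2"]
# axis=1 places the DataFrames side by side; with ignore_index=True the column labels become 0, 1, 2, 3
pd.concat([df1, df2], axis=1, ignore_index=True)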

# --- Code cell 41 ---
# Loading an external dataset from Kaggle and working through some real use cases