Web Scraping Using Python
Extract data from websites automatically with requests and BeautifulSoup.
🕷️ What is Web Scraping?
Web scraping is automatically getting data from websites using code. Instead of copying and pasting by hand, we write a script that visits a page, reads its HTML, and pulls out the bits we want (headlines, prices, reviews, etc.). Data scientists use it when an API is not available.
Why it's useful: Gather data for analysis; monitor prices or news; automate repetitive web tasks.
📚 Libraries You Need
- requests — Downloads the raw HTML of a webpage (like opening the page in a browser and saving the source).
- beautifulsoup4 — Parses that HTML so we can search for tags, classes, and IDs and extract text or links.
Install once: pip install requests beautifulsoup4
requests
Fetches the page. You get back a Response object; .content gives the HTML bytes.
BeautifulSoup
Turns messy HTML into a structure you can search: find by tag name, class, or id and get text or attributes.
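As a quick illustration (using a made-up HTML snippet rather than a real page), BeautifulSoup can parse a string of HTML and pull out text by tag name:

```python
from bs4 import BeautifulSoup

# A tiny, invented HTML snippet standing in for a downloaded page
html = "<html><body><h1>Top Stories</h1><p class='lead'>Hello, world.</p></body></html>"

soup = BeautifulSoup(html, "html.parser")

print(soup.find("h1").text)                # text inside the first <h1>
print(soup.find("p", class_="lead").text)  # first <p> with class="lead"
```

The same `soup.find(...)` calls work identically whether the HTML came from a string or from a downloaded page.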
🔍 Static vs Dynamic Sites
Static: All content is in the HTML when the page loads (e.g. Wikipedia, many docs sites). You can scrape with requests + BeautifulSoup.
Dynamic: Content is loaded or updated with JavaScript after the page loads (e.g. many news feeds, social media). BeautifulSoup alone cannot see that content; you need something like Selenium or Playwright, or an API.
📝 Basic HTML (Simple Idea)
Web pages are built with HTML tags. For scraping you only need a rough idea:
- <h1>…<h6> — Headings
- <p> — Paragraphs
- <a href="..."> — Links
- <div> / <span> — Containers for grouping
- <table>, <tr>, <td> — Tables
BeautifulSoup lets you find these by name, or by class="..." / id="...".
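For example (again with an invented snippet, not a real site), you can search by id, by tag name plus class, and read attributes such as href:

```python
from bs4 import BeautifulSoup

# Invented snippet: two links inside a <div id="nav">
html = """
<div id="nav">
  <a class="item" href="/home">Home</a>
  <a class="item" href="/about">About</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

nav = soup.find("div", id="nav")           # find by id
links = nav.find_all("a", class_="item")   # find by tag name + class
for link in links:
    print(link.text, "->", link["href"])   # .text for content, [...] for attributes
```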
✏️ Minimal Code (Same idea as the course notebook)
In the course source notebook (Web_Scraping_Using_Python), you'll see:
import requests
from bs4 import BeautifulSoup
url = "https://books.toscrape.com/catalogue/category/books_1/index.html"
response = requests.get(url) # Download the page
# response.status_code 200 means OK
html = response.content
soup = BeautifulSoup(html, "html.parser") # Parse HTML
# Then use soup.find("tag") or soup.find_all("tag") to get elements
In simple terms: requests.get(url) brings the page; BeautifulSoup(html, "html.parser") turns it into a searchable tree; then you use .find() or .find_all() to get the parts you need and .text or ["href"] to get text or links.
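The whole pipeline can be sketched end to end. To keep the sketch self-contained, an inline snippet shaped like one `product_pod` block from books.toscrape.com stands in for the downloaded page (on the real site you would use `requests.get(url).content` instead):

```python
from bs4 import BeautifulSoup

# Simplified stand-in for one product listing from books.toscrape.com
html = """
<ol>
  <article class="product_pod">
    <img src="cover.jpg" alt="A Light in the Attic">
    <p class="star-rating Three"></p>
    <p class="price_color">£51.77</p>
  </article>
</ol>
"""

soup = BeautifulSoup(html, "html.parser")
books = []
for article in soup.find("ol").find_all("article", class_="product_pod"):
    title = article.find("img")["alt"]    # the title lives in the image's alt text
    star = article.find("p")["class"][1]  # e.g. "Three" from class="star-rating Three"
    price = float(article.find("p", class_="price_color").text[1:])  # strip the "£"
    books.append([title, star, price])

print(books)  # [['A Light in the Attic', 'Three', 51.77]]
```

This mirrors the course notebook: find the containing `<ol>`, loop over the `<article>` elements, and pull out title, star rating, and price from each.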
✅ Takeaway
- Web scraping = programmatically extracting data from websites.
- Use requests to fetch HTML and BeautifulSoup to parse and extract.
- Works best on static pages; for JavaScript-heavy sites you may need Selenium/Playwright or an API.
Complete code from course notebook: Web_Scraping_Using_Python (1).ipynb
Every line of code from the course notebook (verbatim).
# --- Code cell 3 ---
# Install required libraries (if not already installed)
!pip install requests
!pip install beautifulsoup4
# --- Code cell 10 ---
import requests
from bs4 import BeautifulSoup
# --- Code cell 11 ---
url="https://books.toscrape.com/catalogue/category/books_1/index.html"
response=requests.get(url)
# --- Code cell 12 ---
response
# --- Code cell 13 ---
response=response.content
soup=BeautifulSoup(response,"html.parser")
soup
# --- Code cell 14 ---
ol=soup.find("ol")
articles=ol.find_all("article",class_="product_pod")
# --- Code cell 15 ---
articles
# --- Code cell 16 ---
for article in articles:
    image=article.find("img")
    title=image.attrs["alt"]
    print(title)
# --- Code cell 17 ---
star=article.find("p")
print(star)
# --- Code cell 18 ---
star=article.find("p")
star=star["class"][1]
print(star)
# --- Code cell 19 ---
for article in articles:
    image=article.find("img")
    title=image.attrs["alt"]
    star=article.find("p")
    star=star["class"][1]
    print(title)
    print(star)
# --- Code cell 20 ---
price=article.find("p",class_="price_color")
print(price)
# --- Code cell 21 ---
for article in articles:
    image=article.find("img")
    title=image.attrs["alt"]
    star=article.find("p")
    star=star["class"][1]
    price=article.find("p",class_="price_color")
    print(price)
# --- Code cell 22 ---
for article in articles:
    image=article.find("img")
    title=image.attrs["alt"]
    star=article.find("p")
    star=star["class"][1]
    price=article.find("p",class_="price_color").text
    print(price)
# --- Code cell 23 ---
price=article.find("p",class_="price_color").text
print(price[1:])
# --- Code cell 24 ---
for article in articles:
    image=article.find("img")
    title=image.attrs["alt"]
    star=article.find("p")
    star=star["class"][1]
    price=article.find("p",class_="price_color").text
    price=float(price[1:])
    print(price)
# --- Code cell 25 ---
books=[]
for article in articles:
    image=article.find("img")
    title=image.attrs["alt"]
    star=article.find("p")
    star=star["class"][1]
    price=article.find("p",class_="price_color").text
    price=float(price[1:])
    books.append([title,star,price])
print(books)
# --- Code cell 26 ---
books=[]
for i in range(1,51):
    url=f"https://books.toscrape.com/catalogue/page-{i}.html"
    response=requests.get(url)
    response=response.content
    soup=BeautifulSoup(response,"html.parser")
    ol=soup.find("ol")
    articles=ol.find_all("article",class_="product_pod")
    for article in articles:
        image=article.find("img")
        title=image.attrs["alt"]
        star=article.find("p")
        star=star["class"][1]
        price=article.find("p",class_="price_color").text
        price=float(price[1:])
        books.append([title,star,price])
print(books)
# --- Code cell 27 ---
import pandas as pd
df=pd.DataFrame(books,columns=["Title","Star","Price"])  # third column holds the price
df
# --- Code cell 28 ---
# web scraping via extension
# --- Code cell 29 ---
# https://www.youtube.com/watch?v=aClnnoQK9G0&pp=ygUdd2ViIHNjcmFwaW5nIGV4dGVuc2lvbiBjaHJvbWU%3D
Complete code from course notebook: Web_Scraping_Using_Python_new.ipynb
Every line of code from the course notebook (verbatim).
# --- Code cell 3 ---
# Install required libraries (if not already installed)
!pip install requests
!pip install beautifulsoup4
# --- Code cell 4 ---
import sys
print(sys.executable)
!{sys.executable} -m pip install beautifulsoup4
# --- Code cell 5 ---
# Import libraries
import requests
from bs4 import BeautifulSoup
# --- Code cell 8 ---
url = 'https://timesofindia.indiatimes.com/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
# Print all h1 tags
headings = soup.find_all('h3')
for h in headings:
    print(h.text)
# --- Code cell 9 ---
soup
# --- Code cell 10 ---
data = []
links = soup.find_all('a')
for link in links:
    href = link.get('href')
    text = link.text.strip()
    if text and href:  # filter empty values
        data.append([text, href])
# --- Code cell 11 ---
import pandas as pd
df = pd.DataFrame(data, columns=["Title", "Link"])
df.sample(16)
# --- Code cell 13 ---
links = soup.find_all('a')
for link in links:
    href = link.get('href')
    text = link.text.strip()
    print(f"{text} → {href}")
# --- Code cell 16 ---
import requests
from bs4 import BeautifulSoup
import pandas as pd
# --- Code cell 17 ---
url = "https://www.worldometers.info/world-population/population-by-country/"
headers = {"User-Agent": "Mozilla/5.0"}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")
# --- Code cell 18 ---
tables = soup.find_all("table")
len(tables)
# --- Code cell 19 ---
tables[0]
# --- Code cell 20 ---
table = soup.find("table", id="example2")
# --- Code cell 21 ---
table = soup.find_all("table")[0]
# --- Code cell 22 ---
headers = []
first_row = table.find_all("tr")[0]
for th in first_row.find_all("th"):
    headers.append(th.text.strip())
headers
# --- Code cell 23 ---
rows = []
for tr in table.find_all("tr")[1:]:
    tds = tr.find_all("td")
    if tds:
        rows.append([td.text.strip() for td in tds])
# --- Code cell 24 ---
import pandas as pd
df = pd.DataFrame(rows, columns=headers)
df.sample(10)
# --- Code cell 25 ---
df.shape
# --- Code cell 28 ---
url = "https://www.imdb.com/chart/top/"
headers = {
"User-Agent": "Mozilla/5.0"
}
response = requests.get(url, headers=headers)
print(response.status_code)
# --- Code cell 29 ---
soup = BeautifulSoup(response.text, "html.parser")
# --- Code cell 30 ---
movies = soup.find_all("li", class_="ipc-metadata-list-summary-item")
len(movies)
# --- Code cell 31 ---
data = []
for movie in movies:
    # Title
    title = movie.find("h3").get_text(strip=True)
    # Year (robust)
    year = None
    for span in movie.find_all("span"):
        txt = span.get_text(strip=True)
        if txt.isdigit() and len(txt) == 4:
            year = txt
            break
    # Rating (robust)
    rating = None
    rating_tag = movie.find("span", attrs={"aria-label": True})
    if rating_tag:
        rating = rating_tag["aria-label"].split()[0]
    data.append([title, year, rating])
# --- Code cell 32 ---
import pandas as pd
df = pd.DataFrame(data, columns=["Title", "Year", "Rating"])
df