Web Scraping Using Python
Extract data from websites automatically with requests and BeautifulSoup.
🕷️ What is Web Scraping?
Web scraping is automatically getting data from websites using code. Instead of copying and pasting by hand, we write a script that visits a page, reads its HTML, and pulls out the bits we want (headlines, prices, reviews, etc.). Data scientists use it when an API is not available.
Why it's useful: Gather data for analysis; monitor prices or news; automate repetitive web tasks.
📚 Libraries You Need
- requests — Downloads the raw HTML of a webpage (like opening the page in a browser and saving the source).
- beautifulsoup4 — Parses that HTML so we can search for tags, classes, and IDs and extract text or links.
Install once: pip install requests beautifulsoup4
requests
Fetches the page. You get back a Response object; .content gives the HTML bytes.
BeautifulSoup
Turns messy HTML into a structure you can search: find by tag name, class, or id and get text or attributes.
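As a quick illustration (using a made-up HTML snippet rather than a real page), BeautifulSoup can parse a string of HTML and pull out text by tag name:

```python
from bs4 import BeautifulSoup

# A tiny, invented HTML snippet standing in for a downloaded page
html = "<html><body><h1>Top Stories</h1><p class='lead'>Hello, world.</p></body></html>"

soup = BeautifulSoup(html, "html.parser")

print(soup.find("h1").text)                # text inside the first <h1>
print(soup.find("p", class_="lead").text)  # first <p> with class="lead"
```

The same `soup.find(...)` calls work identically whether the HTML came from a string or from a downloaded page.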
🔍 Static vs Dynamic Sites
Static: All content is in the HTML when the page loads (e.g. Wikipedia, many docs sites). You can scrape with requests + BeautifulSoup.
Dynamic: Content is loaded or updated with JavaScript after the page loads (e.g. many news feeds, social media). BeautifulSoup alone cannot see that content; you need something like Selenium or Playwright, or an API.
📝 Basic HTML (Simple Idea)
Web pages are built with HTML tags. For scraping you only need a rough idea:
- <h1>…<h6> — Headings
- <p> — Paragraphs
- <a href="..."> — Links
- <div> / <span> — Containers for grouping
- <table>, <tr>, <td> — Tables
BeautifulSoup lets you find these by name, or by class="..." / id="...".
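For example (again with an invented snippet, not a real site), you can search by id, by tag name plus class, and read attributes such as href:

```python
from bs4 import BeautifulSoup

# Invented snippet: two links inside a <div id="nav">
html = """
<div id="nav">
  <a class="item" href="/home">Home</a>
  <a class="item" href="/about">About</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

nav = soup.find("div", id="nav")           # find by id
links = nav.find_all("a", class_="item")   # find by tag name + class
for link in links:
    print(link.text, "->", link["href"])   # .text for content, [...] for attributes
```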
✏️ Minimal Code (Same idea as the course notebook)
In the course source notebook (Web_Scraping_Using_Python), you'll see:
import requests
from bs4 import BeautifulSoup
url = "https://books.toscrape.com/catalogue/category/books_1/index.html"
response = requests.get(url) # Download the page
# response.status_code 200 means OK
html = response.content
soup = BeautifulSoup(html, "html.parser") # Parse HTML
# Then use soup.find("tag") or soup.find_all("tag") to get elements
In simple terms: requests.get(url) brings the page; BeautifulSoup(html, "html.parser") turns it into a searchable tree; then you use .find() or .find_all() to get the parts you need and .text or ["href"] to get text or links.
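The whole pipeline can be sketched end to end. To keep the sketch self-contained, an inline snippet shaped like one `product_pod` block from books.toscrape.com stands in for the downloaded page (on the real site you would use `requests.get(url).content` instead):

```python
from bs4 import BeautifulSoup

# Simplified stand-in for one product listing from books.toscrape.com
html = """
<ol>
  <article class="product_pod">
    <img src="cover.jpg" alt="A Light in the Attic">
    <p class="star-rating Three"></p>
    <p class="price_color">£51.77</p>
  </article>
</ol>
"""

soup = BeautifulSoup(html, "html.parser")
books = []
for article in soup.find("ol").find_all("article", class_="product_pod"):
    title = article.find("img")["alt"]    # the title lives in the image's alt text
    star = article.find("p")["class"][1]  # e.g. "Three" from class="star-rating Three"
    price = float(article.find("p", class_="price_color").text[1:])  # strip the "£"
    books.append([title, star, price])

print(books)  # [['A Light in the Attic', 'Three', 51.77]]
```

This mirrors the course notebook: find the containing `<ol>`, loop over the `<article>` elements, and pull out title, star rating, and price from each.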
✅ Takeaway
- Web scraping = programmatically extracting data from websites.
- Use requests to fetch HTML and BeautifulSoup to parse and extract.
- Works best on static pages; for JavaScript-heavy sites you may need Selenium/Playwright or an API.
Complete code from course notebook: Web_Scraping_Using_Python (1).ipynb
Every line of code from the course notebook (verbatim).
# --- Code cell 3 ---
# Install required libraries (if not already installed)
!pip install requests
!pip install beautifulsoup4
# --- Code cell 10 ---
import requests
from bs4 import BeautifulSoup
# --- Code cell 11 ---
url="https://books.toscrape.com/catalogue/category/books_1/index.html"
response=requests.get(url)
# --- Code cell 12 ---
response
# --- Code cell 13 ---
response=response.content
soup=BeautifulSoup(response,"html.parser")
soup
# --- Code cell 14 ---
ol=soup.find("ol")
articles=ol.find_all("article",class_="product_pod")
# --- Code cell 15 ---
articles
# --- Code cell 16 ---
for article in articles:
    image=article.find("img")
    title=image.attrs["alt"]
    print(title)
# --- Code cell 17 ---
star=article.find("p")
print(star)
# --- Code cell 18 ---
star=article.find("p")
star=star["class"][1]
print(star)
# --- Code cell 19 ---
for article in articles:
    image=article.find("img")
    title=image.attrs["alt"]
    star=article.find("p")
    star=star["class"][1]
    print(title)
    print(star)
# --- Code cell 20 ---
price=article.find("p",class_="price_color")
print(price)
# --- Code cell 21 ---
for article in articles:
    image=article.find("img")
    title=image.attrs["alt"]
    star=article.find("p")
    star=star["class"][1]
    price=article.find("p",class_="price_color")
    print(price)
# --- Code cell 22 ---
for article in articles:
    image=article.find("img")
    title=image.attrs["alt"]
    star=article.find("p")
    star=star["class"][1]
    price=article.find("p",class_="price_color").text
    print(price)
# --- Code cell 23 ---
price=article.find("p",class_="price_color").text
print(price[1:])
# --- Code cell 24 ---
for article in articles:
    image=article.find("img")
    title=image.attrs["alt"]
    star=article.find("p")
    star=star["class"][1]
    price=article.find("p",class_="price_color").text
    price=float(price[1:])
    print(price)
# --- Code cell 25 ---
books=[]
for article in articles:
    image=article.find("img")
    title=image.attrs["alt"]
    star=article.find("p")
    star=star["class"][1]
    price=article.find("p",class_="price_color").text
    price=float(price[1:])
    books.append([title,star,price])
print(books)
# --- Code cell 26 ---
books=[]
for i in range(1,51):
    url=f"https://books.toscrape.com/catalogue/page-{i}.html"
    response=requests.get(url)
    response=response.content
    soup=BeautifulSoup(response,"html.parser")
    ol=soup.find("ol")
    articles=ol.find_all("article",class_="product_pod")
    for article in articles:
        image=article.find("img")
        title=image.attrs["alt"]
        star=article.find("p")
        star=star["class"][1]
        price=article.find("p",class_="price_color").text
        price=float(price[1:])
        books.append([title,star,price])
print(books)
# --- Code cell 27 ---
import pandas as pd
df=pd.DataFrame(books,columns=["Title","Star","Price"])  # third column holds the price
df
# --- Code cell 28 ---
# web scraping via extension
# --- Code cell 29 ---
# https://www.youtube.com/watch?v=aClnnoQK9G0&pp=ygUdd2ViIHNjcmFwaW5nIGV4dGVuc2lvbiBjaHJvbWU%3D
Complete code from course notebook: Web_Scraping_Using_Python_new.ipynb
Every line of code from the course notebook (verbatim).
# --- Code cell 3 ---
# Install required libraries (if not already installed)
!pip install requests
!pip install beautifulsoup4
# --- Code cell 4 ---
import sys
print(sys.executable)
!{sys.executable} -m pip install beautifulsoup4
# --- Code cell 5 ---
# Import libraries
import requests
from bs4 import BeautifulSoup
# --- Code cell 8 ---
url = 'https://timesofindia.indiatimes.com/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
# Print all h1 tags
headings = soup.find_all('h3')
for h in headings:
    print(h.text)
# --- Code cell 9 ---
soup
# --- Code cell 10 ---
data = []
links = soup.find_all('a')
for link in links:
    href = link.get('href')
    text = link.text.strip()
    if text and href:  # filter empty values
        data.append([text, href])
# --- Code cell 11 ---
import pandas as pd
df = pd.DataFrame(data, columns=["Title", "Link"])
df.sample(16)
# --- Code cell 13 ---
links = soup.find_all('a')
for link in links:
    href = link.get('href')
    text = link.text.strip()
    print(f"{text} → {href}")
# --- Code cell 16 ---
import requests
from bs4 import BeautifulSoup
import pandas as pd
# --- Code cell 17 ---
url = "https://www.worldometers.info/world-population/population-by-country/"
headers = {"User-Agent": "Mozilla/5.0"}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")
# --- Code cell 18 ---
tables = soup.find_all("table")
len(tables)
# --- Code cell 19 ---
tables[0]
# --- Code cell 20 ---
table = soup.find("table", id="example2")
# --- Code cell 21 ---
table = soup.find_all("table")[0]
# --- Code cell 22 ---
headers = []
first_row = table.find_all("tr")[0]
for th in first_row.find_all("th"):
    headers.append(th.text.strip())
headers
# --- Code cell 23 ---
rows = []
for tr in table.find_all("tr")[1:]:
    tds = tr.find_all("td")
    if tds:
        rows.append([td.text.strip() for td in tds])
# --- Code cell 24 ---
import pandas as pd
df = pd.DataFrame(rows, columns=headers)
df.sample(10)
# --- Code cell 25 ---
df.shape
# --- Code cell 28 ---
url = "https://www.imdb.com/chart/top/"
headers = {
"User-Agent": "Mozilla/5.0"
}
response = requests.get(url, headers=headers)
print(response.status_code)
# --- Code cell 29 ---
soup = BeautifulSoup(response.text, "html.parser")
# --- Code cell 30 ---
movies = soup.find_all("li", class_="ipc-metadata-list-summary-item")
len(movies)
# --- Code cell 31 ---
data = []
for movie in movies:
    # Title
    title = movie.find("h3").get_text(strip=True)
    # Year (robust)
    year = None
    for span in movie.find_all("span"):
        txt = span.get_text(strip=True)
        if txt.isdigit() and len(txt) == 4:
            year = txt
            break
    # Rating (robust)
    rating = None
    rating_tag = movie.find("span", attrs={"aria-label": True})
    if rating_tag:
        rating = rating_tag["aria-label"].split()[0]
    data.append([title, year, rating])
# --- Code cell 32 ---
import pandas as pd
df = pd.DataFrame(data, columns=["Title", "Year", "Rating"])
df