Automating Web Scraping with Python
Web scraping is the process of extracting data from websites. Python provides powerful libraries such as BeautifulSoup, Selenium, and Scrapy for automating web data extraction.
Installing Required Libraries
To perform web scraping, install the necessary libraries:
pip install requests beautifulsoup4 selenium
requests: Fetches web pages.
BeautifulSoup: Parses HTML and extracts data.
Selenium: Automates web interactions (useful for JavaScript-heavy sites).
Scraping Static Websites with BeautifulSoup
Fetching Web Page Content
import requests
from bs4 import BeautifulSoup
# Define the target URL
url = "https://example.com"
# Send an HTTP request
response = requests.get(url)
# Parse the page content
soup = BeautifulSoup(response.text, "html.parser")
# Print the page title
print(soup.title.text)
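In practice it is worth confirming that the request actually succeeded before parsing. Here is a minimal sketch, using the same placeholder URL, that adds a timeout and a status check:
import requests
from bs4 import BeautifulSoup
url = "https://example.com"
# A timeout prevents the script from hanging on an unresponsive server
response = requests.get(url, timeout=10)
# Raise an error for 4xx/5xx responses instead of parsing an error page
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")
print(soup.title.text)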
Extracting Specific Elements
# Extract all headings (h1)
headings = soup.find_all("h1")
for heading in headings:
    print(heading.text)
# Extract all links
links = soup.find_all("a")
for link in links:
    print(link.get("href"))
# Extract paragraphs with a specific class
paragraphs = soup.find_all("p", class_="content")
for para in paragraphs:
    print(para.text)
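BeautifulSoup also supports CSS selectors through select(), which can be more concise for nested structures. A small sketch follows; the selectors are illustrative and depend on the page's actual markup:
# Select paragraphs with class "content" inside a div with id "main"
for para in soup.select("div#main p.content"):
    print(para.get_text(strip=True))
# Select only links whose href starts with "http"
for link in soup.select('a[href^="http"]'):
    print(link["href"])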
Scraping Dynamic Websites with Selenium
Some websites load content dynamically using JavaScript, which requests and BeautifulSoup cannot scrape directly. Selenium automates browser interactions to extract such data.
Setting Up Selenium
pip install webdriver-manager
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
# Setup WebDriver
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)
# Open a website
driver.get("https://example.com")
# Extract page title
print(driver.title)
# Close browser
driver.quit()
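For unattended scraping jobs it is common to run the browser headless so no window opens. A sketch of the same setup using Chrome options (the flag names follow recent Chrome versions):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
options = Options()
options.add_argument("--headless=new")  # Run Chrome without a visible window
options.add_argument("--window-size=1920,1080")  # Consistent layout for element lookups
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service, options=options)
driver.get("https://example.com")
print(driver.title)
driver.quit()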
Extracting Elements with Selenium
# Open website
driver.get("https://example.com")
# Find elements by tag name
headings = driver.find_elements(By.TAG_NAME, "h1")
for heading in headings:
    print(heading.text)
# Find elements by class name
paragraphs = driver.find_elements(By.CLASS_NAME, "content")
for para in paragraphs:
    print(para.text)
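Besides tag and class names, Selenium supports other locator strategies such as IDs, CSS selectors, and XPath. The element names below are placeholders; which strategy works best depends on the page:
# Locate a single element by its id attribute
header = driver.find_element(By.ID, "header")
# Locate elements with a CSS selector
prices = driver.find_elements(By.CSS_SELECTOR, "span.price")
# Locate elements with an XPath expression
buttons = driver.find_elements(By.XPATH, "//button[@type='submit']")
print(header.text, len(prices), len(buttons))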
Handling JavaScript-Rendered Content
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Wait for an element to load
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "dynamic-content"))
)
print(element.text)
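Some pages only load additional content as you scroll. One common approach is to scroll with JavaScript and then wait for the new elements to appear; a rough sketch, where the class name is again a placeholder:
import time
# Scroll to the bottom of the page to trigger lazy loading
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(2)  # Crude pause; an explicit wait on a specific element is more reliable
items = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CLASS_NAME, "dynamic-content"))
)
print(len(items), "items loaded")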
Scraping Tables and Exporting Data
Extracting Table Data from a Web Page
table = soup.find("table")
rows = table.find_all("tr")
for row in rows:
    columns = row.find_all("td")
    print([col.text.strip() for col in columns])
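If the table is well-formed HTML, pandas can parse it in one step with read_html, which returns a list of DataFrames. This sketch assumes the page contains a table element and that a parser backend such as lxml or html5lib is installed:
import io
import pandas as pd
# Parse every <table> on the page into a DataFrame
tables = pd.read_html(io.StringIO(response.text))
print(tables[0].head())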
Saving Scraped Data to a CSV File
import csv
# Open CSV file for writing
with open("scraped_data.csv", "w", newline="") as file:
    writer = csv.writer(file)
    writer.writerow(["Column1", "Column2", "Column3"])  # Column headers
    for row in rows:
        columns = row.find_all("td")
        writer.writerow([col.text.strip() for col in columns])
print("Data saved to scraped_data.csv")
Saving Scraped Data to an Excel File
import pandas as pd
# Convert table data to DataFrame
data = []
for row in rows:
    columns = row.find_all("td")
    data.append([col.text.strip() for col in columns])
df = pd.DataFrame(data, columns=["Column1", "Column2", "Column3"])
# Writing .xlsx files requires the openpyxl package (pip install openpyxl)
df.to_excel("scraped_data.xlsx", index=False)
print("Data saved to scraped_data.xlsx")
Automating Web Scraping Reports
Scraping and Emailing Data
import yagmail
# Send scraped data via email
# For Gmail, use an app password here rather than your regular account password
yag = yagmail.SMTP("your_email@gmail.com", "your_password")
yag.send(
    to="recipient@example.com",
    subject="Scraped Data Report",
    contents="Here is the latest web scraped data.",
    attachments="scraped_data.xlsx"
)
print("Scraped data emailed successfully.")
Conclusion
This section covered automating web scraping using requests, BeautifulSoup, and Selenium, along with exporting data to CSV/Excel and emailing reports. These techniques are useful for data aggregation, price monitoring, and content extraction.