Automating Web Scraping with Python

Web scraping is the process of extracting data from websites. Python provides powerful libraries such as BeautifulSoup, Selenium, and Scrapy for automating web data extraction.


Installing Required Libraries

To perform web scraping, install the necessary libraries:

pip install requests beautifulsoup4 selenium
  • requests: Fetches web pages.
  • BeautifulSoup: Parses HTML and extracts data.
  • Selenium: Automates web interactions (useful for JavaScript-heavy sites).

Scraping Static Websites with BeautifulSoup

Fetching Web Page Content

import requests
from bs4 import BeautifulSoup

# Define the target URL
url = "https://example.com"

# Send an HTTP request
response = requests.get(url)

# Parse the page content
soup = BeautifulSoup(response.text, "html.parser")

# Print the page title
print(soup.title.text)
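
In practice, it is worth confirming that the request actually succeeded before parsing, and identifying your client with a User-Agent header. A minimal sketch along those lines, assuming the same example URL (the User-Agent string is just an illustrative placeholder):

import requests

url = "https://example.com"

# Identify the client and avoid waiting indefinitely on a slow server
headers = {"User-Agent": "Mozilla/5.0 (compatible; demo-scraper/1.0)"}
response = requests.get(url, headers=headers, timeout=10)

# Raise an exception for 4xx/5xx responses instead of parsing an error page
response.raise_for_status()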

Extracting Specific Elements

# Extract all headings (h1)
headings = soup.find_all("h1")
for heading in headings:
    print(heading.text)

# Extract all links
links = soup.find_all("a")
for link in links:
    print(link.get("href"))

# Extract paragraphs with a specific class
paragraphs = soup.find_all("p", class_="content")
for para in paragraphs:
    print(para.text)
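
BeautifulSoup also supports CSS selectors via select(), which can be more concise for nested markup. A small sketch, assuming a hypothetical div with class "article":

# Select paragraphs inside a div with class "article" using a CSS selector
for para in soup.select("div.article p"):
    print(para.get_text(strip=True))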

Scraping Dynamic Websites with Selenium

Some websites load content dynamically using JavaScript, which requests and BeautifulSoup cannot scrape directly. Selenium automates browser interactions to extract such data.

Setting Up Selenium

Install webdriver-manager so the matching ChromeDriver binary is downloaded automatically:

pip install webdriver-manager

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By

# Setup WebDriver
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)

# Open a website
driver.get("https://example.com")

# Extract page title
print(driver.title)

# Close browser
driver.quit()
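
For unattended jobs, Chrome is commonly run headless so no browser window opens. A minimal sketch (the exact headless flag can vary with the Chrome version; --headless=new applies to recent releases):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Start Chrome without a visible window
options = webdriver.ChromeOptions()
options.add_argument("--headless=new")

service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service, options=options)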

Extracting Elements with Selenium

# Open website
driver.get("https://example.com")

# Find elements by tag name
headings = driver.find_elements(By.TAG_NAME, "h1")
for heading in headings:
    print(heading.text)

# Find elements by class name
paragraphs = driver.find_elements(By.CLASS_NAME, "content")
for para in paragraphs:
    print(para.text)

Handling JavaScript-Rendered Content

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait for an element to load
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "dynamic-content"))
)
print(element.text)
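
Some pages only load additional content as you scroll (infinite scroll). A common approach, sketched here, is to scroll with a small piece of JavaScript and give the page a moment to load:

import time

# Scroll to the bottom of the page to trigger lazy-loaded content
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(2)  # simple pause; an explicit WebDriverWait is usually more reliable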

Scraping Tables and Exporting Data

Extracting Table Data from a Web Page

table = soup.find("table")
rows = table.find_all("tr")

for row in rows:
    columns = row.find_all("td")
    print([col.text.strip() for col in columns])
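
Many tables keep their column headers in th cells on the first row. A small sketch, assuming that layout, which reads headers and data rows separately:

# Read header cells (th) from the first row, data cells (td) from the rest
header_cells = rows[0].find_all("th")
print([cell.text.strip() for cell in header_cells])

for row in rows[1:]:
    columns = row.find_all("td")
    print([col.text.strip() for col in columns])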

Saving Scraped Data to a CSV File

import csv

# Open CSV file for writing
with open("scraped_data.csv", "w", newline="") as file:
    writer = csv.writer(file)
    writer.writerow(["Column1", "Column2", "Column3"])  # Column headers

    for row in rows:
        columns = row.find_all("td")
        writer.writerow([col.text.strip() for col in columns])

print("Data saved to scraped_data.csv")

Saving Scraped Data to an Excel File

import pandas as pd

# Convert table data to DataFrame
data = []
for row in rows:
    columns = row.find_all("td")
    data.append([col.text.strip() for col in columns])

df = pd.DataFrame(data, columns=["Column1", "Column2", "Column3"])
df.to_excel("scraped_data.xlsx", index=False)

print("Data saved to scraped_data.xlsx")

Automating Web Scraping Reports

Scraping and Emailing Data

import yagmail

# Send scraped data via email (Gmail accounts typically require an app password here)
yag = yagmail.SMTP("your_email@gmail.com", "your_password")
yag.send(
    to="recipient@example.com",
    subject="Scraped Data Report",
    contents="Here is the latest web scraped data.",
    attachments="scraped_data.xlsx"
)

print("Scraped data emailed successfully.")

Conclusion

This post covered automating web scraping using requests, BeautifulSoup, and Selenium, along with exporting data to CSV/Excel and emailing reports. These techniques are useful for data aggregation, price monitoring, and content extraction.
