Automating PDF Files with Python

 

PDF files are commonly used for sharing reports, invoices, and forms. Third-party Python libraries such as PyPDF2, pdfplumber, and ReportLab can automate tasks like extracting text, merging and splitting files, and creating PDFs.


Installing Required Libraries

To work with PDFs in Python, install the necessary libraries using pip:

pip install PyPDF2 pdfplumber reportlab
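  • PyPDF2: Extracts text, merges, splits, and modifies PDFs. The project now continues under the name pypdf, which provides the same PdfReader and PdfWriter classes, so new projects may prefer to install that instead.
  • pdfplumber: Extracts text and tables from PDFs, using the position of each character to preserve layout.
  • reportlab: Creates new PDFs from scratch.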
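
A quick way to confirm the installation worked is to ask Python whether each package can be found. The sketch below is only a convenience check; the helper name check_libraries is made up for this example, and it uses nothing beyond the standard library.

```python
from importlib.util import find_spec

# Import names of the three packages installed above.
REQUIRED = ("PyPDF2", "pdfplumber", "reportlab")


def check_libraries(names=REQUIRED):
    """Return a dict mapping each import name to True if it can be imported."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_libraries().items():
        print(f"{name}: {'installed' if found else 'missing'}")
```

Any package reported as missing can be installed again with the pip command above.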

Reading PDF Files

Extracting Text from a PDF

import PyPDF2

# Open the PDF file
with open("document.pdf", "rb") as pdf_file:
    reader = PyPDF2.PdfReader(pdf_file)
    
    # Extract text from each page
    for page in reader.pages:
        print(page.extract_text())
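
Printing is fine for a quick look, but in practice the text is often saved to a file. The following is a minimal sketch of that idea, assuming pages without extractable text (for example scanned pages) should simply be skipped. The helper names join_page_texts and pdf_to_text_file are hypothetical, not part of PyPDF2.

```python
def join_page_texts(pages, separator="\n\n"):
    """Join the text of every page, skipping pages with no extractable text."""
    texts = []
    for page in pages:
        text = page.extract_text() or ""  # treat a missing result as empty
        if text.strip():
            texts.append(text.strip())
    return separator.join(texts)


def pdf_to_text_file(pdf_path, txt_path):
    """Extract all text from pdf_path and write it to txt_path as UTF-8."""
    # Imported here so join_page_texts can be used without PyPDF2 installed.
    import PyPDF2

    with open(pdf_path, "rb") as pdf_file:
        reader = PyPDF2.PdfReader(pdf_file)
        text = join_page_texts(reader.pages)
    with open(txt_path, "w", encoding="utf-8") as out:
        out.write(text)
    return text
```

Calling pdf_to_text_file("document.pdf", "document.txt") would write the combined text next to the original file.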

Extracting Text Using pdfplumber (More Accurate for Tables)

import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        print(page.extract_text())

Extracting Data from PDF Tables

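import pdfplumber
import pandas as pd

with pdfplumber.open("table_document.pdf") as pdf:
    # extract_table() returns None when no table is detected on the page
    table_data = pdf.pages[0].extract_table()

if table_data:
    # Use the first row as the header and the remaining rows as data
    df = pd.DataFrame(table_data[1:], columns=table_data[0])
    print(df)
else:
    print("No table found on the first page.")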
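
Extracted tables tend to need some cleanup before use, since cells can come back empty or padded with whitespace. The sketch below shows one way to turn the list of rows into plain dictionaries without pandas. It assumes the first row is the header and that empty cells may arrive as None; table_to_records is a made-up helper name.

```python
def table_to_records(table):
    """Convert a list of rows (first row = header) into a list of dicts."""
    if not table:
        return []  # covers both None and an empty table
    header = [(cell or "").strip() for cell in table[0]]
    records = []
    for row in table[1:]:
        cells = [(cell or "").strip() for cell in row]
        if any(cells):  # drop rows that are completely empty
            records.append(dict(zip(header, cells)))
    return records
```

The result of pdf.pages[0].extract_table() can be passed straight to table_to_records.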

Merging and Splitting PDFs

Merging Multiple PDFs

import PyPDF2

pdf_files = ["file1.pdf", "file2.pdf"]
merger = PyPDF2.PdfMerger()

for pdf in pdf_files:
    merger.append(pdf)

merger.write("merged_document.pdf")
merger.close()
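
When the list of files comes from a folder rather than being typed by hand, the order matters: a plain alphabetical sort puts file10.pdf before file2.pdf. The following sketch sorts names numerically before merging. The helpers natural_key and merge_folder are hypothetical names, and the folder is assumed to contain only the PDFs to be merged.

```python
import re
from pathlib import Path


def natural_key(name):
    """Sort key that compares runs of digits as numbers (file2 < file10)."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


def merge_folder(folder, output="merged_document.pdf"):
    """Merge every PDF in folder, in natural filename order, into output."""
    # Imported here so natural_key can be used without PyPDF2 installed.
    import PyPDF2

    paths = sorted(Path(folder).glob("*.pdf"), key=lambda p: natural_key(p.name))
    merger = PyPDF2.PdfMerger()
    try:
        for path in paths:
            merger.append(str(path))
        merger.write(output)
    finally:
        merger.close()  # release file handles even if a file fails to load
    return [path.name for path in paths]
```

Writing the output to a different folder avoids picking up the merged file on the next run.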

Splitting a PDF into Separate Pages

import PyPDF2

with open("document.pdf", "rb") as pdf_file:
    reader = PyPDF2.PdfReader(pdf_file)
    
    for i, page in enumerate(reader.pages):
        writer = PyPDF2.PdfWriter()
        writer.add_page(page)
        
        with open(f"page_{i+1}.pdf", "wb") as output_pdf:
            writer.write(output_pdf)

Extracting Specific Pages from a PDF

import PyPDF2

with open("document.pdf", "rb") as pdf_file:
    reader = PyPDF2.PdfReader(pdf_file)
    writer = PyPDF2.PdfWriter()
    
    # Extract pages 2 to 4
    for i in range(1, 4):  # Indexing starts at 0
        writer.add_page(reader.pages[i])

    with open("extracted_pages.pdf", "wb") as output_pdf:
        writer.write(output_pdf)
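
The off-by-one between page numbers (starting at 1) and list indexes (starting at 0) is easy to get wrong. As an illustration, a small helper can translate a human-friendly page specification such as "2-4,7" into indexes. The name parse_page_spec and the specification format are assumptions made for this sketch.

```python
def parse_page_spec(spec, page_count):
    """Turn a spec like "2-4,7" (1-based, inclusive) into 0-based indexes."""
    indexes = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if not 1 <= start <= end <= page_count:
            raise ValueError(f"Page range {part!r} is outside 1-{page_count}")
        indexes.extend(range(start - 1, end))  # shift to 0-based indexing
    return indexes
```

With a reader open, parse_page_spec("2-4", len(reader.pages)) gives the indexes to pass to reader.pages, and it raises ValueError instead of an IndexError when the document is too short.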

Adding Watermarks to PDFs

import PyPDF2

# Open original PDF and watermark PDF
with open("document.pdf", "rb") as pdf_file, open("watermark.pdf", "rb") as watermark_file:
    reader = PyPDF2.PdfReader(pdf_file)
    watermark = PyPDF2.PdfReader(watermark_file).pages[0]
    writer = PyPDF2.PdfWriter()

    # Apply watermark to each page
    for page in reader.pages:
        page.merge_page(watermark)
        writer.add_page(page)

    with open("watermarked_document.pdf", "wb") as output_pdf:
        writer.write(output_pdf)
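
The example above assumes a watermark.pdf already exists. One possible way to produce it is with ReportLab, as sketched below. The function name make_watermark, the text, and the styling values are all illustrative choices, and a US letter page is assumed.

```python
def make_watermark(path="watermark.pdf", text="CONFIDENTIAL"):
    """Write a one-page PDF containing large, faint, diagonal text."""
    # Imported inside the function so this sketch can be loaded
    # even on a machine where ReportLab is not installed yet.
    from reportlab.lib.pagesizes import letter
    from reportlab.pdfgen import canvas

    width, height = letter
    pdf = canvas.Canvas(path, pagesize=letter)
    pdf.setFont("Helvetica-Bold", 60)
    pdf.setFillColorRGB(0.6, 0.6, 0.6)    # light grey
    pdf.setFillAlpha(0.3)                 # mostly transparent
    pdf.translate(width / 2, height / 2)  # move the origin to the page centre
    pdf.rotate(45)                        # tilt the text diagonally
    pdf.drawCentredString(0, 0, text)
    pdf.save()
    return path
```

The watermark looks best when its page size matches the pages of the document being stamped.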

Creating a PDF from Scratch

Generating a Simple PDF with Text

from reportlab.pdfgen import canvas

pdf = canvas.Canvas("generated.pdf")
pdf.drawString(100, 750, "Hello, this is an automated PDF!")
pdf.save()

Creating a PDF with Custom Formatting

from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

pdf = canvas.Canvas("styled.pdf", pagesize=letter)
pdf.setFont("Helvetica-Bold", 14)
pdf.drawString(100, 750, "Python Automated PDF Report")

pdf.setFont("Helvetica", 12)
pdf.drawString(100, 730, "This report was generated using Python and ReportLab.")

pdf.save()
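
drawString places a single line at fixed coordinates and does not wrap or paginate, so longer text needs its positions worked out by hand. The sketch below shows one simple approach: wrap the text, step the y coordinate down line by line, and start a new page when the bottom margin is reached. The names layout_lines and write_report are hypothetical, and the 80-character width is a rough guess for 12-point Helvetica on a letter page.

```python
import textwrap


def layout_lines(text, width=80, top=750, bottom=50, leading=14):
    """Return a list of pages; each page is a list of (y, line) tuples."""
    pages, current, y = [], [], top
    for line in textwrap.wrap(text, width=width):
        if y < bottom:  # no room left on this page, so start a new one
            pages.append(current)
            current, y = [], top
        current.append((y, line))
        y -= leading
    if current:
        pages.append(current)
    return pages


def write_report(path, text):
    """Draw text onto as many letter-size pages as it needs."""
    # Imported here so layout_lines can be used without ReportLab installed.
    from reportlab.lib.pagesizes import letter
    from reportlab.pdfgen import canvas

    pdf = canvas.Canvas(path, pagesize=letter)
    for page in layout_lines(text):
        pdf.setFont("Helvetica", 12)  # the font is reset after each showPage()
        for y, line in page:
            pdf.drawString(100, y, line)
        pdf.showPage()
    pdf.save()
```

For richer layouts, ReportLab's higher-level Platypus module handles wrapping and page breaks automatically.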

Extracting Images from PDFs

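import pdfplumber

with pdfplumber.open("document_with_images.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        for j, image in enumerate(page.images):
            # The raw image stream is not necessarily PNG data, so render
            # the image's bounding box on the page and save that instead.
            # crop() raises ValueError if the box extends past the page edge.
            bbox = (image["x0"], image["top"], image["x1"], image["bottom"])
            cropped = page.crop(bbox)
            cropped.to_image(resolution=150).save(f"image_{i}_{j}.png")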

Conclusion

This post covered automating common PDF tasks: extracting text, tables, and images; merging and splitting files; adding watermarks; and generating PDFs from scratch. These techniques are useful for automating document workflows, generating reports, and extracting structured data from PDFs.
