Automating PDF Files with Python
PDF files are commonly used for sharing reports, invoices, and forms. Python provides libraries such as PyPDF2, pdfplumber, and ReportLab to automate tasks like extracting text, merging, splitting, and creating PDFs.
Installing Required Libraries
To work with PDFs in Python, install the necessary libraries using pip:
pip install PyPDF2 pdfplumber reportlab
PyPDF2: Extracts text, merges, splits, and modifies PDFs.
pdfplumber: Extracts text and tables from PDFs, typically more accurately than PyPDF2.
reportlab: Creates new PDFs from scratch.
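Before running the examples, it can help to confirm that the three packages are actually installed in the active environment. The following is a minimal sketch using only the standard library; the helper name installed_versions is hypothetical and not part of any of the libraries above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # Distribution names as used in the pip command above
    for name, ver in installed_versions(["PyPDF2", "pdfplumber", "reportlab"]).items():
        print(f"{name}: {ver or 'not installed'}")
```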
Reading PDF Files
Extracting Text from a PDF
import PyPDF2

# Open the PDF file
with open("document.pdf", "rb") as pdf_file:
    reader = PyPDF2.PdfReader(pdf_file)
    # Extract text from each page
    for page in reader.pages:
        print(page.extract_text())
Extracting Text Using pdfplumber (More Accurate for Tables)
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        print(page.extract_text())
Extracting Data from PDF Tables
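import pdfplumber
import pandas as pd

with pdfplumber.open("table_document.pdf") as pdf:
    # extract_table() returns None when no table is found on the page
    table_data = pdf.pages[0].extract_table()

# Convert table data into a DataFrame, using the first row as the header
if table_data:
    df = pd.DataFrame(table_data[1:], columns=table_data[0])
    print(df)
else:
    print("No table found on the first page.")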
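Extracted tables often contain empty cells (returned as None), stray whitespace, or fully blank rows, and extract_table() returns None when the page has no table. As a rough sketch of one way to clean such output before further processing, the hypothetical helper below (table_to_records is not a pdfplumber function) converts the list-of-rows structure into a list of dictionaries, assuming the first row is the header.

```python
def table_to_records(table):
    """Turn a list-of-rows table (first row = header) into a list of dicts."""
    if not table:
        return []
    # Replace missing header cells with placeholder names such as column_1
    header = [
        str(cell).strip() if cell is not None else f"column_{i}"
        for i, cell in enumerate(table[0])
    ]
    records = []
    for row in table[1:]:
        # Skip rows in which every cell is None or blank
        if all(cell is None or str(cell).strip() == "" for cell in row):
            continue
        cleaned = [cell.strip() if isinstance(cell, str) else cell for cell in row]
        records.append(dict(zip(header, cleaned)))
    return records
```

The resulting list can be passed straight to pd.DataFrame(records) if a DataFrame is still wanted.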
Merging and Splitting PDFs
Merging Multiple PDFs
import PyPDF2

pdf_files = ["file1.pdf", "file2.pdf"]
merger = PyPDF2.PdfMerger()
for pdf in pdf_files:
    merger.append(pdf)
merger.write("merged_document.pdf")
merger.close()
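The merged document follows the order of the pdf_files list, so the list order matters when filenames are collected automatically. Plain alphabetical sorting places file10.pdf before file2.pdf. The sketch below shows one possible way to get a numeric-aware order; natural_key and merge_order are hypothetical helper names, not part of PyPDF2.

```python
import re


def natural_key(name):
    """Split a filename into text and integer chunks so 'file2' sorts before 'file10'."""
    return [
        int(chunk) if chunk.isdecimal() else chunk.lower()
        for chunk in re.split(r"(\d+)", name)
    ]


def merge_order(filenames):
    """Return the filenames in numeric-aware, case-insensitive order."""
    return sorted(filenames, key=natural_key)
```

For example, pdf_files = merge_order(["file10.pdf", "file2.pdf"]) could replace the hard-coded list before the merge loop runs.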
Splitting a PDF into Separate Pages
import PyPDF2

with open("document.pdf", "rb") as pdf_file:
    reader = PyPDF2.PdfReader(pdf_file)
    for i, page in enumerate(reader.pages):
        # Write each page to its own single-page PDF
        writer = PyPDF2.PdfWriter()
        writer.add_page(page)
        with open(f"page_{i+1}.pdf", "wb") as output_pdf:
            writer.write(output_pdf)
Extracting Specific Pages from a PDF
import PyPDF2

with open("document.pdf", "rb") as pdf_file:
    reader = PyPDF2.PdfReader(pdf_file)
    writer = PyPDF2.PdfWriter()
    # Extract pages 2 to 4 (zero-based indices 1 to 3)
    for i in range(1, 4):
        writer.add_page(reader.pages[i])
    # Write the output while the source file is still open
    with open("extracted_pages.pdf", "wb") as output_pdf:
        writer.write(output_pdf)
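Because people usually describe pages with 1-based numbers while reader.pages is indexed from 0, off-by-one mistakes are easy to make here. The following sketch shows one way to translate a human-readable specification such as "2-4,7" into zero-based indices; parse_page_ranges is a hypothetical helper and the spec format is an assumption for illustration.

```python
def parse_page_ranges(spec, page_count):
    """Convert a 1-based spec such as '2-4,7' into sorted zero-based page indices."""
    indices = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        # Reject ranges that are reversed or fall outside the document
        if start < 1 or end > page_count or start > end:
            raise ValueError(f"Page range {part!r} is outside 1-{page_count}")
        indices.update(range(start - 1, end))
    return sorted(indices)
```

With a reader in hand, the loop above could then iterate over parse_page_ranges("2-4", len(reader.pages)) instead of a fixed range.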
Adding Watermarks to PDFs
import PyPDF2

# Open original PDF and watermark PDF
with open("document.pdf", "rb") as pdf_file, open("watermark.pdf", "rb") as watermark_file:
    reader = PyPDF2.PdfReader(pdf_file)
    # Use the first page of watermark.pdf as the watermark
    watermark = PyPDF2.PdfReader(watermark_file).pages[0]
    writer = PyPDF2.PdfWriter()
    # Apply watermark to each page
    for page in reader.pages:
        page.merge_page(watermark)
        writer.add_page(page)
    # Write the output while both source files are still open
    with open("watermarked_document.pdf", "wb") as output_pdf:
        writer.write(output_pdf)
Creating a PDF from Scratch
Generating a Simple PDF with Text
from reportlab.pdfgen import canvas
pdf = canvas.Canvas("generated.pdf")
pdf.drawString(100, 750, "Hello, this is an automated PDF!")
pdf.save()
Creating a PDF with Custom Formatting
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas
pdf = canvas.Canvas("styled.pdf", pagesize=letter)
pdf.setFont("Helvetica-Bold", 14)
pdf.drawString(100, 750, "Python Automated PDF Report")
pdf.setFont("Helvetica", 12)
pdf.drawString(100, 730, "This report was generated using Python and ReportLab.")
pdf.save()
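In both examples the coordinates passed to drawString are measured in points (1/72 inch) from the bottom-left corner of the page, so each new line needs a smaller y value, and a US Letter page is 612 x 792 points. The sketch below illustrates one way to compute line positions and page breaks for a longer report. It is plain Python with no ReportLab dependency, and line_positions and its default values are hypothetical choices rather than ReportLab settings.

```python
def line_positions(line_count, top=750, leading=18, bottom_margin=72):
    """Return a (page_number, y) pair for each line, moving to a new page
    whenever the next baseline would fall below the bottom margin."""
    positions = []
    page, y = 1, top
    for _ in range(line_count):
        if y < bottom_margin:
            # Start again at the top of a fresh page
            page += 1
            y = top
        positions.append((page, y))
        y -= leading
    return positions
```

When drawing, one would call pdf.showPage() each time the page number changes and then pdf.drawString(100, y, text) for each line. Since showPage() resets canvas state, the font should be set again on every new page.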
Extracting Images from PDFs
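import pdfplumber

with pdfplumber.open("document_with_images.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        for j, image in enumerate(page.images):
            # get_data() returns the decoded stream bytes, not a PNG file.
            # JPEG-encoded images can usually be saved as-is; other encodings
            # yield raw pixel data that needs an imaging library to convert.
            with open(f"image_{i}_{j}.bin", "wb") as img_file:
                img_file.write(image["stream"].get_data())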
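The bytes stored in a PDF image stream are not guaranteed to be in any particular file format, so the extension given to the output file may not match its contents. As a hedged sketch, the hypothetical helper below inspects the leading "magic bytes" of the data to choose an extension. It recognizes only JPEG and PNG signatures and falls back to .bin for anything else, including raw pixel data.

```python
def guess_image_extension(data):
    """Pick a file extension from the leading magic bytes of an image stream."""
    if data.startswith(b"\xff\xd8\xff"):
        return ".jpg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return ".png"
    # Unknown or raw pixel data: keep a neutral extension
    return ".bin"
```

In the loop above, the data could be read once with data = image["stream"].get_data() and the filename built as f"image_{i}_{j}{guess_image_extension(data)}".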
Conclusion
This section covered automating PDF tasks, including extracting text and tables, merging, splitting, watermarking, and generating PDFs from scratch. These techniques are useful for automating document workflows, generating reports, and extracting structured data from PDFs.