Extracting Text from PDFs using PyPDF2 and pdfplumber

 

PDFs (Portable Document Format) are widely used for sharing documents, but extracting data from PDFs can be tricky, especially when you need to pull text out for analysis, processing, or reporting. Fortunately, Python offers some excellent libraries to handle PDF text extraction efficiently. Two popular libraries for this task are PyPDF2 and pdfplumber.

In this blog post, we’ll go over:

✅ How to extract text using PyPDF2
✅ How to extract text using pdfplumber
✅ Differences between the two libraries
✅ Practical examples of text extraction from PDFs


🧰 What You'll Need

  • Python 3.x

  • Knowledge of how to install Python libraries using pip

  • A PDF file to extract text from


📥 Installing PyPDF2 and pdfplumber

Before starting, install both libraries with pip. The final example in this post also uses pandas, so include it if you plan to follow along:

pip install PyPDF2 pdfplumber pandas

📄 Extracting Text Using PyPDF2

1. Introduction to PyPDF2

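PyPDF2 is a Python library that can extract text and manipulate PDFs (merging, splitting, rotating pages, and so on). It works best for simple text extraction and does not handle complex layouts well. Note that PyPDF2 is no longer actively developed: its maintainers continue the project under the name pypdf, which offers the same PdfReader interface used below.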

2. Extracting Text from a PDF with PyPDF2

Here’s how to use PyPDF2 to extract text from a PDF:

import PyPDF2

# Open the PDF file
with open("example.pdf", "rb") as file:
    pdf_reader = PyPDF2.PdfReader(file)
    
    # Get the number of pages in the PDF
    num_pages = len(pdf_reader.pages)
    print(f"Pages: {num_pages}")

    # Extract text from all pages
    text = ""
    for page in pdf_reader.pages:
        # Fall back to "" if a page yields no text (e.g. a scanned image)
        text += (page.extract_text() or "") + "\n"

    print(text)

Explanation:

  1. PyPDF2.PdfReader(file): Creates a reader from the file object opened in binary mode.

  2. pdf_reader.pages: A list-like sequence of the pages in the PDF, so len() and indexing both work.

  3. page.extract_text(): Extracts text from each page of the PDF.

Note: PyPDF2 works best with PDFs that contain simple, linear text. It may struggle with PDFs containing complex layouts, images, or tables.
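When a PDF mixes text pages with scanned pages, it can help to keep each page's text separate and flag the pages that came back empty. The following is a minimal sketch under that assumption. The helper names extract_pages, empty_page_numbers, and read_pdf_pages are made up for this post and are not part of PyPDF2.

```python
def extract_pages(reader):
    """Return one text string per page for a PdfReader-like object."""
    # extract_text() may yield nothing for image-only pages, so fall back to ""
    return [(page.extract_text() or "") for page in reader.pages]


def empty_page_numbers(page_texts):
    """Return the 1-based numbers of pages that produced no visible text."""
    return [i for i, text in enumerate(page_texts, start=1) if not text.strip()]


def read_pdf_pages(path):
    """Open a PDF with PyPDF2 and return its per-page text."""
    # Imported here so the two helpers above work with any reader-like object
    from PyPDF2 import PdfReader

    with open(path, "rb") as fh:
        return extract_pages(PdfReader(fh))
```

Calling read_pdf_pages("example.pdf") and passing the result to empty_page_numbers() gives a quick list of pages that probably need OCR rather than plain text extraction.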


📝 Extracting Text Using pdfplumber

1. Introduction to pdfplumber

pdfplumber is another Python library used to extract text from PDFs. Unlike PyPDF2, pdfplumber is designed to handle more complex PDF structures, including tables and multi-column layouts. It’s particularly useful when working with structured data embedded in tables or complicated formats.

2. Extracting Text from a PDF with pdfplumber

Here’s how you can use pdfplumber to extract text:

import pdfplumber

# Open the PDF file
with pdfplumber.open("example.pdf") as pdf:
    text = ""
    for page in pdf.pages:
        # Guard against pages with no extractable text
        text += (page.extract_text() or "") + "\n"

    print(text)

Explanation:

  1. pdfplumber.open("example.pdf"): Opens the PDF file.

  2. pdf.pages: A list of page objects; the for loop visits each page in order.

  3. page.extract_text(): Extracts the text from each page of the PDF.

pdfplumber is often more effective than PyPDF2 at extracting text from PDFs with complex layouts, such as multi-column documents or tables, because it works from the position of each character on the page.
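For multi-column pages, one common approach is to crop the page into regions and extract each region separately, so the columns are read one after the other. The sketch below assumes exactly two columns of equal width, which will not hold for every document. It relies on pdfplumber's page.bbox and page.crop(); the function names extract_two_columns and read_two_column_pdf are hypothetical.

```python
def extract_two_columns(page):
    """Read the left half of a page, then the right half."""
    x0, top, x1, bottom = page.bbox  # page bounds as (x0, top, x1, bottom)
    middle = (x0 + x1) / 2  # assumption: the column gap sits at the center

    left = page.crop((x0, top, middle, bottom))
    right = page.crop((middle, top, x1, bottom))

    # Guard against halves that contain no text at all
    return (left.extract_text() or "") + "\n" + (right.extract_text() or "")


def read_two_column_pdf(path):
    """Apply extract_two_columns to every page of a PDF."""
    # Imported here so extract_two_columns works with any page-like object
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        return "\n".join(extract_two_columns(page) for page in pdf.pages)
```

If the gap between columns is not centered, replace the midpoint with a split position measured for your own document.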


📊 Working with Tables in pdfplumber

If your PDF contains tables, pdfplumber makes it easy to extract table data. Here’s an example of extracting tables from a PDF:

import pdfplumber

with pdfplumber.open("example_with_tables.pdf") as pdf:
    # Extracting tables from the first page
    first_page = pdf.pages[0]
    tables = first_page.extract_tables()

    for table in tables:
        for row in table:
            print(row)

Explanation:

  • first_page.extract_tables(): Extracts all tables from the specified page.

  • table: Each table is a list of rows, where each row is a list of cell values.
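Extracted cells can be None when a cell is empty, and wrapped text can carry line breaks inside a cell. As a small sketch of a cleanup step, the hypothetical helpers below (clean_table and table_to_csv, not pdfplumber functions) normalize the cells and serialize the result with Python's standard csv module.

```python
import csv
import io


def clean_table(table):
    """Replace None cells with "" and collapse whitespace inside each cell."""
    return [
        [" ".join(cell.split()) if cell is not None else "" for cell in row]
        for row in table
    ]


def table_to_csv(table):
    """Serialize a cleaned table to CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(clean_table(table))
    return buffer.getvalue()
```

Each table returned by first_page.extract_tables() can be passed to table_to_csv() and the resulting string written to a file.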


🧠 Key Differences Between PyPDF2 and pdfplumber

| Feature | PyPDF2 | pdfplumber |
| --- | --- | --- |
| Text Extraction | Works well with simple text-based PDFs. | Works better with complex layouts and tables. |
| Tables | No dedicated table extraction. | Extracts tables with extract_tables(); accuracy depends on how clearly the table is laid out. |
| Layout Handling | Struggles with multi-column layouts. | Copes better with multi-column layouts. |
| Image Handling | extract_text() ignores images, so scanned pages need OCR. | Reports the position and size of embedded images via page.images; extract_text() likewise ignores image content. |
| Performance | Faster for simpler PDFs. | Slower for very large or complex PDFs due to detailed parsing. |

🧩 Choosing the Right Library

  • Use PyPDF2 if your PDF contains mostly simple text and you need a lightweight solution.

  • Use pdfplumber if your PDF contains tables or multi-column layouts and you need more accurate extraction.
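If you do not know in advance which kind of PDF you have, one option is to try the lightweight library first and fall back to the heavier one when little or no text comes back. This is only a rough sketch: the min_chars threshold is an arbitrary assumption, the length check says nothing about layout quality, and all three function names are invented for this post.

```python
def extract_with_pypdf2(path):
    """Extract all text with PyPDF2."""
    from PyPDF2 import PdfReader  # lazy import: only needed if this extractor runs

    with open(path, "rb") as fh:
        reader = PdfReader(fh)
        return "\n".join((page.extract_text() or "") for page in reader.pages)


def extract_with_pdfplumber(path):
    """Extract all text with pdfplumber."""
    import pdfplumber  # lazy import: only needed if this extractor runs

    with pdfplumber.open(path) as pdf:
        return "\n".join((page.extract_text() or "") for page in pdf.pages)


def extract_with_fallback(path, extractors, min_chars=20):
    """Return (name, text) from the first extractor that yields enough text."""
    for name, extract in extractors:
        text = extract(path) or ""
        # Crude check: accept the result once it has at least min_chars characters
        if len(text.strip()) >= min_chars:
            return name, text
    return None, ""
```

A call such as extract_with_fallback("example.pdf", [("PyPDF2", extract_with_pypdf2), ("pdfplumber", extract_with_pdfplumber)]) returns the name of the library that produced the text along with the text itself.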


💡 Practical Example: Extracting Data from a PDF Table

Let’s say you have a PDF with a table of sales data. Here’s how you can extract it using pdfplumber:

import pdfplumber
import pandas as pd

# Open the PDF and extract the table
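with pdfplumber.open("sales_data.pdf") as pdf:
    page = pdf.pages[0]  # Extract from the first page
    tables = page.extract_tables()  # All tables found on that page

# Stop early with a clear message if the page has no tables
if not tables:
    raise ValueError("No tables found on the first page")

table = tables[0]  # Use the first table

# Convert the table to a DataFrame
df = pd.DataFrame(table[1:], columns=table[0])  # Use the first row as the column headers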

# Display the DataFrame
print(df)

Explanation:

  • extract_tables(): Returns a list of all tables found on the page; the example uses the first one.

  • pd.DataFrame(): Converts the extracted table into a Pandas DataFrame for easy data manipulation.

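This approach works well when the table has a single header row and the same number of columns in every row. Note that every value arrives as a string (or None for an empty cell), so numeric columns still need converting.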
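Longer reports often spread one table across several pages, repeating the header row each time. The sketch below assumes exactly that: every extracted table starts with the same header row. The helper names table_to_records, merge_tables, and read_all_tables are hypothetical and not part of pdfplumber or pandas.

```python
def table_to_records(table):
    """Turn [header, row, row, ...] into a list of dicts keyed by header."""
    if not table:
        return []
    # Header cells can be None or padded with spaces, so normalize them
    header = [(name or "").strip() for name in table[0]]
    return [dict(zip(header, row)) for row in table[1:]]


def merge_tables(tables):
    """Combine tables that repeat the same header row, e.g. one per page."""
    records = []
    for table in tables:
        records.extend(table_to_records(table))
    return records


def read_all_tables(path):
    """Collect every table from every page of a PDF with pdfplumber."""
    # Imported here so the helpers above work on plain lists of rows
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        return [table for page in pdf.pages for table in page.extract_tables()]
```

Because pd.DataFrame() accepts a list of dictionaries, pd.DataFrame(merge_tables(read_all_tables("sales_data.pdf"))) yields a single DataFrame covering all pages.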


🧠 Final Thoughts

PyPDF2 and pdfplumber are both excellent tools for extracting text from PDFs, but each has its strengths. PyPDF2 is lightweight and great for simple PDFs, while pdfplumber shines when working with complex documents, especially those containing tables or multi-column layouts.

Before choosing a tool, consider the complexity of your PDF. If you're working with structured data like tables or need to preserve the document’s layout, pdfplumber is your best bet.
