📖 OCR with pytesseract: Extracting Text from Images

Optical Character Recognition (OCR) is a technology used to extract text from images, scanned documents, or photos. In Python, the pytesseract library makes it easy to perform OCR tasks using Tesseract, an open-source OCR engine. Whether you’re working with receipts, forms, or any other text-based images, pytesseract is an effective and straightforward tool to recognize and extract text.

In this blog post, we’ll cover:

✅ What is pytesseract?
✅ How to install and set it up
✅ How to extract text from an image using pytesseract
✅ Tips for improving OCR accuracy

🧰 What You’ll Need

Python 3.x
pytesseract library installed (which you can do via pip install pytesseract)
Pillow for loading images (pip installs it automatically as a pytesseract dependency)
Tesseract OCR engine installed
An image with text to extract

📦 Installing pytesseract

To get started with pytesseract, you first need to install it:

1. Install pytesseract

pip install pytesseract

2. Install Tesseract OCR Engine

Tesseract is the engine that performs the OCR operation. You need to install it separately, as pytesseract is just a Python wrapper for Tesseract.

For Windows:

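Download the installer from the official repository: Tesseract Downloads. After installation, make sure the folder containing tesseract.exe is added to your system’s PATH environment variable. If you prefer not to change PATH, you can instead set pytesseract.pytesseract.tesseract_cmd to the full path of tesseract.exe at the top of your script.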

For macOS:

You can install Tesseract via Homebrew:

brew install tesseract

For Linux (Ubuntu/Debian):

Use the following command to install Tesseract:

sudo apt install tesseract-ocr
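
Before moving on, it can help to confirm that the engine is actually reachable from Python. The following is a minimal sketch that uses only the standard library to look for a tesseract executable on PATH and print its version line. The helper names find_tesseract and tesseract_version are illustrative and not part of pytesseract.

```python
import shutil
import subprocess


def find_tesseract():
    """Return the full path of the tesseract executable on PATH, or None."""
    return shutil.which("tesseract")


def tesseract_version(executable):
    """Return the first line printed by `tesseract --version`."""
    result = subprocess.run(
        [executable, "--version"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Some older releases print the version to stderr instead of stdout.
    output = result.stdout or result.stderr
    lines = output.splitlines()
    return lines[0] if lines else ""


path = find_tesseract()
if path is None:
    print("tesseract was not found on PATH; check the installation steps above.")
else:
    print("Found:", path)
    print("Version:", tesseract_version(path))
```

If the script reports that tesseract was not found even though you installed it, the most likely cause is that its folder is missing from PATH.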

📷 Using pytesseract for OCR

Once you have everything set up, you can start extracting text from images. Let’s take a look at how to use pytesseract for OCR.

1. Basic Example of Text Extraction

Here’s a simple example to extract text from an image using pytesseract:

import pytesseract
from PIL import Image

# Open an image file
img = Image.open("example_image.png")

# Use pytesseract to extract text
extracted_text = pytesseract.image_to_string(img)

# Print the extracted text
print(extracted_text)

Explanation:

Image.open(): Opens the image file using Pillow.
pytesseract.image_to_string(): Extracts the text from the image using the Tesseract OCR engine.

This will print out any text detected in the image.

🛠️ Preprocessing the Image for Better OCR Accuracy

OCR accuracy depends heavily on the quality of the image. To improve the text extraction results, you can preprocess the image before passing it to Tesseract.

1. Grayscale Conversion

Converting the image to grayscale removes color information that OCR does not need, and it is a prerequisite for thresholding.

img = img.convert('L')  # Convert the image to grayscale

2. Thresholding

You can use thresholding to convert the grayscale image into a binary image (pure black and white). This improves the contrast between the text and the background. The example below uses OpenCV and NumPy (pip install opencv-python numpy).

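import cv2
import numpy as np
import pytesseract
from PIL import Image

# Load the image as grayscale and convert it to a NumPy array
img = Image.open("example_image.png").convert("L")
img_np = np.array(img)

# Apply binary thresholding: pixels above 150 become white (255), the rest black (0)
_, img_bin = cv2.threshold(img_np, 150, 255, cv2.THRESH_BINARY)

# Convert back to a Pillow Image object
img_bin = Image.fromarray(img_bin)

# Extract text from the preprocessed image
extracted_text = pytesseract.image_to_string(img_bin)
print(extracted_text)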

3. Denoising the Image

If the image contains noise or irregularities, applying a denoising filter can improve OCR accuracy.

from PIL import ImageFilter
img = img.filter(ImageFilter.MedianFilter(3))  # Apply a 3x3 median filter for noise reduction
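
To see what the median filter actually does, here is a small self-contained sketch using only Pillow. It builds a synthetic white image with a few isolated black pixels (a stand-in for salt-and-pepper noise, not real scan data) and shows that the filter removes them.

```python
from PIL import Image, ImageFilter

# Synthetic grayscale image: white background with three isolated black
# pixels standing in for salt-and-pepper noise.
noisy = Image.new("L", (20, 20), color=255)
for xy in [(5, 5), (10, 12), (15, 7)]:
    noisy.putpixel(xy, 0)

# Each pixel is replaced by the median of its 3x3 neighbourhood, so a lone
# dark pixel surrounded by white ones disappears.
cleaned = noisy.filter(ImageFilter.MedianFilter(size=3))

print("before:", noisy.getpixel((5, 5)), "after:", cleaned.getpixel((5, 5)))
```

Keep the window small for text images: a larger median window can also erase thin strokes and punctuation, so it is worth comparing OCR output with and without the filter.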

📝 Working with Different Languages

Tesseract supports multiple languages. To perform OCR in a different language (such as Spanish or French), you need to install the relevant language package for Tesseract.

1. Installing Additional Language Packs

How you install additional language packs depends on your operating system:

For Windows:

Select the additional languages you need in the installer. Alternatively, download the .traineddata files for those languages from Tesseract Languages and copy them into the tessdata folder inside the Tesseract installation directory.

For Linux (Ubuntu/Debian):

sudo apt install tesseract-ocr-spa  # For Spanish
sudo apt install tesseract-ocr-fra  # For French

For macOS:

brew install tesseract-lang  # Install multiple languages via Homebrew

2. Using the Language Parameter in pytesseract

Once you have the language installed, you can specify it when calling image_to_string():

extracted_text = pytesseract.image_to_string(img, lang='spa')  # For Spanish
print(extracted_text)

You can replace 'spa' with any other language code (e.g., 'eng' for English, 'fra' for French).
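
Tesseract also accepts several language codes joined with a plus sign (for example 'eng+spa') for images that mix languages. The sketch below, which uses only the standard library, lists the languages your local Tesseract reports and builds such a combined code. The helper names installed_languages and lang_argument are illustrative; the sketch assumes the first line printed by tesseract --list-langs is a header.

```python
import shutil
import subprocess


def installed_languages():
    """Return the language codes reported by `tesseract --list-langs`."""
    executable = shutil.which("tesseract")
    if executable is None:
        return []
    result = subprocess.run(
        [executable, "--list-langs"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    lines = (result.stdout or result.stderr).splitlines()
    # Skip the header line and keep one language code per remaining line.
    return [line.strip() for line in lines[1:] if line.strip()]


def lang_argument(codes):
    """Join language codes into the form Tesseract expects, e.g. 'eng+spa'."""
    return "+".join(codes)


print("Installed languages:", installed_languages())
print("Combined lang value:", lang_argument(["eng", "spa"]))
```

The combined string can be passed directly as the lang argument of image_to_string(), provided every language in it is installed.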

⚙️ Advanced: Working with OCR Output

Tesseract can return more than just the raw text. You can also extract additional information, such as the layout of the image, the bounding boxes around the text, and more.

1. Extracting Text with Bounding Boxes

You can use image_to_boxes() to get the bounding box for each character:

boxes = pytesseract.image_to_boxes(img)
print(boxes)

This will print the bounding boxes for each character detected in the image.
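
The string returned by image_to_boxes() has one line per character in the form "character left bottom right top page", with coordinates measured from the bottom-left corner of the image. As a sketch, the following parser turns that string into named tuples and converts a box to the top-left origin that Pillow uses. The names CharBox, parse_boxes, and to_top_left are hypothetical, and the sample string is made-up data rather than real OCR output.

```python
from typing import NamedTuple


class CharBox(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_boxes(boxes_text):
    """Parse image_to_boxes() output into a list of CharBox tuples."""
    boxes = []
    for line in boxes_text.splitlines():
        parts = line.split()
        if len(parts) != 6:
            continue  # Skip blank or malformed lines
        char, *numbers = parts
        boxes.append(CharBox(char, *map(int, numbers)))
    return boxes


def to_top_left(box, image_height):
    """Convert a box to (left, upper, right, lower) with a top-left origin."""
    return (box.left, image_height - box.top, box.right, image_height - box.bottom)


# Made-up sample in the same format as image_to_boxes() output
sample = "H 10 70 22 90 0\ni 25 70 29 90 0"

for box in parse_boxes(sample):
    print(box.char, to_top_left(box, image_height=100))
```

The converted tuples can be passed to Pillow's Image.crop() or used to draw rectangles over the original image.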

2. Extracting Text with Page Segmentation Mode

Tesseract allows you to specify the page segmentation mode (PSM), which can affect how the text is detected (e.g., single word, sparse text, etc.).

custom_config = r'--psm 6'  # Treat the image as a single block of text
extracted_text = pytesseract.image_to_string(img, config=custom_config)
print(extracted_text)

You can experiment with different PSM values (ranging from 0 to 13) depending on your image's structure.
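
As a quick reference, here is a small sketch that maps a few commonly used PSM values to their meaning and builds the matching config string. The descriptions paraphrase Tesseract's own help text; the names PSM_MODES and psm_config are illustrative helpers, not part of pytesseract.

```python
# A few commonly used page segmentation modes (not the full list)
PSM_MODES = {
    3: "fully automatic page segmentation, no orientation detection (default)",
    6: "a single uniform block of text",
    7: "a single text line",
    8: "a single word",
    11: "sparse text, in no particular order",
}


def psm_config(mode):
    """Return the config string for a page segmentation mode (0 to 13)."""
    if not 0 <= mode <= 13:
        raise ValueError("PSM must be between 0 and 13")
    return f"--psm {mode}"


for mode, description in PSM_MODES.items():
    print(psm_config(mode), "->", description)
```

The returned string is what you pass as the config argument, for example pytesseract.image_to_string(img, config=psm_config(7)) for an image that contains one line of text.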

🧠 Final Thoughts

pytesseract is a powerful tool for extracting text from images using OCR, and it can be further enhanced with image preprocessing techniques. It’s widely used for tasks such as digitizing scanned documents, extracting data from forms or receipts, and processing images for machine learning applications.

💡 Use Cases:

Scanned Document OCR: Convert scanned paper documents into editable text.
Automated Data Entry: Extract text from invoices, receipts, or forms for automatic data entry.
Image-to-Text Conversion: Extract text from screenshots, photos, or other images with text.

deltagradient