📖 OCR with pytesseract: Extracting Text from Images
Optical Character Recognition (OCR) is a technology used to extract text from images, scanned documents, or photos. In Python, the pytesseract library makes it easy to perform OCR tasks using Tesseract, an open-source OCR engine. Whether you’re working with receipts, forms, or any other text-based images, pytesseract is an effective and straightforward tool to recognize and extract text.
In this blog post, we’ll cover:
✅ What is pytesseract?
✅ How to install and set it up
✅ How to extract text from an image using pytesseract
✅ Tips for improving OCR accuracy
🧰 What You’ll Need
- Python 3.x
- pytesseract library installed (which you can do via pip install pytesseract)
- Tesseract OCR engine installed
- An image with text to extract
📦 Installing pytesseract
To get started with pytesseract, you first need to install it:
1. Install pytesseract
pip install pytesseract
2. Install Tesseract OCR Engine
Tesseract is the engine that performs the OCR operation. You need to install it separately, as pytesseract is just a Python wrapper for Tesseract.
For Windows:
Download the installer from the official repository: Tesseract Downloads. After installation, make sure the folder containing tesseract.exe is added to your system’s PATH environment variable.
For macOS:
You can install Tesseract via Homebrew:
brew install tesseract
For Linux (Ubuntu/Debian):
Use the following command to install Tesseract:
sudo apt install tesseract-ocr
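Before moving on, it can help to confirm that Python can actually see the engine. The snippet below is a minimal sketch using only the standard library; find_tesseract is a hypothetical helper name, not part of pytesseract, and it assumes the executable is called tesseract.

```python
import shutil
from typing import Optional


def find_tesseract(binary_name: str = "tesseract") -> Optional[str]:
    """Return the full path of the Tesseract executable if it is on PATH, else None."""
    return shutil.which(binary_name)


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("Tesseract was not found on PATH; check the installation steps above.")
    else:
        print(f"Tesseract found at: {path}")
```

If the executable is installed but not on PATH, pytesseract also lets you point to it directly by assigning the full path of the executable to pytesseract.pytesseract.tesseract_cmd.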
📷 Using pytesseract for OCR
Once you have everything set up, you can start extracting text from images. Let’s take a look at how to use pytesseract for OCR.
1. Basic Example of Text Extraction
Here’s a simple example to extract text from an image using pytesseract:
import pytesseract
from PIL import Image
# Open an image file
img = Image.open("example_image.png")
# Use pytesseract to extract text
extracted_text = pytesseract.image_to_string(img)
# Print the extracted text
print(extracted_text)
Explanation:
- Image.open(): Opens the image file using Pillow.
- pytesseract.image_to_string(): Extracts the text from the image using the Tesseract OCR engine.
This will print out any text detected in the image.
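Raw OCR output often contains blank lines and stray whitespace. As a rough sketch, the hypothetical helper below (clean_ocr_text is not part of pytesseract) tidies a string like the one returned by image_to_string(). The sample text is made up for illustration.

```python
def clean_ocr_text(text: str) -> str:
    """Strip surrounding whitespace from each line and drop empty lines."""
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


# Made-up sample resembling raw OCR output
raw_text = "  Invoice #123 \n\n\nTotal: 42.00  \n\x0c"
print(clean_ocr_text(raw_text))
```

In practice you would pass the result of pytesseract.image_to_string(img) instead of the sample string.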
🛠️ Preprocessing the Image for Better OCR Accuracy
OCR accuracy depends a lot on the quality of the image. To improve the text extraction results, you can preprocess the image before passing it to Tesseract.
1. Grayscale Conversion
Converting the image to grayscale removes color information that Tesseract does not need, which often makes the text easier to detect.
img = img.convert('L') # Convert the image to grayscale
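To see what the conversion does to a single pixel, here is a small sketch of the ITU-R 601-2 luma formula that Pillow documents for mode 'L'. The function name rgb_to_gray is hypothetical, and Pillow's internal rounding may differ slightly from this integer version.

```python
def rgb_to_gray(r: int, g: int, b: int) -> int:
    """Approximate the grayscale value of one RGB pixel (ITU-R 601-2 weights)."""
    return (299 * r + 587 * g + 114 * b) // 1000


# Green contributes the most to perceived brightness, blue the least
print(rgb_to_gray(255, 255, 255))  # white
print(rgb_to_gray(0, 255, 0))      # pure green
print(rgb_to_gray(0, 0, 255))      # pure blue
```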
2. Thresholding
You can use thresholding to convert the image into a binary image (black and white). This improves the contrast between the text and the background.
import cv2
import numpy as np
# Convert the image to a single-channel (grayscale) numpy array so the result is truly black and white
img_np = np.array(img.convert('L'))
# Apply binary thresholding
_, img_bin = cv2.threshold(img_np, 150, 255, cv2.THRESH_BINARY)
# Convert back to Image object
img_bin = Image.fromarray(img_bin)
# Extract text from the preprocessed image
extracted_text = pytesseract.image_to_string(img_bin)
print(extracted_text)
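If it is unclear what the threshold value of 150 does, the following pure-Python sketch mimics the per-pixel rule of cv2.THRESH_BINARY on a tiny made-up grid: pixels strictly above the threshold become 255, everything else becomes 0. The function binary_threshold is a hypothetical stand-in for illustration, not the OpenCV implementation.

```python
def binary_threshold(pixels, thresh=150, max_value=255):
    """Map each grayscale pixel to max_value if it is above thresh, else 0."""
    return [[max_value if p > thresh else 0 for p in row] for row in pixels]


# Made-up 2x3 grayscale "image"
sample = [
    [12, 149, 150],
    [151, 200, 255],
]

for row in binary_threshold(sample):
    print(row)
```

A pixel equal to the threshold (150 here) ends up black, so the right value depends on how dark the text is relative to the background.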
3. Denoising the Image
If the image contains noise or irregularities, applying a denoising filter can improve OCR accuracy.
from PIL import ImageFilter
img = img.filter(ImageFilter.MedianFilter(3)) # Apply a 3x3 median filter for noise reduction
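To illustrate why a median filter removes isolated specks, here is a simplified pure-Python sketch of a 3x3 median filter on a small made-up grid. The function median_filter_3x3 is hypothetical, and it leaves border pixels unchanged for simplicity, so its edge handling is not guaranteed to match Pillow's.

```python
from statistics import median


def median_filter_3x3(pixels):
    """Replace each interior pixel with the median of its 3x3 neighborhood."""
    height, width = len(pixels), len(pixels[0])
    out = [row[:] for row in pixels]  # copy so the input stays untouched
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            window = [
                pixels[y + dy][x + dx]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
            ]
            out[y][x] = median(window)
    return out


# A flat gray patch with a single bright noise pixel in the middle
noisy = [
    [10, 10, 10],
    [10, 255, 10],
    [10, 10, 10],
]
print(median_filter_3x3(noisy))
```

The bright outlier disappears because it is never the middle value of its neighborhood, while larger shapes such as letter strokes mostly survive.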
📝 Working with Different Languages
Tesseract supports multiple languages. To perform OCR in a different language (such as Spanish or French), you need to install the relevant language package for Tesseract.
1. Installing Additional Language Packs
To install additional language packs, you can use the following commands:
For Windows:
Select additional languages in the installer, or download the .traineddata files from Tesseract Languages and place them in the tessdata folder of the Tesseract installation directory.
For Linux (Ubuntu/Debian):
sudo apt install tesseract-ocr-spa # For Spanish
sudo apt install tesseract-ocr-fra # For French
For macOS:
brew install tesseract-lang # Install multiple languages via Homebrew
2. Using the Language Parameter in pytesseract
Once you have the language installed, you can specify it when calling image_to_string():
extracted_text = pytesseract.image_to_string(img, lang='spa') # For Spanish
print(extracted_text)
You can replace 'spa' with any other language code (e.g., 'eng' for English, 'fra' for French).
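Tesseract also accepts several languages at once when their codes are joined with a plus sign, which is handy for mixed-language documents. The sketch below builds such a string; build_lang_arg is a hypothetical helper, and it assumes every listed language pack is already installed.

```python
def build_lang_arg(codes):
    """Join Tesseract language codes into one lang string, e.g. 'eng+spa'."""
    cleaned = [code.strip() for code in codes if code.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)


print(build_lang_arg(["eng", "spa"]))
```

The resulting string can be passed as the lang argument, for example pytesseract.image_to_string(img, lang=build_lang_arg(["eng", "spa"])).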
⚙️ Advanced: Working with OCR Output
Tesseract can return more than just the raw text. You can also extract additional information, such as the layout of the image, the bounding boxes around the text, and more.
1. Extracting Text with Bounding Boxes
You can use image_to_boxes() to get the bounding box for each character:
boxes = pytesseract.image_to_boxes(img)
print(boxes)
This will print one line per detected character in the form: character, left, bottom, right, top, page. The coordinates are measured from the bottom-left corner of the image.
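Since image_to_boxes() returns plain text, you will usually want to parse it before doing anything with the coordinates. Here is a minimal sketch; CharBox and parse_boxes are hypothetical names, and the sample string is made up to resemble the real output.

```python
from typing import List, NamedTuple


class CharBox(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_boxes(raw: str) -> List[CharBox]:
    """Parse image_to_boxes() style text into a list of CharBox tuples."""
    boxes = []
    for line in raw.splitlines():
        parts = line.rsplit(" ", 5)  # character first, then five integers
        if len(parts) != 6:
            continue  # skip empty or incomplete lines
        char, *numbers = parts
        left, bottom, right, top, page = (int(n) for n in numbers)
        boxes.append(CharBox(char, left, bottom, right, top, page))
    return boxes


# Made-up sample in the "char left bottom right top page" layout
sample_output = "H 10 20 30 60 0\ni 34 20 40 58 0"
for box in parse_boxes(sample_output):
    print(box.char, "width:", box.right - box.left, "height:", box.top - box.bottom)
```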
2. Extracting Text with Page Segmentation Mode
Tesseract allows you to specify the page segmentation mode (PSM), which can affect how the text is detected (e.g., single word, sparse text, etc.).
custom_config = r'--psm 6' # Treat the image as a single block of text
extracted_text = pytesseract.image_to_string(img, config=custom_config)
print(extracted_text)
You can experiment with different PSM values (ranging from 0 to 13) depending on your image's structure.
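To make experimenting a little easier, the sketch below lists a few commonly used modes and builds the config string for you. PSM_MODES and psm_config are hypothetical names; the descriptions paraphrase Tesseract's own help text.

```python
# A few commonly used page segmentation modes (not the full list)
PSM_MODES = {
    3: "Fully automatic page segmentation, no orientation detection (default)",
    6: "Single uniform block of text",
    7: "Single text line",
    8: "Single word",
    11: "Sparse text, in no particular order",
}


def psm_config(mode: int) -> str:
    """Return a Tesseract config string for the given page segmentation mode."""
    if not 0 <= mode <= 13:
        raise ValueError(f"PSM must be between 0 and 13, got {mode}")
    return f"--psm {mode}"


for mode, description in PSM_MODES.items():
    print(psm_config(mode), "->", description)
```

You could then loop over a few modes, pass each string as config to image_to_string(), and compare the results on a sample image.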
🧠 Final Thoughts
pytesseract is a powerful tool for extracting text from images using OCR, and it can be further enhanced with image preprocessing techniques. It’s widely used for tasks such as digitizing scanned documents, extracting data from forms or receipts, and processing images for machine learning applications.
💡 Use Cases:
- Scanned Document OCR: Convert scanned paper documents into editable text.
- Automated Data Entry: Extract text from invoices, receipts, or forms for automatic data entry.
- Image-to-Text Conversion: Extract text from screenshots, photos, or other images with text.