Welcome to Ed2Ti Blog.

My journey on IT is here.

PDF OCR with Python

- Posted in Programing - Python by

OCR (Optical Character Recognition) is a process of converting scanned images, PDFs, or other documents into editable text. In Python, there are several libraries available for OCR, including PyOCR, Tesseract OCR, and OCRopus. In this answer, we will use PyOCR and Pillow libraries to perform OCR on a PDF file.

First, we need to install the required libraries. You can install PyOCR and Pillow using pip:

pip install pyocr pillow

Next, we will write the code to perform OCR on the PDF file. Here is a sample code:

import io
import sys
import pyocr
import pyocr.builders
from PIL import Image
from pdf2image import convert_from_path

# Path of the PDF file
pdf_path = 'example.pdf'

# Convert PDF to PIL Image objects
pages = convert_from_path(pdf_path)

# OCR
tool = pyocr.get_available_tools()[0]
lang = tool.get_available_languages()[0]

for page in pages:
    txt = tool.image_to_string(
        Image.fromarray(page),
        lang=lang,
        builder=pyocr.builders.TextBuilder()
    )
    print(txt)

In this code, we first convert the PDF file to PIL Image objects using the pdf2image library. Then, we loop through each page of the PDF and perform OCR using the PyOCR library. Finally, we print the extracted text from each page.

Note that the OCR accuracy depends on the quality of the scanned image, the language of the text, and the font used in the document. Therefore, you may need to experiment with different OCR engines, languages, and settings to get the best results for your specific use case.