Comparing AI PDF Parsers

By: Brian McGough 2024-06-03

Alternate title: Peter Piper Picked a Python PDF Parser

Intro

When programmatically analyzing PDF documents, the first step generally is parsing the PDF itself. If you’ve ever experimented with PDF parsing packages, you probably noticed a significant disparity in the text produced by different packages for a single given PDF. This article and the provided Colab notebook should help understand why.

About PDFs

PDFs are complex documents that can contain a variety of elements, such as text, images, and interactive elements. At a technical level, PDFs are constructed using a language called PostScript, which describes the layout and graphics of the document. This complexity necessitates different approaches for extracting data effectively, depending on the document’s content.

Summary of findings

Parsing Packages: PDF parsing packages, like PDFMiner.six, PyPDF, and others, are highly effective for extracting text and structured data directly from the PostScript contents of a PDF. These tools are designed to interpret the underlying structure of the PDF, making them reliable for well-formatted, text-heavy documents. However, they cannot interpret or extract text from images embedded within the PDF. They also struggle with unstructured data and complex layouts, which can result in missed or jumbled text.

OCR Tools: Optical Character Recognition (OCR) tools, such as Pytesseract, Surya, and EasyOCR, are effective in scenarios where the PDF contains scanned images or photographs of text. These tools analyze the images to recognize and convert text into a machine-readable format. OCR is invaluable for digitizing printed documents and handling cases where text is not accessible to traditional parsers. However, OCR can be less accurate with poor image quality and typically has a longer processing time compared to parsing packages.

AI Tools: AI tools like ChatGPT-4 Vision use machine learning models to understand and extract data from both text and images. These tools can handle highly unstructured data, recognize complex patterns, and interpret visual elements within the PDF. While AI tools offer the most comprehensive extraction capabilities, they are generally more resource-intensive, less deterministic, and come with higher costs. They also require access to computational resources and may have varying accuracy depending on the specific content and layout of the PDF.

In the end, the choice between parsing packages, OCR, and AI tools depends on the specific requirements of the PDF documents you are working with. Parsing packages are ideal for structured, text-based PDFs, OCR tools are best for image-based text extraction, and AI tools can be useful for highly unstructured and complex documents. Combining these approaches can often yield the most reliable results.

Instructions for running notebook

Access the provided Colab link.
Import a PDF that is representative of your use case to the Colab file system.
If you want to test ChatGPT Vision and/or LlamaParse, make sure you have an OPENAI_API_KEY and LLAMAPARSE_API_KEY added to your Colab secrets (instructions here)
Update the pdf_url variable to point to your PDF
Run all cells
Compare the printout of each package with what you would expect to see returned
Profit

Comparison of tools for different PDF types

Preface: these results are general, subjective, and not meant to be comprehensive. If you are planning on creating a production system using one of these tools, I would suggest that you run sample PDFs that your application will be consuming through the provided notebook to determine what tool is most appropriate for your use case.

PDF #1: “Get Started with Smallpdf” (Basic)

Parsing Libraries:

PyPDF: 6/10 — Captures all text, but does not organize logically
PDFMiner: 10/10 —Captures all text, organizes logically
Tabula: 0/10 — Returns an empty list
PDFQuery: 10/10 — Captures all text, organizes logically
PyMuPDF: 6/10 — Captures all text, unable to organize logically
LlamaParse: 10/10 — Captures all text, organizes logically, even represents text placement well

OCR:

Pytesseract: 6/10 — Recognizes most characters well, but gets confused by some of the images behind text, has difficulty separating side by side elements. Decent run time.
Surya: 8/10 — Slow run time compared to Pytesseract/EasyOCR, good results otherwise.
Doctr: 5/10 — Seems to have missed a lot of text, maybe I did something wrong.
EasyOCR: 9/10 — Similar to Pytesseract results.

LLMs:

GPT-4: 10/10 — Captured all characters, divided content logically
GPT-4 Vision: 10/10 — Captured all characters, divided content appropriately.

PDF #2 Scientific Paper (Highlighted text and side-by-side sections)

Parsing Libraries:

PyPDF: 6/10 — Recognized characters well, but was not able to logically separate sections.
PDFMiner: 8/10 — Recognized characters well, separated sections in a way reflective of the document itself.
Tabula: 3/10 — Recognized characters well, but returned a jumble of disorganized text.
PDFQuery: 6/10 — Almost exact same return value as PyPDF.
PyMuPDF: 6/10 — Almost the same as PyPDF & PDFQuery.
LlamaParse: 9/10 — Recognized characters well, separated sections effectively & logically.

OCR:

Pytesseract: 6/10 — Accurate character recognition, did not recognize highlighted text.
Surya: 4/10 — Accurate character recognition, jumbles side-by-side text to the point of being unintelligible.
EasyOCR: 3/10 — Accurate character recognition, jumbles side-by-side text to the point of being unintelligible.
Doctr: 6/10 — Accurate character recognition, had difficulty with logical breaks.

LLMs:

GPT-4: 4/10 — Did well on the characters it did analyze, but left out a large chunk of the document.
GPT-4 Vision: 5/10 — Did well on characters it analyzed, and captured more than GPT-4, but also left out a large chunk of the document. Upon running multiple times, got very different responses.

PDF #3 Complicated Physical Mailer Image (Converted to PDF)

Parsing Libraries:

PyPDF: 0/10 — Recognizes no characters.
PDFMiner: 0/10 — Recognizes no characters.
Tabula: 0/10 — Recognizes no characters.
PDFQuery: 0/10 — Recognizes no characters.
PyMuPDF: 0/10 — Recognizes no characters.
LlamaParse: 5/10 — Recognizes some characters, but jumbles them up.

OCR:

Pytesseract: 0/10 — Printed gibberish.
Surya: 2/10 — Printed some correct characters, mostly wrong.
EasyOCR: 0/10 — Printed a long string of numbers.
Doctr: 0/10 — Printed gibberish.

LLMs:

GPT-4: 5/10 — Printed accurate text for part of the image, left the rest out.
GPT-4 Vision: 6/10 — Captured more text than GPT-4.

Tools Compared:

PDF Parsing and Extraction Libraries:

PDFMiner.six (GitHub)
Tabula (PyPI)
PDFQuery (PyPI)
PyMuPDF (PyPI)
PyPDF (PyPI)- In case you’re considering PyPDF2 or similar, I’d suggest reading this comparison.
LlamaParse (PyPI)

OCR:

Pytesseract (PyPI)
Surya (PyPI)
Doctr (PyPI)
EasyOCR (PyPI)

AI:

ChatGPT-4 Turbo/ChatGPT-4 Vision (OpenAI)

Google Colab notebook for tool comparison

About the Author: Brian McGough | AI Consultant and Software Engineer | World Traveler | Brian is available for AI engineering projects, and am especially interested in building RAG solutions.

Original Article on Medium