From PDF to Data: A Practical Guide to marker and LLMs

Getting structured, usable data out of PDF files is a notoriously difficult task. Whether you’re dealing with scanned invoices, academic papers, or cluttered reports, the process is often manual and painful. Enter marker, a powerful open-source tool that changes the game.

In this guide, we’ll walk through what marker is, how to supercharge it with Large Language Models like Gemini and Claude, and how to build a cost-effective pipeline for intelligent data extraction.

What is marker? The Fast, Local Converter

At its core, marker is a command-line tool that converts PDFs to clean, readable Markdown. Unlike simple text scrapers, it uses a pipeline of deep learning models to understand a document’s layout, handle columns, remove headers/footers, and format everything from tables to code blocks correctly.

The best part? Its default mode runs entirely on your local machine (CPU or GPU), making it incredibly fast and completely free.

Installation

Install marker via pip:

```bash
pip install marker-pdf
```

GPU acceleration (recommended for large batches) comes through PyTorch: if a CUDA-enabled build of PyTorch is installed, marker picks up the GPU automatically, and you can force a specific device with the TORCH_DEVICE environment variable:

```bash
export TORCH_DEVICE=cuda
```

Basic Usage

Convert a single PDF with the marker_single command:

```bash
marker_single path/to/your/document.pdf --output_dir output/
```

To batch-convert an entire folder of PDFs, use the marker command with a worker count:

```bash
marker path/to/pdf_folder/ --output_dir output/ --workers 2
```

For many use cases, the default marker is all you need. It’s excellent for turning a PDF into a single text file you can easily read, search, or archive.
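
If you would rather drive marker from Python than the shell, the package also exposes a converter API. Here is a minimal sketch based on marker's documented v1 interface; the API has changed between releases, so check the README for the version you have installed:

```python
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered

# Load the layout/OCR models once, then reuse the converter across files.
converter = PdfConverter(artifact_dict=create_model_dict())

rendered = converter("path/to/your/document.pdf")
markdown, _, images = text_from_rendered(rendered)

with open("output.md", "w") as f:
    f.write(markdown)
```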

Supercharging Accuracy with LLMs

While the local models are fast, they can struggle with highly complex documents—especially those with intricate nested tables or handwritten notes. For these cases, marker can connect to an external LLM for a significant accuracy boost.

The tool defaults to using Google’s Gemini. Connecting it is simple:

  1. Get an API Key: Obtain an API key from Google AI Studio.
  2. Set the Environment Variable (recent marker releases read GOOGLE_API_KEY for the default Gemini service):

     ```bash
     export GOOGLE_API_KEY="YOUR_API_KEY_HERE"
     ```

  3. Run marker with the LLM flag:

     ```bash
     marker_single path/to/your/document.pdf --output_dir output/ --use_llm
     ```

This process is significantly slower and has an associated cost, as it sends your document to the LLM’s servers for processing. The trade-off is clear: speed and cost vs. maximum accuracy.

Beyond Conversion: The Smart Extraction Workflow

What if you don’t care about the full document? What if you just need to extract a few key pieces of information, like an invoice number or a mailing address?

Using an LLM to convert the entire document just to extract a tiny piece of data is inefficient and expensive. A much smarter workflow combines local processing with targeted API calls.

The Cost-Effective Method

  1. Process Locally (Free): Use marker's fast, local mode to convert the PDF into Markdown.

     ```bash
     marker_single document.pdf --output_dir output/
     ```

  2. Isolate Data Locally (Free): In a simple script, read the generated markdown and perform a basic keyword search, keeping the text surrounding "Invoice Number" or "Total Amount".

  3. Extract with API (Cheap): Send only this small, relevant snippet to an LLM for precise extraction.

Example: Invoice Data Extraction with Claude

Here’s a practical Python example tying the three steps together with Claude (it assumes the anthropic SDK is installed and ANTHROPIC_API_KEY is set):

```python
import re
import subprocess
from pathlib import Path

import anthropic


def extract_invoice_data(pdf_path: str) -> str:
    # Step 1: Convert the PDF to markdown locally (free).
    # Recent marker releases write <output_dir>/<stem>/<stem>.md;
    # adjust the path below if your version lays files out differently.
    output_dir = Path("marker_out")
    subprocess.run(
        ["marker_single", pdf_path, "--output_dir", str(output_dir)],
        check=True,
    )
    stem = Path(pdf_path).stem
    markdown_content = (output_dir / stem / f"{stem}.md").read_text()

    # Step 2: Find relevant sections (free).
    # Capture the context around common invoice keywords; re.DOTALL
    # lets each match run across line breaks in the markdown.
    patterns = [
        r"(?i)(invoice.{0,500})",
        r"(?i)(total.{0,200})",
        r"(?i)(bill to.{0,300})",
    ]

    relevant_text = ""
    for pattern in patterns:
        matches = re.findall(pattern, markdown_content, flags=re.DOTALL)
        relevant_text += "\n".join(matches) + "\n"

    # Step 3: Extract structured data with Claude (cheap).
    # anthropic.Anthropic() reads ANTHROPIC_API_KEY from the environment.
    client = anthropic.Anthropic()

    message = client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"""Extract the following from this invoice text:
- Invoice number
- Date
- Total amount
- Vendor name

Return as JSON. Text:
{relevant_text[:2000]}""",
        }],
    )

    # Returns the model's JSON-formatted reply as a string;
    # json.loads() it downstream if you need a dict.
    return message.content[0].text


# Usage
data = extract_invoice_data("invoice.pdf")
print(data)
```

This approach minimizes the data sent to the API, dramatically reducing latency and cost while still leveraging the power of LLMs for precise extraction.

Choosing Your Extraction Engine: A Cost Comparison

When it comes to the final extraction step, you have many LLMs to choose from. For tasks with large inputs (document text) and small outputs (extracted data), input token price is the most critical cost factor.
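
To make that concrete, here is a back-of-the-envelope comparison (the token counts are illustrative assumptions; the rate is Claude 3.5 Haiku's input price from the table below):

```python
# Input-cost comparison: targeted snippets vs. sending whole documents.
PRICE_PER_MTOK = 0.80        # USD per 1M input tokens (Claude 3.5 Haiku)
SNIPPET_TOKENS = 2_000       # a keyword-matched excerpt
FULL_DOC_TOKENS = 50_000     # a long report sent wholesale

def input_cost(tokens_per_doc: int, docs: int = 10_000) -> float:
    """Total input cost in USD for processing `docs` documents."""
    return tokens_per_doc / 1_000_000 * PRICE_PER_MTOK * docs

print(f"snippets:  ${input_cost(SNIPPET_TOKENS):,.2f}")    # $16.00
print(f"full docs: ${input_cost(FULL_DOC_TOKENS):,.2f}")   # $400.00
```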

Here’s a comparison of popular, cost-effective models:

| Provider  | Model Name       | Input Price (per 1M tokens) | Best For |
|-----------|------------------|-----------------------------|----------|
| Google    | Gemini 2.0 Flash | $0.10                       | Cheapest option with good performance |
| OpenAI    | GPT-4o mini      | $0.15                       | Budget option from OpenAI |
| Anthropic | Claude 3.5 Haiku | $0.80                       | Best balance of cost and capability; excellent for extraction tasks |
| OpenAI    | GPT-4o           | $2.50                       | Top-tier performance, higher cost |

Prices are approximate and subject to change. Always check official pricing pages.

For pure cost-effectiveness, Google’s Gemini Flash models offer the lowest prices. For the best balance of cost and capability, Claude 3.5 Haiku is excellent for extraction tasks where you need reliable structured output.
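
If you go with Gemini Flash, the extraction call looks much like the Claude example above. A minimal sketch using Google’s google-genai SDK (pip install google-genai; it reads GOOGLE_API_KEY from the environment, and relevant_text stands in for your keyword-matched snippet):

```python
from google import genai

client = genai.Client()  # picks up GOOGLE_API_KEY automatically

relevant_text = "..."  # the snippet isolated in step 2

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=(
        "Extract the invoice number, date, total amount, and vendor "
        f"name as JSON from this text:\n{relevant_text}"
    ),
)
print(response.text)
```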

Running Models Locally

You can also explore open-source models by running them yourself with tools like Ollama, which gives you ultimate privacy and zero API costs:

```bash
# Install Ollama (macOS; see ollama.com for other platforms)
brew install ollama

# Pull a capable model
ollama pull llama3.2

# Use in your pipeline
ollama run llama3.2 "Extract the invoice number from: ..."
```

Local models work well for simple extraction tasks and offer unlimited usage at the cost of your own compute.
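
For scripted pipelines, you can call the local Ollama server over its REST API instead of shelling out. A minimal sketch (assumes ollama serve is running on its default port and llama3.2 has been pulled; the invoice text is a stand-in for your own snippet):

```python
import requests

# One-shot generation against the local Ollama REST API.
snippet = "Invoice Number: INV-2024-001\nTotal Amount: $1,250.00"

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2",
        "prompt": f"Extract the invoice number and total as JSON:\n{snippet}",
        "stream": False,  # return a single complete response
    },
    timeout=120,
)
print(response.json()["response"])
```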

When to Use Each Approach

| Scenario | Recommended Approach |
|----------|----------------------|
| Simple text extraction | marker, local only |
| Complex tables, handwriting | marker with --use_llm |
| Batch processing 100s of docs | Local marker + targeted API extraction |
| Maximum accuracy needed | marker with --use_llm + Claude for extraction |
| Privacy-sensitive documents | Local marker + Ollama |

Conclusion

By combining the high-speed local processing of marker with the precision of targeted LLM calls, you can build a powerful, flexible, and surprisingly cheap document processing pipeline. Start with the fastest, free tool for the job, and only bring in the powerful (and costly) models when and where you need them most.

The key insight: don’t send entire documents to LLMs when a few hundred tokens of context will do. Your wallet will thank you.

This post is licensed under CC BY 4.0 by the author.