Ever opened a PDF and wished it could turn itself into clean, usable data? This article compares five leading document parsers (Gemini 2.5 Pro, Landing AI, LlamaParse, Dolphin, and Nanonets) on the same financial report, so you can see how each handles tables, headings, footnotes, and markdown output.
You’ll learn which tools are fastest, which keep structure intact, what they really cost, and when self-hosting is worth it. By the end, you’ll know exactly which parser fits your stack, budget, and deadlines. And the timing couldn’t be better: the intelligent document processing market is growing fast, with Grand View Research estimating it at $2.30B in 2024 and $2.96B in 2025.
Backed by Andrew Ng, Landing AI has built its Agentic Document Extraction tool to simplify document parsing without complex setup. The platform offers a free playground where you can quickly test its capabilities, making it ideal for teams who want to experiment before committing.
Its well-structured documentation and fine-grained API controls make integration smooth for developers. While the output is generally accurate and reliable, users should note that pricing starts at $0.03 per page, which is manageable for smaller projects but may become costly at scale. Overall, it’s a strong choice for accuracy-focused, low-effort deployment.
Getting started with LandingAI is simple and only takes a few minutes.
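First, install the SDK with pip install agentic-doc and set your LandingAI API key in the VISION_AGENT_API_KEY environment variable (package and variable names are taken from LandingAI’s docs at the time of writing; double-check them in your dashboard). Parsing a PDF is then just a few lines of Python: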
import time
from agentic_doc.parse import parse
# Parse a local file
start = time.time()
result = parse("QuantumLeap Analytics Inc. - Q3 2024 Financial Report.pdf")
end = time.time()
# Get the extracted data as markdown
print("Extracted Markdown:")
print(result[0].markdown)
time_taken = end - start
print(f"Total time taken: {time_taken:.2f} seconds")
Link: LandingAI output
Dolphin is an open-source document parsing model released by ByteDance and available on Hugging Face. It’s designed to handle standard PDFs and text-heavy documents reasonably well, but like many open-source parsers, it can stumble when faced with more complex structures such as nested tables, multi-column layouts, or documents with heavy formatting.
Being open source, Dolphin gives developers the flexibility to self-host, experiment, and customize workflows without recurring per-page costs. However, it does require a dedicated GPU (around 5.8 GB of VRAM), making it more suited to teams who are comfortable with infrastructure management.
In our benchmarks, Dolphin delivered decent accuracy on headings and simple tables, but occasionally misordered content or misformatted markdown in dense financial reports. On the plus side, it was fast (~7.1 seconds) compared to most commercial tools, making it attractive for projects where speed matters more than perfect structure.
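We benchmarked Dolphin with the scripts from its official repo, but to give a sense of what self-hosting looks like, here is a minimal page-level parsing sketch via Hugging Face transformers. The VisionEncoderDecoderModel interface and the prompt string are assumptions based on the ByteDance/Dolphin model card, so verify them there before building on this:

import torch
from PIL import Image
from transformers import AutoProcessor, VisionEncoderDecoderModel

# Load Dolphin from Hugging Face (interface and prompt strings assumed
# from the ByteDance/Dolphin model card; verify before relying on this)
processor = AutoProcessor.from_pretrained("ByteDance/Dolphin")
model = VisionEncoderDecoderModel.from_pretrained("ByteDance/Dolphin")
device = "cuda" if torch.cuda.is_available() else "cpu"
model.eval().to(device)

def parse_page(image: Image.Image, prompt: str) -> str:
    """Run one prompt against one page image and return the decoded text."""
    pixel_values = processor(image, return_tensors="pt").pixel_values.to(device)
    # Dolphin is prompt-driven: the instruction is fed in as decoder input
    prompt_ids = processor.tokenizer(
        f"<s>{prompt} <Answer/>", add_special_tokens=False, return_tensors="pt"
    ).input_ids.to(device)
    outputs = model.generate(
        pixel_values=pixel_values,
        decoder_input_ids=prompt_ids,
        max_length=4096,
    )
    text = processor.tokenizer.decode(outputs[0], skip_special_tokens=True)
    # The decoded sequence may echo the prompt; strip it if present
    return text.replace(prompt, "").strip()

page = Image.open("page_1.png").convert("RGB")  # hypothetical rasterized page
print(parse_page(page, "Parse the reading order of this document."))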
Link: Dolphin by ByteDance output
LlamaParse is the cloud-based document parsing solution offered by LlamaIndex. It’s designed to handle a wide variety of document types, from research papers and contracts to financial reports. One of its biggest advantages is accessibility: every user gets 10,000 free credits per month, making it easy to try before scaling.
Because it’s a cloud service, LlamaParse doesn’t require heavy GPU resources like Dolphin. Integration is straightforward, with a simple Python SDK and support for multiple file formats. In our tests, it managed standard PDFs well, preserving tables and headings, but struggled slightly with very complex documents (e.g., multi-level nested tables).
Best for: Developers and teams who want a lightweight, cost-effective parser with generous free usage, and don’t want to worry about GPU or infrastructure management.
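Setup is minimal: install the llama-cloud-services package with pip and put your LLAMA_CLOUD_API_KEY (generated from the LlamaCloud dashboard) in your environment or a .env file, which is what the snippet below loads via dotenv.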
from llama_cloud_services import LlamaParse
from dotenv import load_dotenv
import os
import time
load_dotenv()
parser = LlamaParse(
    api_key=os.getenv("LLAMA_CLOUD_API_KEY"),
    num_workers=4,
    verbose=True,
    language="en",
)
start = time.time()
result = parser.parse("QuantumLeap Analytics Inc. - Q3 2024 Financial Report.pdf")
end = time.time()
# Get the markdown documents from the result
markdown_documents = result.get_markdown_documents(split_by_page=True)
# Combine all markdown content
markdown_content = ""
for i, doc in enumerate(markdown_documents):
    if i > 0:
        markdown_content += "\n\n---\n\n"  # Add page separator
    markdown_content += doc.text
# Save to markdown file
with open("markdown_output_llama_parse.md", "w", encoding="utf-8") as f:
    f.write(markdown_content)
print("Markdown content saved to markdown_output_llama_parse.md")
print(f"Time taken: {end - start:.2f} seconds")
print(f"Number of pages processed: {len(markdown_documents)}")
Link: LlamaParse Output
Nanonets-OCR-s is one of the most accurate open-source models we tested for document parsing. It consistently captured even the most complex tables, multi-level headings, and footnotes with high precision, making it stand out from other open-source alternatives like Dolphin.
The trade-off, however, is performance. Nanonets is very GPU-intensive, requiring around 17.7 GB of VRAM, and it runs noticeably slower compared to competitors (our benchmark clocked ~83 seconds per document). For teams with limited compute resources, this can be a bottleneck. On the other hand, if infrastructure isn’t a constraint, you can pair Nanonets with vLLM to dramatically speed up processing, though that’s a more advanced setup.
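For reference, the script below depends on transformers, torch, pillow, and pdf2image; pdf2image also needs the poppler system package installed to rasterize PDF pages.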
import os
from PIL import Image
from transformers import AutoTokenizer, AutoProcessor, AutoModelForImageTextToText
from pdf2image import convert_from_path
import torch
import sys
import time
OUTPUT_DIR = "output"
os.makedirs(OUTPUT_DIR, exist_ok=True)
# Load model
model_path = "nanonets/Nanonets-OCR-s"
model = AutoModelForImageTextToText.from_pretrained(model_path, torch_dtype="auto", device_map="auto")
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = AutoProcessor.from_pretrained(model_path)
BASE_PROMPT = (
    "Convert this document into a clean, structured markdown format with headings, lists, tables, "
    "and proper indentation wherever relevant. Remove irrelevant artifacts or OCR errors."
)
def warmup_model():
    """Warmup the model to avoid flawed timings"""
    print("Warming up model...")
    # Create a small dummy image
    dummy_image = Image.new("RGB", (224, 224), color="white")
    dummy_prompt = "Extract text from this image."
    # Run a few warmup iterations
    for i in range(3):
        try:
            _ = process_image(dummy_image, dummy_prompt)
        except Exception as e:
            print(f"Warmup iteration {i+1} failed: {e}")
            continue
    print("Model warmup completed.")
def process_image(image: Image.Image, prompt: str):
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": prompt},
            ],
        },
    ]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[text], images=[image], padding=True, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=15000, do_sample=False)
    # Drop the prompt tokens so only newly generated text is decoded
    generated_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
    output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    return output_text[0]
def handle_pdf(file_path: str):
    images = convert_from_path(file_path, dpi=300)
    all_text = ""
    for image in images:
        text = process_image(image, BASE_PROMPT)
        all_text += text + "\n\n"
    save_path = os.path.join(OUTPUT_DIR, os.path.splitext(os.path.basename(file_path))[0] + ".md")
    with open(save_path, "w") as f:
        f.write(all_text)
    print(f"[✓] Processed PDF: {file_path} -> {save_path}")
    return save_path
def main():
    if len(sys.argv) != 2:
        print("Usage: python run_ocr_nanonets.py <file.pdf>")
        sys.exit(1)
    file_path = sys.argv[1]
    # Check if file exists
    if not os.path.exists(file_path):
        print(f"Error: File '{file_path}' not found.")
        sys.exit(1)
    # Check if file is PDF
    ext = file_path.lower().split(".")[-1]
    if ext != "pdf":
        print("Error: Only PDF files are supported.")
        sys.exit(1)
    # Warmup the model
    warmup_model()
    print(f"Processing PDF: {file_path}")
    start = time.time()
    output_file = handle_pdf(file_path)
    end = time.time()
    print(f"Time taken: {end - start:.2f} seconds")

if __name__ == "__main__":
    main()
Link: Nanonets Output
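If you do take the vLLM route mentioned above, the idea is to serve the model behind vLLM’s OpenAI-compatible endpoint and send page images over HTTP instead of looping through transformers.generate. Here’s a rough sketch, assuming you’ve started a server with vllm serve nanonets/Nanonets-OCR-s and that your vLLM version supports this model (check the model card and vLLM docs):

import base64
from openai import OpenAI

# Point the OpenAI client at a local vLLM server started with:
#   vllm serve nanonets/Nanonets-OCR-s
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def ocr_page(image_path: str, prompt: str) -> str:
    """Send one rasterized page through vLLM's OpenAI-compatible chat endpoint."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="nanonets/Nanonets-OCR-s",
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
                {"type": "text", "text": prompt},
            ],
        }],
        max_tokens=15000,
        temperature=0.0,
    )
    return response.choices[0].message.content

print(ocr_page("page_1.png", "Convert this document into clean, structured markdown."))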
Gemini-2.5-pro by Google is arguably the most well-rounded solution for document parsing in 2025. It consistently delivered near-perfect results in our testing, accurately preserving headings, nested tables, footnotes, and complex document structures with minimal errors. Compared to other paid tools, Gemini stands out for its balance of accuracy, speed, and cost-effectiveness.
The best part is accessibility: you can test Gemini for free in Google AI Studio or directly through the Gemini app before moving into paid API usage. For developers, programmatic access is available with clear documentation, though it requires setting up a Google Cloud project and enabling the Gemini API. For our test, we used the following prompt:
Convert the contents of this PDF into well-formatted Markdown.
Preserve all structural elements like headings, lists, tables, and paragraphs.
Maintain the original formatting and hierarchy. Ensure the output is clean.
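For programmatic access, here’s a minimal sketch using the google-genai Python SDK with the prompt above. The client methods (client.files.upload and client.models.generate_content) reflect the SDK at the time of writing; confirm against Google’s current docs, since the SDK surface has changed between releases:

import time
from google import genai

# Reads the API key from your environment (see Google's docs for the
# expected variable name), or pass api_key=... explicitly
client = genai.Client()

start = time.time()
# Upload the PDF, then ask the model to convert it with the prompt above
pdf = client.files.upload(file="QuantumLeap Analytics Inc. - Q3 2024 Financial Report.pdf")
response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=[
        pdf,
        "Convert the contents of this PDF into well-formatted Markdown. "
        "Preserve all structural elements like headings, lists, tables, and paragraphs. "
        "Maintain the original formatting and hierarchy. Ensure the output is clean.",
    ],
)
end = time.time()

with open("markdown_output_gemini.md", "w", encoding="utf-8") as f:
    f.write(response.text)
print(f"Time taken: {end - start:.2f} seconds")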
Link: Gemini-2.5-pro output
| Name | Type | Time (sec) | Observation | Use Case |
|------|------|------------|-------------|----------|
| LandingAI | Paid | 41.9 | Minor presentation issues, otherwise good | Best for users needing a simple, out-of-the-box solution with good documentation |
| Dolphin | Open-Source | 7.1 | Garbled markdown and heading order | Best for those who need a fast, self-hosted solution and can tolerate some inaccuracies |
| LlamaParse | Paid | 53.7 | Significant structural and data-omission issues | A cheap option that works well for traditional tables and PDFs |
| Nanonets | Open-Source | 83.2 | Perfect data extraction, imperfect markdown | Best for users prioritizing complete data extraction in a self-hosted environment where processing time is not critical |
| Gemini-2.5-pro | Paid | 45 | Worked perfectly in all of our tests | Best for applications requiring accuracy and reliable performance on complex documents, where ease of use is a priority |
Each of these document parsers brings something unique to the table. Dolphin is lightweight and open-source but sacrifices accuracy in complex layouts. LandingAI is simple to integrate but can get costly over time. LlamaParse strikes a balance with ease of use and free credits, though it struggles with highly detailed financials. Nanonets is the accuracy champion among open-source models, but its GPU demand and slower speed make it better suited to environments where time isn’t critical. Finally, Gemini-2.5-pro delivers the best all-around performance (fast, accurate, and user-friendly), provided you’re comfortable with a paid, closed-source solution.
If your priority is accuracy with self-hosted control, Nanonets is the best pick. But if you want a fast, reliable, and easy-to-scale solution, Gemini-2.5-pro clearly stands out as the winner.