Big · 2 sources · last seen 17h ago · first seen 17h ago

Vision-capable LLMs vs. OCR for QA on long documents (including charts, images, tables, etc.)

I benchmarked vision-capable LLMs (the "just attach the PDF and let the model read it" pattern) against OCR-based pipelines on 30 long, image-heavy PDFs from MMLongBench-Doc ([https://github.com/mayubo2333/MMLongBench-Doc](https://github.com/mayubo2333/MMLongBench-Doc)). There were 171 questions in

Lead: r/artificial
Bigness: 62
vision-capable, llms, ocr, long-document, including
📡 Coverage: 50 (2 news sources)
🟠 Hacker News: 0
🔴 Reddit: 73 (67 upvotes across 2 subs)
📈 Google Trends: 0

Receipts (all sources)

score 100

score 98
