u/bravelogitex

r/Rag

How to parse tables from PDFs with 100% accuracy?

I've tried a lot over the past two weeks but can't find a simple solution. I have PDFs with 100-row tables and want to extract the tables into CSVs. I tried paid online services like Extend, Reducto, Landing, and Gemini, but none are 100% accurate since they are OCR models.

I get accurate text extraction if I use Python PDF libraries like pdfplumber/camelot. The problem is that PDFs don't have a standard way of representing tables, so the output columns are sometimes merged or split incorrectly, e.g. two columns get merged into one. I tried adjusting some parameters, but it either over-merges or under-merges columns.
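
One way to sidestep the merge/split heuristics is to skip automatic column detection and assign each extracted word to a column by its x position. The sketch below assumes the column boundaries are known for a given layout (the `column_edges` values are hypothetical and would be measured once per document type) and that words arrive as dicts with `text`, `x0`, `x1`, and `top` keys, which is the shape pdfplumber's `extract_words()` returns. It is an illustration of the idea, not a drop-in fix.

```python
from bisect import bisect_right


def words_to_rows(words, column_edges, row_tolerance=3.0):
    """Group word dicts into table rows using fixed column boundaries.

    words: dicts with "text", "x0", "x1", "top" keys.
    column_edges: sorted x positions of the boundaries between columns
        (hypothetical values, measured by hand for one layout).
    row_tolerance: max vertical distance for words to share a row.
    """
    n_cols = len(column_edges) + 1
    rows = []  # each entry: (top of the row's first word, list of cells)

    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        # Start a new row when the word sits clearly below the current one.
        if not rows or abs(word["top"] - rows[-1][0]) > row_tolerance:
            rows.append((word["top"], [[] for _ in range(n_cols)]))

        # Pick the column from the word's horizontal midpoint.
        center = (word["x0"] + word["x1"]) / 2
        col = bisect_right(column_edges, center)
        rows[-1][1][col].append((word["x0"], word["text"]))

    # Join the words of each cell in left-to-right order.
    return [
        [" ".join(text for _, text in sorted(cell)) for cell in cells]
        for _, cells in rows
    ]
```

The resulting list of rows can be written out with `csv.writer(...).writerows(rows)`. pdfplumber's table settings also document an `"explicit"` vertical strategy with `explicit_vertical_lines`, which applies the same fixed-boundary idea inside the library.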

What's the proper way to use the Python libraries for this? It's a PITA to solve and I'm surprised it's not easier.

reddit.com
u/bravelogitex — 7 hours ago

I just did some testing across various providers and wanted to share my use case. It was construction spec tables, 100 rows max, passed in as PNGs, and my #1 requirement was maximum accuracy (100% is ideal since mistakes can be costly).

I used the following, ranked from best to worst:

  1. Extend - their playground was easy to play around with, and it quickly worked at 100% with minimal configuration. That was a surprise because they seemed similar to Reducto (see #3).
  2. Gemini - easy to work with, all I needed to pass in was a base64 of the image and a prompt. 100% accurate for fewer than 50 rows; a couple of errors started occurring above 50 rows.
  3. Reducto - basically Extend but 66% accurate. Results were pretty bad, yikes.
  4. Mistral OCR - used it on just 1 PNG, and it didn't return the bottom couple of rows for some reason. Stopped using it since missing rows were unacceptable.
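
For the "base64 of the image and a prompt" step mentioned for Gemini, a minimal sketch of building the request body might look like this. The field names (`contents`, `parts`, `inline_data`, `mime_type`, `data`) follow my understanding of the Gemini REST `generateContent` request shape and should be treated as an assumption to check against the current docs; the function name and the prompt are hypothetical. No network call is made here.

```python
import base64
import json


def build_image_request(png_bytes, prompt):
    """Build a JSON-serializable request body holding a prompt and one PNG.

    The payload layout is an assumption based on the Gemini REST
    generateContent format; verify it against the provider's docs.
    """
    # Base64-encode the raw image bytes and turn them into a plain string.
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return {
        "contents": [
            {
                "parts": [
                    {"text": prompt},
                    {"inline_data": {"mime_type": "image/png", "data": encoded}},
                ]
            }
        ]
    }


def load_png(path):
    """Read a PNG file from disk as raw bytes."""
    with open(path, "rb") as f:
        return f.read()
```

Usage would be along the lines of `json.dumps(build_image_request(load_png("table.png"), "Return this table as CSV."))`, with the file name and prompt as placeholders. Given the errors reported above 50 rows, the same builder could be called once per cropped half of a long table, though that is untested here.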
u/bravelogitex — 19 days ago