How do I parse tables from PDFs with 100% accuracy?
I've tried a lot over the past two weeks but can't find a simple solution. I have PDFs with 100-row tables and want to extract the tables into CSVs. I tried paid online services like Extend, Reducto, Landing, and Gemini, but none are 100% accurate, since they are OCR-based.
I do get accurate text extraction with Python PDF libraries like pdfplumber and camelot. The problem is that PDFs have no standard way of representing tables, so the output columns are sometimes merged or split incorrectly, for example two columns end up combined into one. I tried adjusting some parameters, but the result either over-merges or under-merges columns.
What is the right way to use the Python libraries for this? It's a PITA to solve and I'm surprised it's not easier.
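To make the column problem concrete, here is a minimal sketch of one possible workaround, not a definitive fix. It assumes the column x-boundaries are known in advance and stay the same across pages, which may not hold for every PDF. The input is a list of word dicts with `text`, `x0`, `x1`, and `top` keys, the shape pdfplumber's `extract_words()` returns. The names `words_to_rows`, `column_edges`, and `row_tolerance` are hypothetical, and the sample words are made up for illustration.

```python
from bisect import bisect_right


def words_to_rows(words, column_edges, row_tolerance=3.0):
    """Group word dicts into table rows using fixed column x-boundaries.

    words: dicts with 'text', 'x0', 'x1', 'top'.
    column_edges: sorted x-coordinates; N + 1 edges define N columns.
    row_tolerance: max difference in 'top' for words on the same row.
    """
    n_cols = len(column_edges) - 1

    # Cluster words into rows by their vertical position.
    rows = []
    for w in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if rows and abs(w["top"] - rows[-1]["top"]) <= row_tolerance:
            rows[-1]["words"].append(w)
        else:
            rows.append({"top": w["top"], "words": [w]})

    table = []
    for row in rows:
        cells = [[] for _ in range(n_cols)]
        for w in sorted(row["words"], key=lambda w: w["x0"]):
            # Assign each word to the column that contains its midpoint.
            center = (w["x0"] + w["x1"]) / 2
            i = bisect_right(column_edges, center) - 1
            if not 0 <= i < n_cols:
                # Fail loudly instead of silently dropping text.
                raise ValueError(f"word outside column edges: {w['text']!r}")
            cells[i].append(w["text"])
        table.append([" ".join(c) for c in cells])
    return table


# Made-up sample: a header row and one data row, three columns.
words = [
    {"text": "Item", "x0": 50, "x1": 75, "top": 100},
    {"text": "Qty", "x0": 200, "x1": 220, "top": 100.4},
    {"text": "Price", "x0": 300, "x1": 330, "top": 100},
    {"text": "Blue", "x0": 50, "x1": 72, "top": 115},
    {"text": "widget", "x0": 76, "x1": 110, "top": 115},
    {"text": "12", "x0": 205, "x1": 215, "top": 115.2},
    {"text": "3.50", "x0": 305, "x1": 328, "top": 115},
]
edges = [40, 180, 280, 400]  # assumed column boundaries

table = words_to_rows(words, edges)
for row in table:
    print(row)
```

Because the boundaries are fixed rather than inferred from whitespace, "Blue widget" stays in one cell and neighbouring columns cannot merge. Both libraries have a built-in version of the same idea: pdfplumber accepts `"vertical_strategy": "explicit"` with `"explicit_vertical_lines"` in its table settings, and camelot's stream flavor accepts a `columns` argument. The resulting rows can be written out with the standard `csv.writer`.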