I’m working on a construction document AI system and trying to solve a high-precision extraction problem.
This is not basic “chat with PDF.” The system ingests plans, specs, finish schedules, door schedules, and MEP drawings, and it needs to output strict structured ledgers.
The failure mode:
RAG can often find the evidence, but the pipeline fails to turn it into clean first-class rows.
Example target rows:
- Wilsonart PL1 = 4880-38 Carbon Mesh
- Wilsonart PL2 = 4886 Pearl Soapstone
- Mohawk LVT = Living Local, Two Tone 958, 7.75" x 52"
- Daltile Portfolio = Ash Grey
- Schlage Saturn = 626 satin chromium
- Greenheck EF-1 = SP-A90
- American Standard P-1 = #215AA.104/105
The app often finds the text somewhere, but then merges, buries, or misroutes it:
- PL1/PL2 become “Wilsonart 4880 / 4886”
- LVT/carpet/tile tokens get blended
- door hardware is found in submittals but never becomes a clean spec-detail row
- facts land in evidence excerpts or scope rows instead of a strict material/spec ledger
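To make the first failure concrete, this is roughly the kind of cheap check I have in mind for catching merged rows. It is only a sketch: `MODEL_TOKEN` and `looks_merged` are hypothetical names, and the regex assumes model numbers look like the Wilsonart codes above (four digits with an optional two-digit suffix), which will not hold across manufacturers.

```python
import re

# Assumption: model numbers look like "4880-38" or "4886".
# Real catalogs would need per-manufacturer patterns.
MODEL_TOKEN = re.compile(r"\b\d{4}(?:-\d{2})?\b")


def looks_merged(value: str) -> bool:
    """Return True when one row value contains more than one model-like token."""
    return len(MODEL_TOKEN.findall(value)) > 1


print(looks_merged("Wilsonart 4880 / 4886"))  # True: two products in one row
print(looks_merged("4880-38 Carbon Mesh"))    # False: one product
```

A flagged row would go back for a split/repair pass instead of being written to a ledger as-is.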
So far we have tried standard RAG, agentic RAG, focused trade calls, ledgers, submittal extractors, golden audits, and bridge checks, among other approaches.
Current architecture is:
Docs → OCR/chunks/tables → Evidence Store → focused extraction → strict ledgers → views
Ledgers:
- Spec Detail Ledger = manufacturer/model/finish/color/size/criteria/source/evidence
- Submittal Ledger = vendor deliverables
- Scope Ledger = installed work/trade scope
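For reference, a Spec Detail Ledger row could be sketched like this. The field names come from the list above; the `tag` key and the rule that a tag may not bundle several products are my own assumptions, and the way the example values are split between `model` and `color` is illustrative only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpecDetailRow:
    # `tag` is a hypothetical key (e.g. "PL1"); other fields mirror the ledger list.
    tag: str
    manufacturer: str
    model: Optional[str] = None
    finish: Optional[str] = None
    color: Optional[str] = None
    size: Optional[str] = None
    criteria: Optional[str] = None
    source: Optional[str] = None
    evidence: Optional[str] = None

    def __post_init__(self) -> None:
        # One row per product: reject tags that bundle several items.
        if "/" in self.tag or "," in self.tag:
            raise ValueError(f"tag must identify exactly one product: {self.tag!r}")


rows = [
    SpecDetailRow(tag="PL1", manufacturer="Wilsonart", model="4880-38", color="Carbon Mesh"),
    SpecDetailRow(tag="PL2", manufacturer="Wilsonart", model="4886", color="Pearl Soapstone"),
]
```

With this shape, a merged value such as `tag="PL1/PL2"` fails at construction time rather than reaching a view.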
The rule is supposed to be: if evidence exists, it must land in the correct ledger before any PM display/view formatting.
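One way I picture enforcing that rule is a coverage gate that compares tags seen in the evidence against tags present in the ledger. This is a minimal sketch under assumptions: `TAG` and `missing_tags` are hypothetical names, and the pattern assumes schedule tags look like "PL1", "EF-1", or "P-1".

```python
import re

# Assumption: tags are 1-3 capital letters, an optional hyphen, then 1-2 digits.
# The lookarounds stop it from matching inside model numbers such as "SP-A90".
TAG = re.compile(r"(?<![\w-])[A-Z]{1,3}-?\d{1,2}(?![\w-])")


def missing_tags(evidence_chunks: list[str], ledger_tags: set[str]) -> set[str]:
    """Return tags that appear in evidence but have no row in the ledger."""
    seen = {tag for chunk in evidence_chunks for tag in TAG.findall(chunk)}
    return seen - ledger_tags


evidence = [
    "PL1 Wilsonart 4880-38 Carbon Mesh",
    "PL2 Wilsonart 4886 Pearl Soapstone",
    "EF-1 Greenheck SP-A90",
]
print(missing_tags(evidence, {"PL1", "EF-1"}))  # {'PL2'}
```

Any non-empty result would block the view step until the missing rows are extracted or explicitly waived.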
Question: how would you design the extraction flow so exact model numbers/colors/finish tags reliably become structured rows instead of getting merged or buried?
Would you use:
- page-level vision calls for schedules/finish legends?
- direct PDF calls for spec pages?
- table extraction before RAG?
- one extractor per spec category?
- constrained JSON schema with one row per product?
- post-extraction audit/repair passes?
- something else?
Looking for serious advice from people who have solved high-precision document extraction, not generic RAG tips.