u/Financial-Sort3957

I’m working on a construction document AI system and trying to solve a high-precision extraction problem.

This is not basic “chat with PDF.” The system ingests plans/specs/finish schedules/door schedules/MEP drawings and needs to output strict structured ledgers.

The failure mode:

RAG can often find the evidence, but the pipeline fails to turn it into clean first-class rows.

Example target rows:

  • Wilsonart PL1 = 4880-38 Carbon Mesh
  • Wilsonart PL2 = 4886 Pearl Soapstone
  • Mohawk LVT = Living Local, Two Tone 958, 7.75" x 52"
  • Daltile Portfolio = Ash Grey
  • Schlage Saturn = 626 satin chromium
  • Greenheck EF-1 = SP-A90
  • American Standard P-1 = #215AA.104/105

The app often finds the text somewhere, but merges/buries/misroutes it:

  • PL1/PL2 become “Wilsonart 4880 / 4886”
  • LVT/carpet/tile tokens get blended
  • door hardware is found in submittals but never becomes a clean spec-detail row
  • facts land in evidence excerpts or scope rows instead of a strict material/spec ledger

We tried standard RAG, agentic RAG, focused trade calls, ledgers, submittal extractors, golden audits, bridge checks, etc.

Current architecture is:

Docs → OCR/chunks/tables → Evidence Store → focused extraction → strict ledgers → views

Ledgers:

  • Spec Detail Ledger = manufacturer/model/finish/color/size/criteria/source/evidence
  • Submittal Ledger = vendor deliverables
  • Scope Ledger = installed work/trade scope

The rule is supposed to be: if evidence exists, it must land in the correct ledger before any PM display/view formatting.

Question: how would you design the extraction flow so exact model numbers/colors/finish tags reliably become structured rows instead of getting merged or buried?

Would you use:

  • page-level vision calls for schedules/finish legends?
  • direct PDF calls for spec pages?
  • table extraction before RAG?
  • one extractor per spec category?
  • constrained JSON schema with one row per product?
  • post-extraction audit/repair passes?
  • something else?
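Two of those options ("page-level vision calls for schedules" and "table extraction before RAG") can be combined into a deterministic page router that runs ahead of chunking. A minimal sketch, assuming a page classifier keyed on keywords and detected table grids; the function names and heuristics here are illustrative, not from the original pipeline:

```python
import re

# Hypothetical page-type router: schedules/finish legends go to a
# page-level vision/table extractor, CSI spec sections to direct text
# extraction, everything else to the normal chunk/embed path.
SCHEDULE_HINTS = re.compile(
    r"finish schedule|door schedule|fixture schedule|finish legend", re.I
)

def route_page(page_text: str, has_table_grid: bool) -> str:
    """Return which extractor should own this page."""
    if SCHEDULE_HINTS.search(page_text) or has_table_grid:
        return "vision_table_extractor"   # page-level vision call
    if re.search(r"SECTION \d{2} \d{2} \d{2}", page_text):  # CSI section number
        return "spec_text_extractor"      # direct PDF text call
    return "rag_chunker"                  # default chunk/embed path

print(route_page("ROOM FINISH SCHEDULE", has_table_grid=False))
# vision_table_extractor
```

The point of routing before RAG is that schedule pages never enter the chunker at all, so their rows cannot be split across chunk boundaries in the first place.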

Looking for serious advice from people who have solved high-precision document extraction, not generic RAG tips.

reddit.com
u/Financial-Sort3957 — 7 days ago
▲ 5 r/Rag


I’m building an AI document intelligence platform for commercial construction, and I’m at the point where I need help from someone who truly understands RAG, document extraction, structured outputs, OCR, multi-document reasoning, and LLM pipeline architecture.

This is not a basic “how do I chat with a PDF?” problem.

The product ingests construction documents such as drawings, specs, addenda, RFPs, schedules, finish legends, door schedules, MEP drawings, plumbing fixture schedules, etc. The goal is to extract useful construction outputs like:

  • Scope by division/trade
  • Submittal register
  • Strict material/specification ledger
  • Manufacturer/model/color/finish/spec criteria
  • Compliance requirements
  • Schedule/milestone requirements
  • Risk flags
  • Cross-document conflicts
  • Evidence-backed citations/page references

The biggest issue is with material/spec detail extraction.

For example, from a small tenant improvement project, the system should extract rows like:

  • Wilsonart Standard Laminate, PL1, 4880-38 Carbon Mesh
  • Wilsonart Standard Laminate, PL2, 4886 Pearl Soapstone
  • Mohawk Group Living Local LVT, Two Tone 958, 7.75" x 52" plank
  • Mohawk Group carpet tile, Steel 937
  • Daltile Portfolio, Ash Grey, 12" x 24"
  • Daltile Fabrique, Blanc Linen
  • Sherwin-Williams ProMar 200 / Pro-Cryl systems with color numbers and VOC/mil criteria
  • Schlage Saturn Series, 626 satin chromium
  • LCN 1460 closers
  • Hager BB1191 hinges
  • CECO hollow metal impact door with Florida Product Approval
  • Overhead Door 427/429 with NOA number
  • Kelley KM7130 dock leveler
  • Greenheck SP-A90 exhaust fan
  • Titus PAS/PAR diffuser/grille models
  • American Standard / Elkay plumbing fixture models
  • Type L / Type K copper piping
  • 3M CP 25WB+ / FB-3000 WT firestopping

A manual run through the ChatGPT/OpenAI UI can produce a much better strict spec ledger from the same documents, but our actual application pipeline keeps underperforming.

What works:

OCR/chunking is mostly okay. RAG can often find the facts. The evidence exists. Focused calls are better than broad calls. Ledgers are the right direction. Golden audits are useful.

What keeps going wrong:

The system finds some facts, but they do not reliably become clean, first-class structured rows.

Facts get buried in evidence excerpts, scope rows, submittal rows, or verification rows instead of becoming clean material/spec rows.

For example, it may know “Wilsonart 4880 / 4886” exists, but it merges PL1 and PL2 instead of separating:

  • PL1 = 4880-38 Carbon Mesh
  • PL2 = 4886 Pearl Soapstone

Or it may find Mohawk flooring text but blend carpet tile, LVT, colors, and model names together.

Or it may find door hardware in a submittal row but fail to create separate strict spec rows for Schlage, LCN, Hager, Rockwood, etc.

We have tried many approaches:

  • Standard RAG
  • Agentic RAG
  • More complex multi-pass RAG
  • Focused division/trade calls
  • Submittal-specific extractors
  • Scope ledgers
  • Archipelago-style ledgers
  • Candidate rows
  • Final adjudication
  • Regression/golden audits
  • Cross-pack bridge checks
  • PM display cleanup layers
  • Deterministic fallback rows
  • Evidence IDs
  • Vector search / evidence search
  • Postgres ledger persistence
  • Multiple architecture refactors

The current simplified architecture we are testing is:

Documents
→ OCR/chunks/tables
→ Evidence Store
→ Focused extraction
→ Strict ledgers
→ Views

Ledgers are split like this:

Spec Detail Ledger
Material/product facts: manufacturer, model, finish, color, size, criteria, source doc, evidence.

Submittal Ledger
Vendor deliverables: product data, shop drawings, samples, calculations, certificates, warranties.

Scope Ledger
Installed work: trade scope, division, responsibility, coordination flags.
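One way to keep facts from drifting between those three ledgers is to make routing deterministic rather than leaving it to the model's adjudication step. A minimal sketch, assuming each extracted fact already carries an evidence ID and a fact kind; all names here are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Fact:
    evidence_id: str
    fact_kind: str   # "product", "deliverable", or "scope"
    payload: dict

# Deterministic fact-kind -> ledger mapping; the model only labels the
# kind, it never chooses the destination.
LEDGERS = {"product": "spec_detail", "deliverable": "submittal", "scope": "scope"}

def route(fact: Fact) -> str:
    """Route a fact to exactly one ledger; unknown kinds fail loudly
    instead of silently landing in an evidence-excerpt bucket."""
    try:
        return LEDGERS[fact.fact_kind]
    except KeyError:
        raise ValueError(f"unroutable fact {fact.evidence_id}: {fact.fact_kind}")
```

Failing loudly on unroutable facts is the part that matters: a fact the model can't classify becomes a visible error to triage, not a row quietly demoted to an excerpt.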

Views are generated after ledger persistence:

  • Spec Detail Ledger → Material Specifications View
  • Submittal Ledger → PM Submittal Register
  • Scope Ledger → Division Map / Scope Packages
  • Verification rows → QA / Review View

The new rule is supposed to be:

If source evidence exists, it must land in the correct ledger first. Views can group or format, but they cannot silently delete source-backed facts.
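That rule can be enforced mechanically rather than by convention: run a coverage audit after ledger persistence and again after view generation, so a dropped fact is a failing check, not a silent gap. A sketch under assumed row shapes (every ledger row carries the evidence ID it came from, every view row points back at a ledger row):

```python
def audit_coverage(evidence_ids: set[str], ledger_rows: list[dict],
                   view_rows: list[dict]) -> dict:
    """Report evidence that never landed in a ledger, and ledger rows
    that no view surfaced. Both lists should be empty on a clean run."""
    in_ledgers = {r["evidence_id"] for r in ledger_rows}
    in_views = {r["source_row_id"] for r in view_rows}
    return {
        "orphan_evidence": sorted(evidence_ids - in_ledgers),       # never landed
        "dropped_rows": sorted(
            {r["row_id"] for r in ledger_rows} - in_views           # view ate it
        ),
    }
```

Wiring this into CI alongside the golden audits turns "views cannot silently delete facts" from a design intention into a regression test.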

The old flow that caused problems was:

Docs → OCR/chunks → RAG → focused extraction → candidate rows → adjudication → final_submittals[] → ledger → PM view

The issue was that final_submittals[] and final adjudication became the authority. If the model compressed, merged, or demoted a row, the facts disappeared from the visible output even when evidence existed.

The new architecture is better conceptually, but output quality is still far behind what the OpenAI UI can produce manually.

Recent result:

The app generated roughly 20 strict spec detail rows for a project where a manual OpenAI UI run produced 70+ useful product/material rows.

The app caught some things like:

  • 3M firestopping
  • Greenheck exhaust fan
  • Mecho window treatments
  • Some USG ceiling/drywall data
  • Some Mohawk/Daltile/Wilsonart evidence

But it missed, merged, or misplaced many of the actual buyout-critical model/color/spec items.

The most important problem is not general “summarization.” It is reliable extraction of:

  • Manufacturer
  • Model / series
  • Finish / color
  • Finish tag
  • Size
  • Product criteria
  • Installation criteria
  • Standards / codes / NOA / FPA / ASTM / UL / NFPA references
  • Source document/page/sheet
  • Evidence excerpt
  • Confidence/status
  • N/S when not specified

I’m trying to figure out the real solution.

Questions:

  1. Is this mainly a retrieval/evidence-packing problem?
  2. Is it a schema/prompt problem?
  3. Is it an OCR/table extraction problem?
  4. Is it a ledger routing problem?
  5. Should spec detail extraction bypass submittals entirely and run as its own direct material/spec pass?
  6. Should the system use direct PDF/page vision calls for schedules and finish legends instead of relying only on chunks?
  7. How would you design the extraction pipeline if the goal is buyout-grade construction material/spec rows?
  8. How do you prevent model output from merging distinct product rows?
  9. How do you audit this reliably against a human-created golden output?
  10. Has anyone solved this type of issue for construction plans/specs, legal docs, medical docs, insurance docs, or any other high-detail document extraction problem?
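On question 9, one workable pattern is to normalize both the app's rows and the human golden rows to a comparable key, then report recall and the exact missing/extra rows rather than a single score. A sketch; the normalization rules are illustrative guesses:

```python
import re

def key(row: dict) -> tuple:
    """Collapse a row to a case/punctuation-insensitive comparison key."""
    norm = lambda s: re.sub(r"[^a-z0-9]", "", s.lower())
    return (norm(row["manufacturer"]), norm(row["model"]),
            norm(row.get("finish_tag", "")))

def audit(app_rows: list[dict], golden_rows: list[dict]) -> dict:
    app = {key(r) for r in app_rows}
    gold = {key(r) for r in golden_rows}
    return {
        "recall": len(app & gold) / len(gold) if gold else 1.0,
        "missing": sorted(gold - app),   # buyout items the app never produced
        "extra": sorted(app - gold),     # hallucinated or misrouted rows
    }
```

Reporting the missing keys by name is what makes the audit actionable: a 20-of-70 run immediately shows which divisions or document types the pipeline is dropping.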

I am not looking for generic RAG advice. I’m looking for someone who has dealt with high-precision document extraction where the evidence exists, but the system fails to turn it into clean structured outputs.

If you have experience with this, I would really appreciate your thoughts. I am also open to hiring the right person/consultant if they can actually help solve it.
