u/Financial-Sort3957 — reddlx

I’m working on a construction document AI system and trying to solve a high-precision extraction problem.

This is not basic “chat with PDF.” The system ingests plans/specs/finish schedules/door schedules/MEP drawings and needs to output strict structured ledgers.

The failure mode:

RAG can often find the evidence, but the pipeline fails to turn it into clean first-class rows.

Example target rows:

Wilsonart PL1 = 4880-38 Carbon Mesh
Wilsonart PL2 = 4886 Pearl Soapstone
Mohawk LVT = Living Local, Two Tone 958, 7.75" x 52"
Daltile Portfolio = Ash Grey
Schlage Saturn = 626 satin chromium
Greenheck EF-1 = SP-A90
American Standard P-1 = #215AA.104/105

The app often finds the text somewhere, but merges/buries/misroutes it:

PL1/PL2 become “Wilsonart 4880 / 4886”
LVT/carpet/tile tokens get blended
door hardware is found in submittals but never becomes a clean spec-detail row
facts land in evidence excerpts or scope rows instead of a strict material/spec ledger

We tried standard RAG, agentic RAG, focused trade calls, ledgers, submittal extractors, golden audits, bridge checks, etc.

Current architecture is:

Docs → OCR/chunks/tables → Evidence Store → focused extraction → strict ledgers → views

Ledgers:

Spec Detail Ledger = manufacturer/model/finish/color/size/criteria/source/evidence
Submittal Ledger = vendor deliverables
Scope Ledger = installed work/trade scope

The rule is supposed to be: if evidence exists, it must land in the correct ledger before any PM display/view formatting.

Question: how would you design the extraction flow so exact model numbers/colors/finish tags reliably become structured rows instead of getting merged or buried?

Would you use:

page-level vision calls for schedules/finish legends?
direct PDF calls for spec pages?
table extraction before RAG?
one extractor per spec category?
constrained JSON schema with one row per product?
post-extraction audit/repair passes?
something else?

Looking for serious advice from people who have solved high-precision document extraction, not generic RAG tips.

reddit.com

u/Financial-Sort3957 — 7 days ago

▲ 5 r/Rag

I’m working on a construction document AI system and trying to solve a high-precision extraction problem.

This is not basic “chat with PDF.” The system ingests plans/specs/finish schedules/door schedules/MEP drawings and needs to output strict structured ledgers.

The failure mode:

RAG can often find the evidence, but the pipeline fails to turn it into clean first-class rows.

Example target rows:

Wilsonart PL1 = 4880-38 Carbon Mesh
Wilsonart PL2 = 4886 Pearl Soapstone
Mohawk LVT = Living Local, Two Tone 958, 7.75" x 52"
Daltile Portfolio = Ash Grey
Schlage Saturn = 626 satin chromium
Greenheck EF-1 = SP-A90
American Standard P-1 = #215AA.104/105

The app often finds the text somewhere, but merges/buries/misroutes it:

PL1/PL2 become “Wilsonart 4880 / 4886”
LVT/carpet/tile tokens get blended
door hardware is found in submittals but never becomes a clean spec-detail row
facts land in evidence excerpts or scope rows instead of a strict material/spec ledger

We tried standard RAG, agentic RAG, focused trade calls, ledgers, submittal extractors, golden audits, bridge checks, etc.

Current architecture is:

Docs → OCR/chunks/tables → Evidence Store → focused extraction → strict ledgers → views

Ledgers:

Spec Detail Ledger = manufacturer/model/finish/color/size/criteria/source/evidence
Submittal Ledger = vendor deliverables
Scope Ledger = installed work/trade scope

The rule is supposed to be: if evidence exists, it must land in the correct ledger before any PM display/view formatting.

Question: how would you design the extraction flow so exact model numbers/colors/finish tags reliably become structured rows instead of getting merged or buried?

Would you use:

page-level vision calls for schedules/finish legends?
direct PDF calls for spec pages?
table extraction before RAG?
one extractor per spec category?
constrained JSON schema with one row per product?
post-extraction audit/repair passes?
something else?

Looking for serious advice from people who have solved high-precision document extraction, not generic RAG tips.

reddit.com

u/Financial-Sort3957 — 7 days ago

▲ 3 r/LLMeng+2 crossposts

I’m building an AI document intelligence platform for commercial construction, and I’m at the point where I need help from someone who truly understands RAG, document extraction, structured outputs, OCR, multi-document reasoning, and LLM pipeline architecture.

This is not a basic “how do I chat with a PDF?” problem.

The product ingests construction documents such as drawings, specs, addenda, RFPs, schedules, finish legends, door schedules, MEP drawings, plumbing fixture schedules, etc. The goal is to extract useful construction outputs like:

Scope by division/trade
Submittal register
Strict material/specification ledger
Manufacturer/model/color/finish/spec criteria
Compliance requirements
Schedule/milestone requirements
Risk flags
Cross-document conflicts
Evidence-backed citations/page references

The biggest issue is with material/spec detail extraction.

For example, from a small tenant improvement project, the system should extract rows like:

Wilsonart Standard Laminate, PL1, 4880-38 Carbon Mesh
Wilsonart Standard Laminate, PL2, 4886 Pearl Soapstone
Mohawk Group Living Local LVT, Two Tone 958, 7.75" x 52" plank
Mohawk Group carpet tile, Steel 937
Daltile Portfolio, Ash Grey, 12" x 24"
Daltile Fabrique, Blanc Linen
Sherwin-Williams ProMar 200 / Pro-Cryl systems with color numbers and VOC/mil criteria
Schlage Saturn Series, 626 satin chromium
LCN 1460 closers
Hager BB1191 hinges
CECO hollow metal impact door with Florida Product Approval
Overhead Door 427/429 with NOA number
Kelley KM7130 dock leveler
Greenheck SP-A90 exhaust fan
Titus PAS/PAR diffuser/grille models
American Standard / Elkay plumbing fixture models
Type L / Type K copper piping
3M CP 25WB+ / FB-3000 WT firestopping

A normal ChatGPT/OpenAI UI run can produce a much better strict spec ledger manually from the same documents. But our actual application pipeline keeps underperforming.

What works:

OCR/chunking is mostly okay. RAG can often find the facts. The evidence exists. Focused calls are better than broad calls. Ledgers are the right direction. Golden audits are useful.

What keeps going wrong:

The system finds some facts, but they do not reliably become clean, first-class structured rows.

Facts get buried in evidence excerpts, scope rows, submittal rows, or verification rows instead of becoming clean material/spec rows.

For example, it may know “Wilsonart 4880 / 4886” exists, but it merges PL1 and PL2 instead of separating:

PL1 = 4880-38 Carbon Mesh
PL2 = 4886 Pearl Soapstone

Or it may find Mohawk flooring text but blend carpet tile, LVT, colors, and model names together.

Or it may find door hardware in a submittal row but fail to create separate strict spec rows for Schlage, LCN, Hager, Rockwood, etc.

We have tried many approaches:

Standard RAG
Agentic RAG
More complex multi-pass RAG
Focused division/trade calls
Submittal-specific extractors
Scope ledgers
Archipelago-style ledgers
Candidate rows
Final adjudication
Regression/golden audits
Cross-pack bridge checks
PM display cleanup layers
Deterministic fallback rows
Evidence IDs
Vector search / evidence search
Postgres ledger persistence
Multiple architecture refactors

The current simplified architecture we are testing is:

Documents
→ OCR/chunks/tables
→ Evidence Store
→ Focused extraction
→ Strict ledgers
→ Views

Ledgers are split like this:

Spec Detail Ledger
Material/product facts: manufacturer, model, finish, color, size, criteria, source doc, evidence.

Submittal Ledger
Vendor deliverables: product data, shop drawings, samples, calculations, certificates, warranties.

Scope Ledger
Installed work: trade scope, division, responsibility, coordination flags.

Views are generated after ledger persistence:

Spec Detail Ledger → Material Specifications View
Submittal Ledger → PM Submittal Register
Scope Ledger → Division Map / Scope Packages
Verification rows → QA / Review View

The new rule is supposed to be:

If source evidence exists, it must land in the correct ledger first. Views can group or format, but they cannot silently delete source-backed facts.

The old flow that caused problems was:

Docs → OCR/chunks → RAG → focused extraction → candidate rows → adjudication → final_submittals[] → ledger → PM view

The issue was that final_submittals[] and final adjudication became the authority. If the model compressed, merged, or demoted a row, the facts disappeared from the visible output even when evidence existed.

The new architecture is better conceptually, but output quality is still far behind what the OpenAI UI can produce manually.