r/Rag

▲ 4 r/Rag

How to parse tables from pdfs with 100% accuracy?

I've tried a lot over the past two weeks but can't find a simple solution. I basically have PDFs with 100-row tables and want to extract the tables into CSVs. I tried paid online services like extend, reducto, landing, and gemini; none are 100% accurate since they are OCR models.

I get accurate text extraction if I use Python PDF libraries like pdfplumber/camelot. The problem is that PDFs don't have a standard way of representing tables, so the output columns are sometimes combined or split improperly, e.g. two columns get merged. I tried adjusting some parameters, but it either over-merges or under-merges columns.

What is the solution to using python libraries properly? It's a pita to solve and I'm surprised it's not easier.
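One way around the merging problem, when every page shares the same layout, is to stop asking the library to guess column boundaries and pin them by hand. A minimal sketch, assuming word boxes shaped like the dicts pdfplumber's `extract_words()` returns (`text`, `x0`, `x1`, `top`); the coordinates, the column edges, and the `words_to_rows` helper are made up for illustration:

```python
def words_to_rows(words, column_edges, row_tolerance=3.0):
    """Group word boxes into rows, then drop each word into a fixed column.

    column_edges: sorted x-coordinates; column i spans edges[i]..edges[i+1].
    row_tolerance: max vertical distance (in PDF points) within one row.
    """
    n_cols = len(column_edges) - 1
    rows = []  # each entry: (top of the row's first word, per-column cells)
    for word in sorted(words, key=lambda w: w["top"]):
        center = (word["x0"] + word["x1"]) / 2
        # Pick the column whose span contains the word's horizontal center.
        col = next((i for i in range(n_cols)
                    if column_edges[i] <= center < column_edges[i + 1]), None)
        if col is None:
            continue  # word lies outside the table area
        # Start a new row once the vertical gap exceeds the tolerance.
        if not rows or word["top"] - rows[-1][0] > row_tolerance:
            rows.append((word["top"], [[] for _ in range(n_cols)]))
        rows[-1][1][col].append((word["x0"], word["text"]))
    # Inside each cell, restore left-to-right reading order.
    return [[" ".join(text for _, text in sorted(cell)) for cell in cells]
            for _, cells in rows]


# Hypothetical word boxes for a two-column table.
sample_words = [
    {"text": "Widget", "x0": 10, "x1": 40, "top": 100.0},
    {"text": "A", "x0": 42, "x1": 48, "top": 100.4},
    {"text": "12.50", "x0": 120, "x1": 150, "top": 100.2},
    {"text": "Gadget", "x0": 10, "x1": 45, "top": 115.0},
    {"text": "7.00", "x0": 122, "x1": 150, "top": 115.1},
]
table = words_to_rows(sample_words, column_edges=[0, 100, 200])
print(table)  # [['Widget A', '12.50'], ['Gadget', '7.00']]
```

pdfplumber's table settings also accept `"vertical_strategy": "explicit"` together with `"explicit_vertical_lines"`, which applies the same idea inside the library. Either way it only helps if the column positions are stable across pages.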

u/bravelogitex — 6 hours ago

▲ 0 r/Rag

Spent a weekend debugging why my RAG pipeline gave garbage answers, turned out the problem wasn't the model at all

Built a basic RAG setup a few months ago. Retrieval looked fine, model was decent, but the answers were consistently half-wrong or weirdly incomplete.

Spent way too long suspecting the LLM. Swapped models twice. Still bad.

Turned out the issue was how I was chunking documents.

I was using fixed 512-token chunks with no overlap. Clean, simple, felt logical. But the retrieved chunks kept cutting sentences mid-thought, sometimes right before the actual answer, sometimes right after. The model was working with literally incomplete information and hallucinating the rest.

What actually helped:

1. Adding overlap (obvious in hindsight). Went from 0 overlap to ~50 tokens. Retrieval quality jumped immediately. The "answer" wasn't getting split across two chunks anymore.

2. Respecting natural document boundaries. Splitting by paragraph or section instead of raw token count made a huge difference for structured documents like PDFs and docs with headers.

3. Smaller chunks + more of them. Counterintuitive, but retrieving 6 small clean chunks beat retrieving 3 large messy ones. Less noise in the context window.

4. Checking what actually got retrieved. I wasn't logging retrieved chunks at all early on. Once I started printing them, I immediately saw the problem. Obvious step I skipped because I assumed retrieval was working.
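The first two points can be combined in one small chunker. This is a rough sketch, not the poster's code: "tokens" are whitespace-separated words standing in for a real tokenizer, and the `chunk_paragraphs` name and the tiny limits in the example are assumptions chosen to keep the output readable.

```python
def chunk_paragraphs(text, max_tokens=120, overlap_tokens=20):
    """Pack whole paragraphs into chunks and carry a tail of overlap forward.

    Requires overlap_tokens < max_tokens.
    """
    paragraphs = [p.split() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []
    for words in paragraphs:
        # Close the current chunk if this paragraph would overflow it.
        if current and len(current) + len(words) > max_tokens:
            chunks.append(" ".join(current))
            current = current[-overlap_tokens:] if overlap_tokens else []
        current.extend(words)
        # A single oversized paragraph falls back to fixed windows with overlap.
        while len(current) > max_tokens:
            chunks.append(" ".join(current[:max_tokens]))
            current = current[max_tokens - overlap_tokens:]
    if current:
        chunks.append(" ".join(current))
    return chunks


doc = "a1 a2 a3 a4 a5\n\nb1 b2 b3 b4 b5\n\nc1 c2 c3 c4 c5"
chunks = chunk_paragraphs(doc, max_tokens=10, overlap_tokens=2)
for i, chunk in enumerate(chunks):
    print(i, chunk)  # printing what will be indexed is the cheap sanity check
```

With these toy limits the second chunk starts with `b4 b5`, the tail of the previous chunk, so an answer that straddles the boundary is still retrievable in one piece.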

The model was never the bottleneck. The garbage-in-garbage-out problem was upstream the whole time.

Curious if others ran into this, especially with PDFs. Those feel like a special kind of painful.

u/Helpful_Regular_30 — 3 hours ago

▲ 3 r/Rag

Anyone tried the new Granite 4.1 models (3B and 8B) for RAG?

It seems RAG is one of their main purposes. I'm looking to do my first local RAG project and am looking for suitable 4b and 8b models. Also, which of the LLM benchmarks are important when considering the RAG application?

u/atumblingdandelion — 10 hours ago

▲ 9 r/Rag+3 crossposts

Free RAG Interview Q&A repo with all 10 types of RAG. 50 questions with detailed answers, difficulty tags, and a decision tree. Contributors welcome!

Hey everyone,

I've been going deep on RAG architectures lately and couldn't find a single resource that covered all the modern variants in one place, so I built one and open-sourced it.

What's in the repo:

10 sections covering every major RAG type
50 interview questions tagged [Basic] / [Intermediate] / [Advanced]
Detailed answers with architecture diagrams, code snippets, and trade-off tables
A cheatsheet with a decision tree ("which RAG should I use?")
GitHub Pages site auto-deployed on every push

RAG types covered: Naive, Advanced, Modular, Agentic, Graph, Corrective (CRAG), Self-RAG, Speculative, Multi-modal, and Long-context RAG.

https://github.com/ather-techie/rag-interview-questions

Looking for contributors! If you've been in an ML/LLM interview recently and got a question not covered here, please open a PR or drop it in the comments. I'll add it with credit.

If this is useful, a star on GitHub goes a long way. It helps others discover it. Thanks!

u/Western-Slip199 — 14 hours ago

▲ 2 r/Rag

New to rag

Hi guys, I am new to RAG and currently learning about vector and vectorless RAG using clean text documents like PDFs.

I asked ChatGPT how to master RAG and it gave detailed steps, but I want to know what the most advanced type of RAG is at present.

I have learnt a bit about vectorless RAG on text documents. Now I have to learn how to use vector RAG on text, and later combine them both into a single RAG.

If there is any other kind of RAG besides these two, please suggest it.

u/ExtensionDetective85 — 18 hours ago

▲ 3 r/Rag

What to Learn in RAG + Project Recommendations

I started learning RAG a little while ago and have built two pipelines: one by following a tutorial and one by experimenting on my own, trying various methods along the way. Now I know how to pick up new things and implement them, but I’m still not sure what to learn next.

Most of what I find online is just basic chunking and retrieval methods, nothing beyond that. Can anyone please suggest what I should focus on learning and how to figure out the right path?

Also, what kind of projects would be good to build if I want to attract clients?

u/PenEquivalent5091 — 1 day ago

▲ 32 r/Rag

Legal RAG remains unsolved because it needs authority, not just relevance

RAG for the legal domain has been “hot” for a long time, and the market is now crowded with products.

I see a lot of posts from devs/lawyers building legal RAG, but the discussions focus mainly on chunking, embeddings, reranking, and fine-tuning. That is important, but I think they overlook the harder question: what will actually help legal professionals?

I wrote down my impressions on why useful Legal RAG is still hard even after many years of research/products:

1. Legal queries are complex. They need keyword search, semantic search, jurisdiction awareness, and some legal knowledge baked into the retrieval process. So we probably need robust hybrid/agentic search pipelines, not just vector search. This is harder to build.
2. Retrieving “superficially” relevant cases/citations is not enough. A citation can be semantically relevant but legally unusable: overruled, wrong jurisdiction, lower court, stale, or not citable for the point you need.
This second issue is critical. It needs "authority-aware" retrieval and citation validation, both of which need significant human involvement. It is not something a better embedding model or reranking alone will fix.
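To make "authority-aware" concrete, here is a minimal sketch of a post-retrieval filter. Everything in it is hypothetical: the `Case` fields, the court-level scale, and the sample citations. In practice flags like `overruled` would have to come from human-curated citation data, which is exactly the expensive part described above.

```python
from dataclasses import dataclass


@dataclass
class Case:
    citation: str
    jurisdiction: str
    court_level: int   # hypothetical scale: 3 = highest court, 1 = trial court
    overruled: bool    # assumed to come from curated citator data
    relevance: float   # score from whatever retriever ran first


def authority_filter(candidates, jurisdiction, min_court_level=2):
    """Drop legally unusable hits, then rank by authority before relevance."""
    usable = [
        c for c in candidates
        if not c.overruled
        and c.jurisdiction == jurisdiction
        and c.court_level >= min_court_level
    ]
    return sorted(usable, key=lambda c: (c.court_level, c.relevance), reverse=True)


# Made-up candidates; "Case A" is the most similar hit but has been overruled.
candidates = [
    Case("Case A", "US", 3, True, 0.95),
    Case("Case B", "US", 2, False, 0.90),
    Case("Case C", "EU", 3, False, 0.88),
    Case("Case D", "US", 3, False, 0.70),
    Case("Case E", "US", 1, False, 0.85),
]
ranked = authority_filter(candidates, jurisdiction="US")
print([c.citation for c in ranked])  # ['Case D', 'Case B']
```

The top semantic hit never reaches the model here, which is the point: relevance alone would have surfaced an unusable citation.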

I also think this is a problem with many benchmarks. Without enough human involvement, benchmarks end up being curated with LLM judges, checking narrow retrieval from specific passages, and do not match the messier patterns lawyers deal with in reality.

Without hard, realistic public legal benchmarks, it is difficult to know whether we are building “real” Legal AI, or just better demos.

If you’ve tried building Legal RAG, or getting lawyers to use your tool, I’d love to know the challenges you faced and the top blockers to adoption.

Longer write-up here: https://agentengg.substack.com/p/why-legal-ai-remains-unsolved-a-technical

u/ekshaks — 1 day ago

▲ 2 r/Rag

Built a Fetch API that returns page labels, not just markdown

I'm working on a Fetch API for RAG, agents, and web ingestion workflows.

Think Firecrawl/Jina Reader-style URL-to-markdown or clean-text API, but with one extra signal layer: page labels for content category and page structure.

The pain point: fetching is only the first step. You still need to decide whether a page is useful, relevant, and worth sending into indexing, embedding, or an LLM pipeline.

Examples of labels we return:

dead link / main content missing → skip low-value pages early
homepage / index page vs content page → avoid mixing navigation/listing pages with real content
content category → keep vertical pipelines from indexing out-of-scope pages, e.g. a finance workflow pulling in random entertainment/forum pages

Our category labels cover broad areas like Finance, Health, News, Ecommerce, Education, Jobs, Travel, and more.

A couple of open questions:

If you've already built filtering logic on top of a fetch API — skipping listing pages, filtering by topic, dropping dead links — curious what that looks like in your pipeline. Does moving this upstream actually save work, or just add a layer you'd rather control yourself?
Beyond category and page structure, what other fields or labels would actually be useful in a fetch API response? Author, publish date, sentiment, product pricing, freshness signals...? Curious what's missing from current fetch tools for your pipeline.

Happy to share access if you want to try it. New signups get $5 credit, around 5k pages.

Try it / sign up: https://octen.ai/platform/extract
Full docs: https://docs.octen.ai/api-reference/extract#response-data-results-items-category

u/Shot-Neighborhood332 — 1 day ago

▲ 42 r/Rag

why does everyone skip the chunking part

every RAG tutorial i've seen spends 80% of the time on vector databases and embeddings and then says "chunk your documents" like it's obvious and moves on.

it's not obvious. it's actually the thing that breaks most implementations.

fixed size chunking splits wherever the token limit hits. doesn't care about sentence boundaries, doesn't care if two sentences only make sense together. you end up retrieving half a thought and the model fills in the rest, confidently, which is the whole problem you were trying to solve.

sliding window with overlap is what most people actually use in production and it's fine, but the real thing that helped me was just reading what was actually getting retrieved for failed queries instead of assuming the pipeline was working. almost always the chunk was on the right topic but missing the sentence that contained the actual answer.

the other thing, vector search breaks on exact identifiers. someone asks about a specific model number or product code, semantic search returns "close enough" results. close enough is wrong. hybrid search with BM25 alongside vectors handles this but it never shows up in the intro tutorials so you find out the hard way.
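A common way to combine BM25 and vector results is reciprocal rank fusion (RRF), which only needs the two ranked lists, not comparable scores. A minimal sketch; the document ids are made up and `k=60` is just the conventional default:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids; higher ranks contribute larger scores."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Query mentions product code 4471. BM25 matches the exact code;
# the vector search ranks a near-identical code first.
bm25_ranking = ["sku-4471"]
vector_ranking = ["sku-4417", "sku-4471", "sku-4470"]
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)  # ['sku-4471', 'sku-4417', 'sku-4470']
```

The exact match wins because it appears in both lists, even though the vector ranking alone put a "close enough" product first.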

and stale index. you update a document, don't re-index, user gets a confidently wrong answer. it's not a technical problem it's a pipeline problem which is probably why nobody writes about it.

curious what others are doing for re-indexing, currently on a schedule and it works but feels fragile.
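One option that makes scheduled re-indexing less fragile is to store a content hash per document at index time and re-index only what changed. A minimal sketch; the `plan_reindex` helper and the in-memory dicts are assumptions standing in for whatever store the pipeline actually uses:

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a document's current content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(documents, indexed_hashes):
    """Compare current documents to the hashes recorded at index time.

    documents: {doc_id: current text}
    indexed_hashes: {doc_id: hash stored when the doc was last indexed}
    Returns (to_upsert, to_delete) as sorted lists of doc ids.
    """
    to_upsert = sorted(
        doc_id for doc_id, text in documents.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )
    to_delete = sorted(set(indexed_hashes) - set(documents))
    return to_upsert, to_delete


indexed = {
    "a": content_hash("v1"),
    "b": content_hash("v2"),
    "c": content_hash("gone"),
}
current = {"a": "v1", "b": "v2 edited", "d": "brand new"}
to_upsert, to_delete = plan_reindex(current, indexed)
print(to_upsert, to_delete)  # ['b', 'd'] ['c']
```

Because the check is cheap, it can run far more often than a full re-index, and it also catches deleted documents that would otherwise keep answering from the index.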

u/SilverConsistent9222 — 2 days ago

▲ 0 r/Rag

Looking for a Founding Engineering Intern

Currently at Pre-MVP stage - HRTech AI Product

Looking for someone specialised in RAG/Page Indexing along with LLMs.

Initial offer - 3 months unpaid internship - MVP Stage.

After 3 months, full-time conversion based on performance, with equity on the table.

DM me your work and contact details if interested.

u/Ok-Dish-4208 — 2 days ago

▲ 7 r/Rag

I built a plug-and-play, portable RAG system for local codebases.

I’ve been building something called Spectrum, and the idea is pretty simple:

repo/folder -> .specpack -> local HTTP API -> agent-ready search + context

With so many people vibe coding but without much knowledge of how RAG works, RAG can be overwhelming to set up and maintain. So I have gone down a GUI route (there's CLI and API too) to make it work alongside apps like Codex and Claude.

Spectrum takes a project folder or repo, packs it into a portable .specpack, builds a lightweight retrieval index, and serves it locally over HTTP so agents and tools can actually search and read the project context properly.

Basically: instead of wiring up a vector DB, cloud service, random chunk folders, loose indexes, and whatever other goblin machinery usually appears the second you say “RAG”, you get one portable project context bundle.

The goal is:

pack a project into one file
move that file between machines easily
serve it locally
allow for additional corpuses to be added to the live server and index on the fly
let agents search, hydrate, and read the actual project context
keep code, docs, notes, config, and ops stuff together
avoid depending on cloud services just to give an agent memory of a repo

Current flow looks like this:

spectrum load ./my-repo ./my-repo.specpack
spectrum serve ./my-repo.specpack --port 7777

Then agents/tools can hit endpoints like:

GET  /projects/repo/context
POST /packs/repo/search
GET  /packs/repo/documents/{path}

I’m not trying to replace full vector databases.

It’s more like a portable project briefcase for agents: one file that carries the project context, retrieval index, and document hydration with it.

The bit I’m especially interested in is portability. A .specpack can be copied between devices, stored with a project, passed to another machine, or used as a local memory/context layer without rebuilding a whole RAG setup every time.

I’m also working on encrypted .specpack support, so the model becomes:

locked .specpack at rest
unlock once locally with a password
agents use the fast local API

So you can have portable project memory that is local-first, easy to move around, and password-protected when not in use.

Would genuinely love feedback from anyone working with local codebase RAG, agent context, repo memory, or dev-tool workflows.

The links for github and the GUI apps are available at https://bytespectrum.cc/

Especially interested in whether this solves a real pain point for people, or whether I’ve just built a very fancy briefcase for myself.

u/Otherwise-Ad9322 — 2 days ago

▲ 1 r/Rag

Exploring hash-addressed evidence chunks as a complement to RAG

I’ve been working on an early exploration project around a common RAG retrieval problem.

This is not meant to argue that RAG is obsolete or that vector search is useless. The hypothesis is narrower:

Can hash-addressed evidence chunks act as a complementary pre-retrieval boundary layer for RAG?

The idea is to take high-value or frequently reused text, split it into chunks, and attach each chunk to metadata such as corpus, jurisdiction, unit, source_hash, and evidence_hash.

Traditional RAG often starts with:

“Which chunk is semantically similar?”

This exploration first asks:

“Which bounded evidence address is this query allowed to search?”

This does not mean removing BM25, vector search, or hybrid retrieval. The idea is to narrow the allowed search space first, then let BM25 / TF-IDF / vector / hybrid retrieval work inside that bounded subset.
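A toy version of that two-step flow, to show the shape of the idea rather than the repo's implementation: the chunk texts are placeholders (not real statutory language), the `bounded_search` helper is hypothetical, and plain term overlap stands in for BM25 or vector scoring.

```python
import hashlib


def evidence_hash(text):
    """Short content address for a chunk (truncated SHA-256)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


# Placeholder chunks; the metadata keys mirror the ones named in the post.
CHUNKS = [
    {"jurisdiction": "EU", "unit": "EU AI Act Article 5",
     "text": "placeholder text on prohibited practices"},
    {"jurisdiction": "US", "unit": "FTC Act Section 5",
     "text": "placeholder text on unfair practices"},
    {"jurisdiction": "US", "unit": "CDA Section 230",
     "text": "placeholder text on platform liability"},
]


def bounded_search(query, chunks, jurisdiction):
    """Step 1: restrict to the allowed boundary. Step 2: score inside it."""
    allowed = [c for c in chunks if c["jurisdiction"] == jurisdiction]
    terms = set(query.lower().split())
    scored = []
    for chunk in allowed:
        overlap = len(terms & set(chunk["text"].lower().split()))
        if overlap:
            scored.append((overlap, chunk["unit"], evidence_hash(chunk["text"])))
    scored.sort(key=lambda item: item[0], reverse=True)
    # An empty list plays the role of a NO_EVIDENCE outcome.
    return [(unit, digest) for _, unit, digest in scored]


results = bounded_search("unfair prohibited practices", CHUNKS, "US")
print(results)
```

Without the jurisdiction boundary, the EU chunk would tie with the FTC chunk on term overlap for this query. With it, the EU chunk is never a candidate, and each returned unit carries a hash that can be logged for audit.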

I call this experiment HSRAG: Hash-Structured Retrieval-Augmented Generation. The name may sound larger than the current implementation, so the more precise description is: an early exploration project where hash-addressed chunks act as retrieval boundaries and auditable evidence units. I plan to keep testing different hypotheses, baselines, corpus settings, and failure cases.

I used legal text as the first benchmark domain because legal retrieval has many easy-to-define cross-domain failure cases, such as:

EU AI Act Article 5
U.S. FTC Act Section 5
CDA Section 230
EU DMA gatekeeper obligations

The latest benchmark is RQ6, which tests multi-turn retrieval contamination. For example, if a user first asks about EU law and then switches to a similar U.S. law, does the retriever incorrectly carry over the previous EU context?

RQ6 stress run:

20,000 Monte Carlo trials
720,000 result rows
retrieval modes: BM25, TF-IDF, Hybrid RRF, HSRAG CTHC, HSRAG Hybrid Subset
context policies: no_memory, naive_memory, bounded_cthc_memory

In this controlled benchmark, the current observation is that HSRAG modes had:

0 wrong-corpus retrieval
0 wrong-jurisdiction retrieval
0 false allow on NO_EVIDENCE / AMBIGUOUS cases
0 cross-turn contamination

Important caveats:

This depends on clean upfront corpus classification
This is not legal advice
This is not production-ready
This is not a RAG replacement claim
If metadata is wrong, the hash-addressed boundary can also be wrong
Multi-law comparison questions should still be decomposed into atomic retrieval tasks before synthesis

Repo: https://github.com/Void-Ghost000/HSRAG

I’m curious how people here would think about this:

Are hash-addressed evidence chunks useful as a complement to RAG?
Is this better described as metadata filtering, routing, or retrieval governance?
What would be a fair vector baseline for this kind of benchmark?
In production RAG, would the biggest issue be ingestion, metadata quality, scale, or query decomposition?

u/Ill-Structure4482 — 2 days ago

▲ 16 r/Rag

Deepseek v4 is better for rag pipeline debugging than claude opus

I have been optimizing a RAG system with 12 different embedding models and retrieval strategies. I initially used Claude Opus 4.7 through the Anthropic API for the analysis but hit walls when diagnosing performance bottlenecks across the full pipeline. The task was to trace how retrieval failures in one component cascade through the system: embedding mismatches affect chunk relevance, which degrades reranking, which throws off context assembly.

I needed to see the entire pipeline as interconnected failure modes. Opus analyzed each component well individually, but it treated them as isolated issues instead of modeling cascade effects. I then switched to DeepSeek via the DeepInfra API with the same logs and metrics, and this time DeepSeek mapped the full system and showed how embedding model A's poor performance on technical jargon triggered downstream reranker failures, causing context window pollution and creating feedback loops that Opus had missed. The multi-component analysis captured interdependencies that Opus didn't quite hold simultaneously.

Opus still wins on code, no doubt about that, but for tracing failure propagation across complex multi-stage pipelines, DeepSeek's analytical depth on interconnected system behaviour is much stronger. When debugging cross-component issues where one failure triggers three others, DeepSeek identified the root cause faster, usually pointing to the upstream component.

I ran both models on the same 2-week diagnostic log spanning 8 million requests. Opus produced 14 isolated, per-component recommendations, while DeepSeek produced 6 system-level changes that addressed interaction failures. I implemented DeepSeek's suggestions first, and they fixed 11 of the 14 issues that Opus had flagged.

Anyone else using multiple models for their RAG debugging? Interested in hearing which model combinations you've found work best for multi-component failure analysis.

u/jasperc_6 — 3 days ago

▲ 10 r/Rag

Benchmarked Gemma 4 26B-A4B vs Ministral 14B vs Qwen3 variants on a Turkish RAG workload — small models punch way above their weight

Hey folks — spent the last week running a real-world RAG benchmark and the results surprised me enough that I wanted to share and get a sanity check from the community.

Test setup

Domain: Turkish-language enterprise RAG (emails, contracts, postmortems, SOWs, CSVs — 60 mixed docs)
Stack: ParadeDB (Postgres + pgvector + native BM25), Vercel AI SDK, Google embedding (1536d), no reranker yet
Test set: 20 questions across 7 categories — simple lookup, multi-hop chains, contradiction resolution (doc A says X, doc B says Y — which is current?), numeric/CSV aggregation, hallucination traps (asking about projects that don't exist), Turkish morphology variants
All models served via gateway, so latency is network-dominated, not local-inference

Models tested (open-weight only) — Q1-Q2 baseline score out of 5

Gemma 4 26B-A4B (MoE, 26B total / 4B active) — 5.0 / 5
Ministral 14B (dense) — 4.75 / 5
Qwen 3.6-27B (dense) — 4.75 / 5
Nemotron 3 Super 120B (MoE, 120B / 12B active) — 3.5 / 5
Qwen 3-30B A3B (MoE, 30B / 3B active) — 2.0 / 5
Qwen 3 Next 80B (MoE, 80B / 3B active) — 1.0 / 5

What surprised me

Big model does not equal better RAG. Qwen 3-Next 80B and Qwen 3-30B A3B both fell apart on tool-calling discipline — hallucinating arguments, skipping retrieval, confidently making things up. Nemotron 120B got the easy answer right but missed nuance. Meanwhile Gemma's 4B active params and Ministral's 14B dense crushed it.
Ministral 14B is the dark horse. Smaller footprint than anything else on the list and Turkish output quality is arguably the cleanest of the bunch. It only loses points on completeness — sometimes skips a citation Gemma would catch. For edge or laptop deployment this is hard to beat.
Gemma 4 26B-A4B is the most disciplined. Best at I-don't-know refusals (didn't hallucinate on trap questions about fictional projects). Best at multi-hop chains. Tool calls are minimal and on-target — 3 to 4 times fewer calls than Ministral for the same answer. The 4B active param MoE design is genuinely impressive here.

What about you — which models have you tried on similar RAG workloads? Any that surprised you in either direction? Anything you'd recommend I throw into the next round?

TL;DR: Gemma 4 26B-A4B wins on quality, Ministral 14B is the small-footprint sweet spot, the bigger MoE models (Qwen 3-Next 80B, Qwen 3-30B A3B) underperformed badly on tool use.

u/Lanaxsa — 3 days ago

▲ 5 r/Rag+1 crossposts

Need suggestions/validation on a Filter-first + RAG fallback architecture for Product Recommendations.

Current challenge:
- We have a product recommendation/search system where precision matters more than recall.

Client expectation is:
- ~95% queries should resolve through deterministic/filter-based retrieval
- Only ~5% should go through RAG/semantic reasoning

Reason:
- Product catalog is limited
- Pure RAG/vector search gives decent recall but poor precision
- Earlier implementation used LLMs (Claude) to generate filters directly from prompts with confidence scoring > 90, but hallucinated filters caused poor SQL retrieval quality.

What I implemented:

Instead of relying on prompt-only filter extraction, I converted metadata into embeddings.
Stored metadata in PGVector using Cohere embeddings.
Each metadata entry is aligned with:
category, subcategory, normalized attributes/tags
Retrieval flow:
Vector similarity retrieval
Hybrid reranking for better precision + recall
Retrieved metadata candidates are then used to construct filters for SQL/product retrieval.
RAG is used only as fallback when filter confidence is low or query intent is ambiguous.
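The fallback decision in the last step could look something like this. It is a sketch only: the `route_query` helper, the threshold, and the margin are made-up starting points that would need tuning against labeled queries, and the candidate scores are invented.

```python
def route_query(candidates, threshold=0.8, margin=0.1):
    """Choose between deterministic filters and the RAG fallback.

    candidates: list of (filter_dict, score) pairs, e.g. metadata entries
    returned by vector retrieval plus reranking.
    """
    if not candidates:
        return "rag", None
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    best_filter, best_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    # A low top score means a weak match; a small gap to the runner-up
    # means the intent is ambiguous. Both cases go to the fallback.
    if best_score < threshold or best_score - runner_up < margin:
        return "rag", None
    return "filter", best_filter


confident = [({"category": "shoes"}, 0.93), ({"category": "bags"}, 0.55)]
ambiguous = [({"category": "shoes"}, 0.90), ({"category": "boots"}, 0.88)]
print(route_query(confident))  # ('filter', {'category': 'shoes'})
print(route_query(ambiguous))  # ('rag', None)
```

Logging which branch each query takes gives a direct measurement of the 95/5 split the client expects, instead of assuming it.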

Observed improvements:
Better filter consistency
Reduced hallucinated attributes
Better precision compared to prompt-only extraction
More controllable retrieval pipeline

Questions:

Is this generally the right architecture direction for enterprise product recommendations/search?
Any better approaches for:
metadata normalization
filter confidence scoring
query-to-filter mapping
reducing semantic drift?
Would knowledge graphs/taxonomy mapping help more than embeddings here?
How do teams usually decide when to invoke RAG vs deterministic retrieval?

Would appreciate suggestions from people working on enterprise search, RAG systems, recommendation engines, or e-commerce or medical retrieval pipelines.

u/Tron_tx6 — 4 days ago

▲ 17 r/Rag

Best free resources to learn RAG end-to-end?

I’m looking for the best free resources (websites, docs, courses, GitHub repos, YouTube, blogs) to learn RAG end-to-end — from fundamentals to advanced topics.
Interested in:
Types of RAG (Agentic, Graph, Multimodal, etc.)
Chunking, embeddings, retrieval, reranking
Vector DBs and frameworks
Evaluation and production best practices
Tools like LangChain, LlamaIndex, DSPy, etc.
I have a software engineering background, so technical/deep content is fine.
What resources helped you learn and build production-ready RAG systems?

u/Unstable007 — 5 days ago

▲ 12 r/Rag+1 crossposts

GraphRAG - Entity deduplication

Hi everyone,

I have a question related to GraphRAG. I have some experience applying it in the legal domain, and one recurring problem I face is entity duplication after the LLM extracts entities and relationships.

For example, the same person may appear in slightly different forms across documents, such as “jack,” “Dr. Jack,” “Jack Abbot,” or other variations. As a result, the graph ends up with multiple nodes that actually refer to the same real-world entity.

Have you encountered this issue before? If so, what approaches have worked best for resolving it?

I have tried several unification methods based on embedding similarity, but they have not fully solved the problem. I would be especially interested in practical strategies for entity canonicalization, entity resolution, or graph-level deduplication in a GraphRAG pipeline.
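As a complement to embedding similarity, a cheap rule-based pass can catch the variations in the example above. This is a minimal sketch, assuming that one name's tokens being a subset of another's means the same entity; that assumption over-merges when two different people share a first name, so ambiguous short forms probably need document context or manual review. "Jill Stone" is an invented control entry.

```python
import re

TITLES = {"dr", "mr", "mrs", "ms", "prof"}


def normalize(name):
    """Lowercase, strip punctuation, and drop honorifics."""
    tokens = re.findall(r"[a-z]+", name.lower())
    return tuple(t for t in tokens if t not in TITLES)


def canonicalize(names):
    """Map each surface form to a canonical name via token-subset matching."""
    parent = {n: n for n in names}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    keyed = [(n, set(normalize(n))) for n in names]
    for i, (a, tokens_a) in enumerate(keyed):
        for b, tokens_b in keyed[i + 1:]:
            if tokens_a and tokens_b and (tokens_a <= tokens_b or tokens_b <= tokens_a):
                parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    mapping = {}
    for members in groups.values():
        # The form with the most name tokens becomes the canonical label.
        canonical = max(members, key=lambda m: (len(normalize(m)), len(m)))
        for m in members:
            mapping[m] = canonical
    return mapping


mapping = canonicalize(["jack", "Dr. Jack", "Jack Abbot", "Jill Stone"])
print(mapping)
```

Here all three Jack variants map to "Jack Abbot" and the control name maps to itself. Running this as a blocking step before embedding comparison keeps the expensive similarity checks to the leftover, genuinely ambiguous pairs.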

u/AttentionDiffuser — 5 days ago

▲ 45 r/Rag

We replaced our RAG pipeline with persistent KV cache. It works. Here’s what we found.

We’ve been running RAG in production for a while. It worked but maintaining it was a constant tax. Re-embedding on data changes, tuning chunking strategies, debugging retrieval misses, managing the vector database. Every moving part was something that could break.

So we ran an experiment. Instead of chunking and embedding documents, we loaded the full document into context, cached the KV state persistently, and reused that cache across every query.

No vector database. No embedding pipeline. No retrieval step. Just the model with full document context, warm and ready.
What we found:

• Answer quality is noticeably better: no retrieval misses, no wrong chunks, full context every time
• Updates are dramatically faster — change the document, regenerate the cache, done in minutes vs hours of re-indexing
• Operational complexity dropped significantly: no pipeline to maintain, no retrieval quality to monitor
• Current limit is around 120k tokens. Works for most business documents, not for massive corpora

Where it breaks down:
• Documents larger than context window are still a problem
• Very large document collections still need a different approach
• Cold cache on first load takes time; warm queries are fast
We’re genuinely curious if others have tried this. Especially interested in:
• How your use cases map to context window limits
• Whether retrieval quality was your biggest RAG pain point or something else
• What you’d need to see to replace your RAG pipeline entirely

Happy to answer any questions

u/pmv143 — 7 days ago

▲ 6 r/Rag

I have released a CLI tool for creating micro RAG knowledge bases

Hi, I’ve released mrag (Micro RAG), a CLI tool for creating RAG knowledge bases. I developed it with the goal of making it easy for users who aren’t very familiar with RAG to experiment with creating knowledge bases locally.

Personally, I find it convenient because it makes it easy to provide small knowledge bases to agent tools like Claude Code. Also, since I work with a lot of Japanese documentation, it’s a bit Japanese-friendly. The code was 100% written by Claude Code. Please give it a try if you’d like!

https://github.com/bathtimefish/mrag

u/bathtimefish — 4 days ago

▲ 13 r/Rag

RAG GenAI development

Building a GenAI development pipeline for 10-K/10-Q analysis. The legal PDFs are 300 pages with tables, footnotes, and nested sections.

Tried recursive chunking, semantic chunking, and layout-aware parsing. Still getting 20% of answers missing key context from tables or mixing up fiscal years. Embeddings are text-embedding-3-large. Reranker helped but latency jumped to 4s.

For those doing RAG GenAI development on dense financial/legal docs, what chunking + metadata strategy actually works? Are you pre-processing with an LLM to extract table JSON first?
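One pattern worth trying for the fiscal-year mix-ups, sketched under assumptions: once a table has been extracted (by a parser or an LLM), each row is indexed as its own record that repeats the column headers and carries the filing's fiscal year as metadata, so retrieval can filter on year before ranking. The `table_to_records` and `filter_by_year` helpers, the field names, and the numbers below are made up for illustration.

```python
def table_to_records(table, fiscal_year, section, source="10-K"):
    """Turn a header-plus-rows table into one self-describing record per row."""
    header, *rows = table
    records = []
    for row in rows:
        fields = dict(zip(header, row))
        # Repeat the column names in the text so each row stands alone.
        text = "; ".join(f"{key}: {value}" for key, value in fields.items())
        records.append({
            "text": f"[{source} FY{fiscal_year}, {section}] {text}",
            "metadata": {"fiscal_year": fiscal_year, "section": section,
                         "source": source, "type": "table_row"},
            "fields": fields,
        })
    return records


def filter_by_year(records, fiscal_year):
    """Hard metadata filter to apply before any similarity ranking."""
    return [r for r in records if r["metadata"]["fiscal_year"] == fiscal_year]


# Made-up table contents.
table = [["Segment", "Revenue (USD m)"], ["Cloud", "120"], ["Devices", "80"]]
records = (table_to_records(table, 2023, "Segment results")
           + table_to_records(table, 2022, "Segment results"))
print(filter_by_year(records, 2023)[0]["text"])
# [10-K FY2023, Segment results] Segment: Cloud; Revenue (USD m): 120
```

Keeping the year in both the text and the metadata means the embedding sees it and the filter can enforce it, which is cheaper than adding another reranking pass to the 4s latency budget.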

u/Embarrassed_Pay1275 — 6 days ago