u/Helpful_Regular_30

r/Rag

Spent a weekend debugging why my RAG pipeline gave garbage answers. Turned out the problem wasn't the model at all

Built a basic RAG setup a few months ago. Retrieval looked fine, model was decent, but the answers were consistently half-wrong or weirdly incomplete.

Spent way too long suspecting the LLM. Swapped models twice. Still bad.

Turned out the issue was how I was chunking documents.

I was using fixed 512-token chunks with no overlap. Clean, simple, felt logical. But the chunk boundaries kept cutting sentences mid-thought, sometimes right before the actual answer, sometimes right after. The model was getting incomplete context and hallucinating the rest.
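A minimal sketch of that failure mode, not the actual pipeline: whitespace-split words stand in for real tokenizer tokens, the chunk size is shrunk to 4 so the cut is visible, and `fixed_chunks` plus the sample text are made up for illustration.

```python
def fixed_chunks(tokens, size):
    """Split a token list into consecutive chunks of `size`, with no overlap."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


# Hypothetical document text; words stand in for real tokens.
text = "The refund window is 30 days. Contact support to start a return."
tokens = text.split()

for chunk in fixed_chunks(tokens, size=4):
    print(" ".join(chunk))
# The refund window is
# 30 days. Contact support
# to start a return.
```

The question ("what is the refund window?") lands on the first chunk, but the answer ("30 days") sits in the second one. If only the first chunk gets retrieved, the model has to guess.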

What actually helped:

1. Adding overlap (obvious in hindsight). Went from 0 overlap to ~50 tokens. Retrieval quality jumped immediately. The "answer" wasn't getting split across two chunks anymore.

2. Respecting natural document boundaries. Splitting by paragraph or section instead of raw token count made a huge difference for structured documents like PDFs and docs with headers.

3. Smaller chunks + more of them. Counterintuitive, but retrieving 6 small clean chunks beat retrieving 3 large messy ones. Less noise in the context window.

4. Checking what actually got retrieved. I wasn't logging retrieved chunks at all early on. Once I started printing them, I immediately saw the problem. Obvious step I skipped because I assumed retrieval was working.
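Here is a rough sketch of the overlap, boundary, and logging steps, under some loud assumptions: words stand in for real tokens, paragraphs are assumed to be separated by blank lines (PDF extraction often won't be that tidy), and every function name and number below is hypothetical rather than taken from any particular library.

```python
import re


def chunk_with_overlap(tokens, size, overlap):
    """Sliding window: each chunk repeats the last `overlap` tokens of the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


def chunk_by_paragraph(text, max_tokens, overlap):
    """Keep each paragraph whole; fall back to overlapping windows only if it is too long."""
    chunks = []
    for para in re.split(r"\n\s*\n", text):  # assumption: blank line = paragraph break
        tokens = para.split()  # assumption: whitespace words approximate tokens
        if not tokens:
            continue
        if len(tokens) <= max_tokens:
            chunks.append(" ".join(tokens))
        else:
            windows = chunk_with_overlap(tokens, max_tokens, overlap)
            chunks.extend(" ".join(w) for w in windows)
    return chunks


def log_retrieved(query, retrieved):
    """Print exactly what is about to go into the prompt, before the model sees it."""
    print(f"query: {query!r}")
    for rank, chunk in enumerate(retrieved, start=1):
        print(f"  [{rank}] ({len(chunk.split())} words) {chunk}")


# Hypothetical usage; the sizes are arbitrary and the slice stands in for a real retriever.
doc = "Refunds are accepted within 30 days.\n\nContact support to start a return."
chunks = chunk_by_paragraph(doc, max_tokens=128, overlap=16)
log_retrieved("what is the refund window?", chunks[:1])
```

In a real setup the word split would be swapped for the embedding model's tokenizer, and `log_retrieved` would be called on whatever the vector store returns. The point is just to look at the chunks before blaming the model.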

The model was never the bottleneck. The garbage-in-garbage-out problem was upstream the whole time.

Curious if others ran into this, especially with PDFs. Those feel like a special kind of painful.

u/Helpful_Regular_30 — 4 hours ago