u/Mameiro

Live web retrieval in RAG is harder than I expected — it behaves more like an evidence layer than search

I’ve been working on RAG systems where the knowledge base is not only internal documents, but also live web content.

One thing surprised me:

The LLM was not always the weakest part.

The retrieval layer was.

With internal docs, the corpus is at least somewhat controlled. But with live web retrieval, the system often gets:

- SEO pages with weak substance

- outdated docs that still rank well

- duplicate articles

- snippets that are too vague to cite

- pages that are related but don’t actually answer the question

- useful facts buried under a lot of irrelevant content

In those cases, the model may sound confident, but it is really just reasoning over messy evidence.

This made me think that web retrieval for RAG should not be treated as “search results for an LLM.”

It should be treated as an evidence layer.

For RAG, I now care less about just title + URL + snippet, and more about whether each retrieved item has the following (a rough sketch of the record comes after the list):

- source type

- publication or modified date

- extracted passage

- canonical URL

- deduplication status

- ranking/confidence signal

- citation-ready metadata
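
Something like this, as a minimal sketch; the field names are mine, just for illustration:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class EvidenceItem:
    """One retrieved item treated as evidence, not just a search hit."""
    canonical_url: str                    # after resolving redirects / stripping tracking params
    source_type: str                      # e.g. "vendor_doc", "internal_doc", "github_issue", "forum"
    passage: str                          # the extracted passage the LLM will actually see
    published_at: Optional[datetime]      # publication or last-modified date, if known
    score: float = 0.0                    # ranking / confidence signal from retriever or reranker
    is_duplicate: bool = False            # flagged during dedup (same canonical URL or near-identical text)
    citation: dict = field(default_factory=dict)  # citation-ready metadata: title, site, author, date
```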

Latency also became a bigger issue than I expected.

In agentic workflows, retrieval may happen multiple times:

  1. query rewrite

  2. web retrieval

  3. source filtering

  4. reranking

  5. generation

  6. verification retrieval

So even small delays compound quickly. I’m starting to think retrieval latency should be measured separately from generation latency, especially at the p95/p99 tail.
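
By "measured separately" I mean per-stage timers with their own percentiles, so a slow retrieval tail can’t hide behind generation time. A minimal sketch using only the standard library:

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from statistics import quantiles

stage_timings = defaultdict(list)   # stage name -> list of wall-clock durations in seconds

@contextmanager
def timed(stage: str):
    """Time one pipeline stage and record it under its own label."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_timings[stage].append(time.perf_counter() - start)

# In the agent loop, wrap each stage separately, e.g.:
#   with timed("query_rewrite"): ...
#   with timed("web_retrieval"): ...
#   with timed("rerank"): ...
#   with timed("generation"): ...

def tail_latency(stage: str) -> None:
    """Report p95/p99 for one stage across all recorded requests."""
    samples = stage_timings[stage]
    if len(samples) < 2:
        return
    cuts = quantiles(samples, n=100)    # 99 cut points; index 94 ~ p95, index 98 ~ p99
    print(f"{stage}: p95={cuts[94]:.3f}s  p99={cuts[98]:.3f}s  n={len(samples)}")
```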

The hardest cases are hybrid systems:

- internal docs

- vendor docs

- GitHub issues

- changelogs

- community discussions

- recent web pages

Ranking across these evidence types is not obvious.

Should a fresh vendor doc outrank an older internal doc?

Should GitHub issues count as reliable evidence?

Should community discussions ever be used in final answers?

Should internal policy always override public documentation?
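
I don’t have settled answers to any of these, but one thing I’m experimenting with is making the trade-offs explicit as per-source priors plus a freshness decay, instead of burying them in a single similarity score. The numbers here are placeholders to show the shape, not recommendations:

```python
from datetime import datetime, timezone
from typing import Optional

# Hand-tuned priors per evidence type (placeholder values).
SOURCE_PRIOR = {
    "internal_policy": 1.0,    # policy questions: internal policy should win outright
    "internal_doc":    0.9,
    "vendor_doc":      0.85,
    "changelog":       0.8,
    "github_issue":    0.6,    # useful leads, weaker evidence
    "community":       0.4,    # rarely cited directly in final answers
}

def evidence_score(similarity: float,
                   source_type: str,
                   published_at: Optional[datetime],
                   half_life_days: float = 180.0) -> float:
    """Blend retriever similarity with a source-type prior and an age-based decay."""
    prior = SOURCE_PRIOR.get(source_type, 0.5)
    if published_at is None:
        freshness = 0.7    # unknown date: mildly penalized, not discarded
    else:
        # assumes timezone-aware datetimes
        age_days = max((datetime.now(timezone.utc) - published_at).days, 0)
        freshness = 0.5 ** (age_days / half_life_days)
    # Freshness dampens the score but never zeroes it out, so old-but-canonical docs survive.
    return similarity * prior * (0.5 + 0.5 * freshness)
```

The open question is whether priors like this should change with query intent, which is partly why the pipeline below starts with intent detection.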

I don’t think a single top-k retrieval step is enough for this kind of setup.

What I’m currently testing is a pipeline like this (rough code skeleton after the list):

  1. detect query intent

  2. choose retrieval scope

  3. retrieve from web/internal sources

  4. dedupe

  5. filter by freshness/source type

  6. rerank

  7. format results as structured evidence

  8. generate with citation constraints
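
As code, the skeleton is just this. Every helper is a placeholder passed in from outside; the names are mine, and the real work (and most of the latency) lives inside them:

```python
from typing import Callable, List

def detect_intent(query: str) -> str:
    # Placeholder intent detection; in practice a small classifier or an LLM call.
    return "troubleshooting" if any(w in query.lower() for w in ("error", "fails", "broken")) else "general"

def choose_scope(intent: str) -> List[str]:
    # Troubleshooting queries tend to need fresh public sources; general ones may not.
    return ["internal", "web"] if intent == "troubleshooting" else ["internal"]

def answer(query: str, retrieve: Callable, dedupe: Callable, filter_items: Callable,
           rerank: Callable, to_evidence: Callable, generate: Callable) -> str:
    """Pipeline skeleton; retrieval, reranking and generation are injected."""
    intent = detect_intent(query)             # 1. detect query intent
    scope = choose_scope(intent)              # 2. choose retrieval scope
    raw = retrieve(query, scope)              # 3. retrieve from web/internal sources
    unique = dedupe(raw)                      # 4. dedupe (canonical URL / near-identical text)
    kept = filter_items(unique, intent)       # 5. filter by freshness / source type
    ranked = rerank(query, kept)              # 6. rerank
    evidence = to_evidence(ranked[:8])        # 7. format as structured evidence records
    return generate(query, evidence)          # 8. generate with citation constraints
```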

Curious how others are handling this.

For production RAG systems with live web retrieval:

- Do you merge web results with vector DB results, or keep them separate?

- How do you decide when to use web retrieval?

- Do you rank official docs differently from forums/GitHub issues?

- Are you measuring retrieval latency separately?

- How do you handle stale pages that still rank well?
