u/21734234

The reason your enterprise RAG pipeline degrades over time (it's not the model)

Spent the last few months debugging production AI systems for a handful of mid-to-large orgs, and I keep seeing the same failure pattern that nobody really talks about in the benchmarking literature.

The model isn't the problem. The retrieval isn't even really the problem. The problem is document heterogeneity rot.

Here's what I mean. When you first stand up a RAG system, your corpus is relatively clean. You've chunked it, embedded it, indexed it. The retrieval scores look great in eval. Then six months pass.

Now you have:

  • A 2023 policy doc that was superseded by a 2024 amendment that lives in a completely different folder
  • Meeting transcripts that reference decisions that were later reversed over email (and the email thread was never indexed)
  • Contracts with line-item exceptions that got negotiated verbally and exist only in someone's Outlook

Your retrieval system has no concept of document authority hierarchy. It treats a deprecated policy PDF the same as the current one because cosine similarity doesn't care about org chart logic or recency signals beyond naive metadata.
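To make that concrete, here's a minimal sketch of the kind of post-retrieval rerank I mean. Everything in it is illustrative: the `authority` and `last_reviewed` fields are metadata you'd have to attach yourself at ingest, and the half-life value is a knob to tune, not a recommendation.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Chunk:
    text: str
    similarity: float   # cosine score straight from the vector store
    authority: float    # 0..1, e.g. 1.0 for ratified policy, 0.3 for a transcript
    last_reviewed: date # when a human last confirmed this doc is still current

def rerank(chunks, half_life_days=180, today=None):
    """Blend raw similarity with authority and an exponential recency decay."""
    today = today or date.today()
    def score(c):
        age_days = (today - c.last_reviewed).days
        recency = 0.5 ** (age_days / half_life_days)  # halves every half_life_days
        return c.similarity * c.authority * recency
    return sorted(chunks, key=score, reverse=True)
```

The point of multiplying rather than adding the terms is that a stale, low-authority doc can't buy its way back in with a high cosine score alone.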

The fix isn't better chunking or a bigger embedding model. It's building provenance chains into your indexing architecture from the start so the system knows not just what a document says, but whether it's still true.
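One cheap way to represent a provenance chain is a supersession pointer you resolve at query time, so the 2023 policy can never be served without the 2024 amendment winning. This is a toy sketch with made-up doc IDs, not any framework's API:

```python
# Hypothetical supersession map, populated at ingest time: each entry
# points a retired doc at whatever replaced it.
SUPERSEDED_BY = {
    "policy-2023.pdf": "policy-2024-amendment.pdf",
    # any doc missing from this map is presumed current
}

def resolve_current(doc_id):
    """Walk the provenance chain until we reach the live document."""
    seen = set()
    while doc_id in SUPERSEDED_BY:
        if doc_id in seen:  # guard against a cycle from bad metadata
            raise ValueError(f"supersession cycle at {doc_id!r}")
        seen.add(doc_id)
        doc_id = SUPERSEDED_BY[doc_id]
    return doc_id
```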

A few teams I've seen handle this well are essentially building a lightweight governance layer that sits between ingestion and retrieval, tagging documents with confidence decay rates and authority signals instead of treating the corpus as a flat library.
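Roughly, the ingest side of that layer amounts to tagging each document with a class-specific review window and authority weight that retrieval can score and filter on later. The document classes and numbers below are invented for illustration, not from any standard:

```python
from datetime import date, timedelta

# Illustrative decay windows and authority weights per document class.
GOVERNANCE = {
    "ratified_policy":    {"review_after": timedelta(days=365), "authority": 1.0},
    "meeting_transcript": {"review_after": timedelta(days=60),  "authority": 0.4},
    "email_thread":       {"review_after": timedelta(days=30),  "authority": 0.3},
}

def tag_on_ingest(doc_id, doc_class, ingested=None):
    """Attach governance metadata at ingest so retrieval isn't flying blind."""
    ingested = ingested or date.today()
    rules = GOVERNANCE[doc_class]
    return {
        "doc_id": doc_id,
        "doc_class": doc_class,
        "ingested": ingested.isoformat(),
        "review_by": (ingested + rules["review_after"]).isoformat(),
        "authority": rules["authority"],
    }
```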

It's more engineering overhead upfront. But it's the only thing that actually keeps production accuracy from drifting.
