u/pmv143

We’ve been running RAG in production for a while. It worked, but maintaining it was a constant tax: re-embedding on data changes, tuning chunking strategies, debugging retrieval misses, managing the vector database. Every moving part was something that could break.

So we ran an experiment. Instead of chunking and embedding documents, we loaded the full document into context, cached the KV state persistently, and reused that cache across every query.
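
To make the cache lifecycle concrete, here is a minimal sketch of the bookkeeping only, not our actual serving code. The `prefill` function is a hypothetical stand-in for the expensive pass that would produce real key/value tensors, and the class name, file layout, and use of `pickle` are assumptions for illustration. The cache is keyed by a hash of the document text, so an edited document naturally gets a fresh entry.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path


def prefill(document: str) -> dict:
    """Hypothetical stand-in for the expensive prefill pass.

    A real serving stack would run the model over the document once and
    return its key/value tensors. Here we return a small placeholder.
    """
    return {"num_chars": len(document)}


class PersistentKVCache:
    """Stores one prefill result per document on disk, keyed by content hash."""

    def __init__(self, cache_dir: Path):
        self.cache_dir = cache_dir
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.prefill_calls = 0  # counts cold loads, for illustration only

    def _path(self, document: str) -> Path:
        digest = hashlib.sha256(document.encode("utf-8")).hexdigest()
        return self.cache_dir / f"{digest}.kv"

    def load_or_build(self, document: str) -> dict:
        path = self._path(document)
        if path.exists():
            # Warm path: reuse the state persisted by an earlier query.
            with path.open("rb") as f:
                return pickle.load(f)
        # Cold path: pay the prefill cost once, then persist the result.
        state = prefill(document)
        self.prefill_calls += 1
        with path.open("wb") as f:
            pickle.dump(state, f)
        return state


def answer(cache: PersistentKVCache, document: str, question: str) -> str:
    state = cache.load_or_build(document)
    # A real system would decode from the cached state plus the question tokens.
    return f"[{state['num_chars']} chars of context] {question}"


if __name__ == "__main__":
    cache = PersistentKVCache(Path(tempfile.mkdtemp()))
    doc = "Refund policy: 30 days."
    print(answer(cache, doc, "What is the refund window?"))
    print(answer(cache, doc, "Who approves refunds?"))
    print("prefill calls:", cache.prefill_calls)
```

Two queries against the same document trigger one prefill. Changing the document changes the hash, which forces one new prefill and leaves the old entry to be cleaned up separately.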

No vector database. No embedding pipeline. No retrieval step. Just the model with full document context, warm and ready.

What we found:

• Answer quality is noticeably better: no retrieval misses, no wrong chunks, full context every time
• Updates are dramatically faster: change the document, regenerate the cache, done in minutes vs. hours of re-indexing
• Operational complexity dropped significantly: no pipeline to maintain, no retrieval quality to monitor
• Current limit is around 120k tokens. That works for most business documents, not for massive corpora
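
For the 120k limit, a rough pre-check like the one below can decide whether a document is a candidate for the cached path at all. This is a sketch under stated assumptions: the 4-characters-per-token ratio is only a crude heuristic for English text (a real check would use the model's own tokenizer), and the reserved headroom and function names are hypothetical.

```python
CONTEXT_LIMIT_TOKENS = 120_000  # the limit mentioned in this post
CHARS_PER_TOKEN = 4             # assumption: crude heuristic for English text
RESERVED_TOKENS = 4_000         # assumption: headroom for the question and answer


def estimate_tokens(text: str) -> int:
    """Estimate the token count from character length (ceiling division)."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_in_cache(document: str,
                  limit: int = CONTEXT_LIMIT_TOKENS,
                  reserved: int = RESERVED_TOKENS) -> bool:
    """True if the document plus reserved headroom fits under the limit."""
    return estimate_tokens(document) + reserved <= limit


def choose_path(document: str) -> str:
    """Route a document to the cached path or back to a retrieval pipeline."""
    return "kv_cache" if fits_in_cache(document) else "retrieval"


if __name__ == "__main__":
    small_doc = "x" * 400_000  # about 100k estimated tokens
    large_doc = "x" * 480_000  # about 120k estimated tokens, no headroom left
    print(choose_path(small_doc))
    print(choose_path(large_doc))
```

Anything routed to `retrieval` here is exactly the case where the cached approach does not apply and something like the old pipeline is still needed.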

Where it breaks down:
• Documents larger than the context window are still a problem
• Very large document collections still need a different approach
• Cold cache on first load takes time; warm queries are fast

We’re genuinely curious if others have tried this. Especially interested in:
• How your use cases map to context window limits
• Whether retrieval quality was your biggest RAG pain point or something else
• What you’d need to see to replace your RAG pipeline entirely

Happy to answer any questions.

We replaced our RAG pipeline with a persistent KV cache. It works. Here’s what we found.