We indexed 78,000 public domain books on self-hosted Qwen models. Here’s what the RAG pipeline looks like and what we learned
I’m part of a small team running our own GPU infrastructure in Gijón, northern Spain. It’s part-powered by solar and fully self-hosted. So no cloud and no external API calls.
In collaboration with Project Gutenberg, we built projectgutenberg.empathy.ai, which is a semantic discovery layer over their entire library.
I wanted to share this because scaling self-hosted open-source models to this size has brought up some interesting challenges for us, and some of the solutions we landed on might be useful for what people here are building now or in the future.
There are some interesting conversations in this subreddit about RAG and hallucinations, so I’ve added details on those too.
Why this is a harder retrieval problem than it looks
Traditional book discovery runs on metadata: genre tags, author matching and purchase behaviour. But it doesn’t work for the queries that matter in this context. A query like “something with the existential weight of Dostoevsky but shorter” doesn’t return anything useful from a genre filter.
What we wanted was intent matching. The problem is that a search like “something hopeful but not naive” has zero lexical overlap with the passages that would satisfy it. The signal you’re matching against isn’t keywords, it’s narrative structure, emotional arc, and thematic patterns.
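To make the “zero lexical overlap” point concrete, here is a minimal sketch that measures word overlap between that query and a passage a reader might accept as a good match. The passage is made up for illustration, and the stopword list is a tiny hand-picked assumption, not something taken from our pipeline.

```python
import re

# Tiny illustrative stopword list (an assumption, not a real linguistic resource).
STOPWORDS = {
    "a", "an", "the", "but", "not", "and", "of", "to", "in",
    "was", "it", "she", "he", "that", "something",
}

def tokens(text: str) -> set[str]:
    """Lowercase word tokens with the stopwords above removed."""
    return {w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS}

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two strings."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

query = "something hopeful but not naive"
# Hypothetical passage, invented for this example.
passage = "She knew the winter would be long, and planted the orchard anyway."

overlap = jaccard(query, passage)  # no shared content words, so this is 0.0
```

A keyword index has nothing to work with here, which is why the matching has to happen in embedding space instead.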
The stack
The models all run on our own hardware in Asturias. Everything is open-weight and auditable. Importantly for us, there’s no reliance on OpenAI, AWS or similar providers.
- Qwen3.5-2B
- Qwen2.5-7B-Instruct
- Qwen3.5-9B
- Qwen3-8B-FP8
- Qwen3.6-27B-FP8
- Qwen3-30B-A3B-Instruct-2507-FP8
The ingestion pipeline
Documents go through five sequential phases: fetching, transforming, enriching, storing, and post-processing. For me, the interesting part happens in enriching.
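For anyone who prefers code to prose, the five phases can be sketched as an ordered list of functions applied one after another. This is an illustrative skeleton rather than the production code: the `Document` class, the placeholder bodies and the book id `book-0001` are all hypothetical, and putting the splitting step inside the transforming phase is an assumption.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    book_id: str
    text: str = ""
    chunks: list[str] = field(default_factory=list)
    completed: list[str] = field(default_factory=list)  # phases already run

def fetching(doc: Document) -> Document:
    # Placeholder: a real fetcher would download the book text.
    doc.text = "First paragraph.\n\nSecond paragraph."
    return doc

def transforming(doc: Document) -> Document:
    # Placeholder splitter: paragraphs stand in for token-based chunks.
    doc.chunks = [p.strip() for p in doc.text.split("\n\n") if p.strip()]
    return doc

def enriching(doc: Document) -> Document:
    # Placeholder: the book id stands in for LLM-written context.
    doc.chunks = [f"[{doc.book_id}] {chunk}" for chunk in doc.chunks]
    return doc

def storing(doc: Document) -> Document:
    # Placeholder: a real implementation would write to a vector store.
    return doc

def post_processing(doc: Document) -> Document:
    # Placeholder for whatever cleanup or bookkeeping runs last.
    return doc

PHASES = [
    ("fetching", fetching),
    ("transforming", transforming),
    ("enriching", enriching),
    ("storing", storing),
    ("post-processing", post_processing),
]

def run_pipeline(doc: Document) -> Document:
    """Run every phase in order and record which ones completed."""
    for name, phase in PHASES:
        doc = phase(doc)
        doc.completed.append(name)
    return doc

result = run_pipeline(Document(book_id="book-0001"))
```

Keeping the phases as a plain ordered list makes it easy to see where a document stopped if one phase fails.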
After token-splitting, every chunk goes through an LLM-powered contextual enrichment step. Basically each chunk gets a precise summary of where it sits in the broader document before it ever reaches the vector store. This is what makes retrieval work at this scale.
A chunk that reads “he could not forgive himself” is nearly useless on its own. But with its context attached (e.g. which character, which moment, which book) it becomes retrievable for the right query.
This approach draws on Anthropic’s published contextual retrieval research, which reported a 60%+ reduction in retrieval failures when contextual retrieval is combined with reranking. Their research is open, but the implementation and inference are entirely ours.
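The enrichment step can be sketched roughly like this. The prompt wording, the `enrich_chunk` helper and the `fake_generate` stub are assumptions for illustration; in a real setup `generate` would call whichever self-hosted model handles enrichment, and the full document would be the actual book text.

```python
from typing import Callable

# Hypothetical prompt; the exact wording is an assumption.
CONTEXT_PROMPT = (
    "<document>\n{document}\n</document>\n"
    "Here is a chunk from that document:\n"
    "<chunk>\n{chunk}\n</chunk>\n"
    "Write a short context that situates this chunk within the document, "
    "to improve search retrieval. Answer with the context only."
)

def enrich_chunk(document: str, chunk: str, generate: Callable[[str], str]) -> str:
    """Ask the model for situating context and prepend it to the chunk.

    The returned string is what gets embedded and stored, so the context
    travels with the chunk into the vector store.
    """
    prompt = CONTEXT_PROMPT.format(document=document, chunk=chunk)
    context = generate(prompt).strip()
    return f"{context}\n\n{chunk}"

def fake_generate(prompt: str) -> str:
    # Stub standing in for an LLM call so the example runs offline.
    return "Context: near the end of the novel, the narrator reflects on leaving his brother."

enriched = enrich_chunk(
    document="(full book text would go here)",
    chunk="he could not forgive himself",
    generate=fake_generate,
)
```

The original chunk text is kept verbatim at the end, so the passage shown to the user is unchanged and only the embedded text gains context.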
On hallucinations and how we address them
This comes up often in RAG discussions and I’ve seen it in many other threads. So, three things that actually worked for us:
Citations as the only honest check:
Every response surfaces the source passage it drew from. If the cited passage doesn’t support the claim, then the system lied. There’s no other mechanism that makes output trustworthy without re-reading every source yourself.
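A mechanical pre-check can at least catch citations that point nowhere. The sketch below assumes a hypothetical response format where each claim carries the id of the passage it cites and the quote it relied on. It only verifies that the quote really appears in the cited passage; it does not judge whether the passage supports the claim, which still takes a reader looking at the source.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Passage:
    passage_id: str
    text: str

@dataclass(frozen=True)
class Claim:
    text: str        # what the response asserts
    passage_id: str  # which retrieved passage it cites
    quote: str       # the span the model says it relied on

def normalise(s: str) -> str:
    """Lowercase and collapse whitespace so line breaks don't cause false misses."""
    return " ".join(s.lower().split())

def unsupported_claims(claims: list[Claim], retrieved: list[Passage]) -> list[Claim]:
    """Return claims whose citation is missing or whose quote is not in the cited passage."""
    by_id = {p.passage_id: p for p in retrieved}
    flagged = []
    for claim in claims:
        passage = by_id.get(claim.passage_id)
        quote = normalise(claim.quote)
        if passage is None or not quote or quote not in normalise(passage.text):
            flagged.append(claim)
    return flagged

passages = [Passage("p1", "He walked to the  river at dawn.")]
claims = [
    Claim("He went to the river.", "p1", "walked to the river"),  # quote found
    Claim("He met his sister.", "p2", "met his sister"),          # unknown passage id
    Claim("He swam across.", "p1", "swam across"),                # quote not in passage
]
flagged = unsupported_claims(claims, passages)
```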
Reranking before generation:
Chunks are scored for relevance before reaching the model. Many lightweight RAG setups skip this step, but a lot of the hallucination risk lives here: if weakly related chunks reach the model, it tends to build an answer around them anyway.
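The shape of a reranking step is simple: score every candidate against the query, drop anything below a threshold, and pass only the best few to the generator. In the sketch below the `rerank` helper, the threshold and the `overlap_score` function are illustrative assumptions. The word-overlap scorer is only a stand-in so the example runs without a model; a real reranker would be a cross-encoder or an LLM-based scorer.

```python
from typing import Callable

def rerank(
    query: str,
    chunks: list[str],
    score: Callable[[str, str], float],
    top_k: int = 3,
    min_score: float = 0.0,
) -> list[tuple[float, str]]:
    """Score chunks against the query, filter by min_score, return the top_k best."""
    scored = [(score(query, chunk), chunk) for chunk in chunks]
    kept = [(s, chunk) for s, chunk in scored if s >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:top_k]

def overlap_score(query: str, chunk: str) -> float:
    # Toy scorer: fraction of query words present in the chunk.
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0

candidates = [
    "a storm at sea",
    "guilt and forgiveness after the war",
    "recipes for bread",
]
kept = rerank("guilt and forgiveness", candidates, overlap_score, top_k=2, min_score=0.5)
```

The threshold matters as much as the ordering: returning fewer chunks, or none, is better than padding the context with passages that scored poorly.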
Intent expansion before retrieval:
The natural language query gets translated into the semantic space the index lives in before retrieval fires. Most of the quality difference comes from this step, not the model size or context window.
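One common way to do this kind of expansion, shown here as an assumption rather than a description of our exact implementation, is to have a model rewrite the query into a few descriptions phrased like the enriched chunks, search with each of them, and merge the results. The `expand` and `search` callables, the stubbed index and the scores below are all hypothetical.

```python
from typing import Callable

def expand_and_retrieve(
    query: str,
    expand: Callable[[str], list[str]],
    search: Callable[[str, int], list[tuple[str, float]]],
    k: int = 5,
) -> list[tuple[str, float]]:
    """Search with the query plus its expansions, keeping each chunk's best score."""
    variants = [query] + expand(query)
    best: dict[str, float] = {}
    for variant in variants:
        for chunk_id, score in search(variant, k):
            if score > best.get(chunk_id, float("-inf")):
                best[chunk_id] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]

def fake_expand(query: str) -> list[str]:
    # Stub standing in for an LLM rewrite of the user's intent.
    return ["characters who keep going after loss", "optimism tested by hardship"]

# Stub index: maps a search string to (chunk_id, similarity) pairs.
FAKE_INDEX = {
    "something hopeful but not naive": [("c1", 0.41)],
    "characters who keep going after loss": [("c2", 0.83), ("c1", 0.55)],
    "optimism tested by hardship": [("c3", 0.78)],
}

def fake_search(text: str, k: int) -> list[tuple[str, float]]:
    return FAKE_INDEX.get(text, [])[:k]

hits = expand_and_retrieve("something hopeful but not naive", fake_expand, fake_search, k=2)
```

In this toy run the raw query only finds a weak match, while the expanded descriptions surface the stronger ones, which is the effect the step is meant to have.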
Happy to go deeper on any of the pipeline decisions in the comments.
You can try it out yourself: projectgutenberg.empathy.ai