
There's a specific kind of frustration that every developer knows. You're in the middle of something, you hit a wall, and you open the PyTorch docs. Twenty minutes later, you've read three pages, followed two rabbit holes, and you still haven't found the one line you needed.

I got tired of that. So I built something to fix it.

In four hours on a single GPU instance, I put together a system that lets you ask plain English questions and get answers pulled directly from real documentation — cited, grounded, no hallucination. Ask it "how do I move a model to GPU?" and it tells you .to(device), points you to exactly which file that came from, and moves on.

Here's how it went.

Result First

Before anything else, this is what it actually looks like in practice.

Q: How do I move a PyTorch model to a GPU?

https://preview.redd.it/vyl685rl8wzg1.png?width=1280&format=png&auto=webp&s=4e907f63366d625f716b68575fd5a42e269e04cb

Q: How do I use a tokenizer with Hugging Face Transformers?

https://preview.redd.it/qmf7j0sq9wzg1.png?width=1280&format=png&auto=webp&s=b733b846938047989e0f254b9fb1e411330c0cfa

Q: How do I use Dataloader in PyTorch?

https://preview.redd.it/cpny6k8r9wzg1.png?width=1280&format=png&auto=webp&s=da5f2bb7c9ed9e53d3e1ff4088f5c25156e6d8be

I also built a second version of this — the same architecture, but pointed at internal office documents instead of PyTorch. HR policies, IT procedures, and finance reimbursement guides.

An employee asks, "How do I request annual leave?" and gets a cited answer in under 2 seconds. Same idea, completely different world.

Q: How do I request annual leave?

https://preview.redd.it/5m6dedju9wzg1.png?width=1280&format=png&auto=webp&s=96b22caf0fb282af286421e08d594a3e7c7c9038

Q: How do I submit a travel reimbursement?

https://preview.redd.it/tivazy7v9wzg1.jpg?width=1280&format=pjpg&auto=webp&s=31d162aef9b12430ccd1c96c943021f6ac5f2a1c

Q: Who should I contact for IT support?

https://preview.redd.it/t0supqhx9wzg1.png?width=1280&format=png&auto=webp&s=86ba7fff79775a14e44793ea4a0da1936c49765a

Both versions. One afternoon. One GPU. This becomes genuinely useful anywhere people are tired of manually searching through documentation: developers jumping between hundreds of pages to find a single method, teams building internal assistants that understand their own codebase or company policies, or new hires trying to onboard into an unfamiliar framework, tool, or organization without constantly asking someone else for help.

The Concept

This pattern is called RAG — Retrieval-Augmented Generation. The name is a mouthful, but the idea is simple: instead of asking a language model to answer from memory (where it might hallucinate), you first retrieve the relevant text from a real source, then ask the model to generate an answer based only on what was retrieved.

It's the difference between asking someone to guess an answer and handing them the right page of a textbook first.

Here's the full flow:

https://preview.redd.it/nqrma0jy9wzg1.png?width=628&format=png&auto=webp&s=121bb05e3a3040e9c8161d1339b968acb10c81f7

The key insight: the LLM never has to know PyTorch from memory. It only has to read what you hand it. That's what keeps the answers grounded and the sources honest.
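
Here's a toy-scale illustration of the retrieval half, using the same embedding model as the full build below. The three one-line "documents" are made up purely for this example:

from sentence_transformers import SentenceTransformer, util

# Three tiny stand-in "documents" just to show the idea
docs = [
    "Move a model to the GPU with model.to(device).",
    "DataLoader batches and shuffles samples from a Dataset.",
    "A tokenizer converts raw text into input IDs for the model.",
]

model = SentenceTransformer('all-MiniLM-L6-v2')
doc_emb = model.encode(docs, convert_to_tensor=True)
query_emb = model.encode("How do I put my model on the GPU?", convert_to_tensor=True)

# Cosine similarity picks the most relevant document
best = util.cos_sim(query_emb, doc_emb).argmax().item()
print(docs[best])  # this is the text you would hand to the LLM as context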

Step by Step

1. Setup

Everything ran on a single instance from a cloud GPU platform. One GPU. No cluster. No expensive infrastructure. That matters — it means this is something you can actually replicate.

https://preview.redd.it/knx64osz9wzg1.png?width=1057&format=png&auto=webp&s=78cb2d609f26bdf929580804fa6d4d5516c5d662

  • GPU: NVIDIA RTX 5090 — 32GB VRAM
  • CUDA: 13.0
  • Framework: PyTorch
  • Cost: $0.38 / hour
  • Region: Singapore-A

2. Data

Developer Assistant: I pulled the actual source repositories — not a curated sample, the real thing:

git clone --depth 1 https://github.com/huggingface/transformers.git
git clone --depth 1 https://github.com/pytorch/pytorch.git
  • 884 corpus files across both repos
  • ~6.2 million characters of raw text
  • 9,192 chunks after splitting

That's a realistic knowledge base. Not a demo dataset with 20 files. The kind of scale where retrieval actually has to work.

https://preview.redd.it/6uwuycx0awzg1.png?width=499&format=png&auto=webp&s=97a84989b930eab5aa9f7359a44bfba459c4afa0

Office Knowledge Assistant: For the internal office version, I generated a structured synthetic dataset simulating a real company's internal documentation:

  • 300 documents across HR, IT, Finance, Operations, and Admin
  • Topics: leave policy, remote work, reimbursement, VPN access, onboarding, and more
  • ~600 chunks after splitting

Smaller scale, but deliberately structured to mirror the messy reality of how internal company knowledge actually lives — spread across departments, sometimes overlapping, never perfectly organized.

https://preview.redd.it/f6ivxnx1awzg1.png?width=547&format=png&auto=webp&s=82286915d69c1d43b9415a826064c142ec1ff161

3. Collect and Prepare the Documents

For the developer assistant, this meant cloning the repos. For the office assistant, generating the document set. Either way, the output is the same: a folder of raw text files representing your knowledge domain.

The preprocessing step strips noise, normalizes whitespace, and converts everything to clean UTF-8 text. Nothing fancy — just making sure the data is consistent before it gets split up.

# prepare_corpus.py — simplified version
import os
import re

def clean_text(text):
    # Strip stray control characters and collapse runaway whitespace
    text = text.replace('\x00', '')
    text = re.sub(r'[ \t]+', ' ', text)
    text = re.sub(r'\n{3,}', '\n\n', text)
    return text.strip()

def prepare_corpus(source_dir, output_dir):
    os.makedirs(output_dir, exist_ok=True)
    files_processed = 0
    total_chars = 0

    for root, _, files in os.walk(source_dir):
        for fname in files:
            if fname.endswith(('.rst', '.md', '.txt')):
                with open(os.path.join(root, fname), 'r', encoding='utf-8', errors='ignore') as f:
                    text = f.read()

                # Clean and normalize
                text = clean_text(text)

                # Write back out as clean UTF-8
                # (simplified: name collisions across subdirectories are ignored)
                out_path = os.path.join(output_dir, fname)
                with open(out_path, 'w', encoding='utf-8') as f:
                    f.write(text)

                files_processed += 1
                total_chars += len(text)

    print(f"Prepared corpus files: {files_processed}")
    print(f"Total characters: {total_chars:,}")

Output when you run it:

Prepared corpus files: 884 
Total characters: 6,264,627
Output directory: /root/dev_doc_rag/corpus

4. Chunking

You can't embed an entire 50-page document as a single vector — the signal gets lost. You split it into chunks, small enough to be semantically focused, large enough to contain a complete thought.

The important detail: overlapping chunks. If you split cleanly at every 512 tokens, you'll sometimes cut a sentence right in the middle of the answer. Overlap means each chunk shares some content with its neighbors, so nothing falls through the cracks.

def chunk_text(text, chunk_size=512, overlap=50):
    # Slide a window of chunk_size words, stepping by chunk_size - overlap
    # so neighboring chunks share `overlap` words and nothing gets cut off
    words = text.split()
    chunks = []

    for i in range(0, len(words), chunk_size - overlap):
        chunk = ' '.join(words[i:i + chunk_size])
        if chunk:
            chunks.append(chunk)

    return chunks

Result: 9,192 chunks from 884 files. Each chunk is one searchable unit.
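
One simple way to assemble those chunks, together with a parallel chunk_sources list (the lookup used later to cite which file an answer came from), from the prepared corpus folder:

import os

corpus_dir = '/root/dev_doc_rag/corpus'  # the output directory from the prepare step

chunks = []
chunk_sources = []  # parallel list: which corpus file each chunk came from

for fname in sorted(os.listdir(corpus_dir)):
    with open(os.path.join(corpus_dir, fname), 'r', encoding='utf-8', errors='ignore') as f:
        text = f.read()
    for chunk in chunk_text(text):
        chunks.append(chunk)
        chunk_sources.append(fname)

print(f"Total chunks: {len(chunks)}")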

5. Embedding

Every chunk gets converted into a dense vector — a list of numbers that represents its semantic meaning. Chunks that mean similar things will have similar vectors, even if they use different words. That's what makes semantic search work.

FAISS (Facebook AI Similarity Search) stores all those vectors and makes it fast to find the closest matches to any new query.

from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# Load embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Embed all chunks
embeddings = model.encode(chunks, show_progress_bar=True)
embeddings = np.array(embeddings).astype('float32')

# Build FAISS index
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(embeddings)

print(f"Indexed {index.ntotal} chunks")

Indexing 9,192 chunks on the RTX 5090 took ~13.43 seconds. Once the index is built, it lives in memory and queries hit it in milliseconds.
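
If you don't want to re-embed everything on every restart, FAISS can also persist the index to disk (the filename here is arbitrary); you'd want to save the chunks and chunk_sources lists alongside it, e.g. as JSON:

import faiss

# Save the built index next to the corpus
faiss.write_index(index, 'doc_index.faiss')

# Later, or in another process, load it back instead of rebuilding
index = faiss.read_index('doc_index.faiss')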

6. Query Processing

When a user asks a question, the same embedding model converts it to a vector, FAISS finds the top-k most similar chunks, and those chunks get handed to the LLM as context.

def answer_question(query, index, chunks, chunk_sources, model, llm, k=5):
    # Embed the query with the same model used for the chunks
    query_embedding = model.encode([query]).astype('float32')

    # Retrieve top-k chunks
    distances, indices = index.search(query_embedding, k)
    relevant_chunks = [chunks[i] for i in indices[0]]

    # Build prompt
    context = "\n\n".join(relevant_chunks)
    prompt = f"""Answer the following question based only on the provided context.

Context:
{context}

Question: {query}

Answer:"""

    # Generate answer and collect the source file for each retrieved chunk
    response = llm.generate(prompt)
    sources = [chunk_sources[i] for i in indices[0]]

    return response, sources

The LLM doesn't browse the internet. It doesn't guess. It reads what FAISS found and answers from that. That's the whole trick.
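
The llm object is deliberately abstract: anything with a generate(prompt) method works. As one possible way to wire it up with the Hugging Face pipeline API (the model name here is only an illustration, swap in whatever fits your VRAM):

from transformers import pipeline

# Any instruction-tuned model works here; this one is just an example
generator = pipeline('text-generation', model='Qwen/Qwen2.5-1.5B-Instruct', device=0)

class PipelineLLM:
    def generate(self, prompt):
        out = generator(prompt, max_new_tokens=256, do_sample=False, return_full_text=False)
        return out[0]['generated_text'].strip()

llm = PipelineLLM()

answer, sources = answer_question(
    "How do I move a PyTorch model to a GPU?",
    index, chunks, chunk_sources, model, llm, k=5,
)
print(answer)
print("Sources:", sorted(set(sources)))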

The Performance

Developer Documentation Assistant

  • Indexing time: 13.43 seconds
  • Query latency: 2.3 – 2.6 seconds
  • Files indexed: 884
  • Total chunks: 9,192

Office Knowledge Assistant

  • Indexing time: 8.35 seconds
  • Query latency: 0.15 – 1.96 seconds
  • Files indexed: 300
  • Total chunks: ~600

The office assistant is faster because its index is smaller, with fewer vectors to search. The developer assistant handles a roughly 15x larger dataset and still responds in under 3 seconds. Both are interactive. Neither requires a cluster.

https://preview.redd.it/a87mfw33awzg1.png?width=565&format=png&auto=webp&s=a0919280ac3a94404028af8eb48ae4533f7a11c8

Honest Warnings

  • Smaller models drift. When I used a lighter LLM for generation, the answers occasionally padded themselves with unnecessary detail or made small inferential leaps that weren't in the source text. Bigger models stay closer to the retrieved content.
  • Similar documents confuse retrieval. If you have 10 files that all describe the same leave policy with slightly different wording, FAISS might return 5 of them as top-k for one query. The answer might be fine, but the sources feel redundant (deduplicating them by filename before display helps; see the sketch after this list).
  • Synthetic data has limits. The office assistant ran on documents I generated to simulate company policies. Real internal documents are messier — inconsistent formats, missing context, ambiguous wording. The system would need more careful preprocessing in a real deployment.
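
A minimal sketch of that deduplication, keeping the first occurrence of each file in retrieval order:

def dedupe_sources(sources):
    # Keep only the first occurrence of each source file, preserving order
    seen = set()
    unique = []
    for s in sources:
        if s not in seen:
            seen.add(s)
            unique.append(s)
    return unique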

Both systems I built are deliberately simple. A developer doc assistant. An office knowledge base. But the same four steps — collect, chunk, embed, query — apply to a much wider surface area than that.

Think about what "documents your team is tired of searching" looks like in different contexts:

  • Legal teams have contracts, clauses, and precedents. Instead of a lawyer spending an hour locating a specific indemnification clause across 200 past contracts, a RAG system retrieves it in seconds.
  • Support teams have ticket histories, resolution logs, and product manuals. A RAG assistant trained on past resolved tickets can suggest answers to new ones automatically, cutting handling time dramatically.
  • Research teams have papers, notes, and literature reviews. Ask "what did the 2023 papers say about attention mechanisms in vision transformers?" and get a synthesized answer with citations, instead of manually rereading 40 PDFs.
  • Onboarding is a particularly compelling one. Every company has a mountain of documentation that new hires need to absorb in their first few weeks. Instead of burying them in Notion pages, give them a system they can just ask. The knowledge is already there — it just needs to be made queryable.

The architecture doesn't change. The embedding model doesn't change. FAISS doesn't change. What changes is the folder of documents you point it at.

That's the part I find genuinely interesting about this — it's a general-purpose tool dressed up as a specific solution. Once you understand the pipeline, you start seeing document retrieval problems everywhere.

Four hours. One GPU. Two working systems.

The developer documentation assistant handles 884 real files from PyTorch and Hugging Face, answers in under 3 seconds, and cites its sources. The office assistant handles 300 internal policy documents across 5 departments and responds in under 2 seconds.

Neither of these is rocket science. The pieces — FAISS, sentence transformers, a language model — are all open and well-documented. What this project is really about is putting them together in the right order and pointing them at a real problem.

If you're sitting on a pile of documentation that people in your team are tired of searching through manually, this is the pattern you want. The setup cost is one afternoon. The payoff is a system that keeps working after you've moved on to the next thing.

That's a trade I'll take every time.
