u/Altruistic_Night_327

Wanted to share an approach I've been using for retrieval-augmented generation over large codebases and get feedback from people thinking about similar problems.

The problem

Naive codebase RAG typically works by chunking files into text segments and embedding them for similarity search. This breaks down on code because semantic similarity at the chunk level doesn't capture structural relationships — a function in file A calling a type defined in file B won't surface that dependency through embedding proximity alone.

The approach: AST-derived typed graphs

Instead of chunking, I parse every file using Tree-sitter into its AST, then extract a typed node/edge graph (a minimal sketch follows the list):

  • Nodes: functions, classes, interfaces, types, modules
  • Edges: imports, exports, call relationships, inheritance, composition
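
A minimal sketch of the extraction pass, assuming the Node Tree-sitter bindings and the tree-sitter-typescript grammar; the GraphNode shape and the set of declaration types covered here are illustrative, not the full schema:

```typescript
import Parser from "tree-sitter";
import TypeScript from "tree-sitter-typescript";

interface GraphNode {
  id: string;   // file-qualified identifier, e.g. "src/store.ts:GraphStore"
  kind: string; // Tree-sitter node type
  name: string;
  file: string;
}

// Declaration node types from the tree-sitter-typescript grammar.
const DECLARATIONS = new Set([
  "function_declaration",
  "class_declaration",
  "interface_declaration",
  "type_alias_declaration",
]);

function extractNodes(file: string, source: string): GraphNode[] {
  const parser = new Parser();
  parser.setLanguage(TypeScript.typescript);
  const tree = parser.parse(source);
  const out: GraphNode[] = [];

  const walk = (node: Parser.SyntaxNode): void => {
    if (DECLARATIONS.has(node.type)) {
      const name = node.childForFieldName("name")?.text ?? "<anonymous>";
      out.push({ id: `${file}:${name}`, kind: node.type, name, file });
    }
    for (const child of node.namedChildren) walk(child);
  };
  walk(tree.rootNode);
  return out;
}
```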

The extracted graph is stored in SQLite and persists across sessions; parse cost is one-time per project.
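
For concreteness, one possible shape for that persistence layer (better-sqlite3 shown; the table and column names are assumptions for illustration):

```typescript
import Database from "better-sqlite3";

const db = new Database("codegraph.db");
db.exec(`
  CREATE TABLE IF NOT EXISTS nodes (
    id        TEXT PRIMARY KEY,  -- file-qualified identifier
    kind      TEXT NOT NULL,     -- function | class | interface | type | module
    name      TEXT NOT NULL,
    signature TEXT,
    docstring TEXT,
    file      TEXT NOT NULL
  );
  CREATE TABLE IF NOT EXISTS edges (
    src  TEXT NOT NULL REFERENCES nodes(id),
    dst  TEXT NOT NULL REFERENCES nodes(id),
    kind TEXT NOT NULL            -- imports | exports | calls | inherits | composes
  );
`);
```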

Retrieval: BM25 over graph nodes

At query time, instead of embedding similarity, I run BM25 scoring over node metadata (names, signatures, docstrings, file paths). Top-scoring nodes get passed to the LLM. The graph structure means a retrieved function automatically pulls in its direct dependencies via edge traversal.
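
A simplified sketch of retrieval plus one-hop dependency expansion, here using SQLite FTS5's built-in bm25() rank function for brevity (a hand-rolled scorer would have the same interface):

```typescript
import Database from "better-sqlite3";

const db = new Database("codegraph.db");
db.exec(`
  CREATE VIRTUAL TABLE IF NOT EXISTS node_fts
  USING fts5(id UNINDEXED, name, signature, docstring, file);
`);

// FTS5's bm25() returns lower-is-better scores, so ascending order ranks best first.
function topNodes(query: string, k = 10): string[] {
  const rows = db
    .prepare(`
      SELECT id FROM node_fts
      WHERE node_fts MATCH ?
      ORDER BY bm25(node_fts)
      LIMIT ?
    `)
    .all(query, k) as { id: string }[];
  return rows.map((r) => r.id);
}

// Expand each hit with its direct dependencies: one hop of edge traversal.
function withDependencies(ids: string[]): string[] {
  const deps = db.prepare(`SELECT dst FROM edges WHERE src = ?`);
  const expanded = new Set(ids);
  for (const id of ids) {
    for (const row of deps.all(id) as { dst: string }[]) expanded.add(row.dst);
  }
  return [...expanded];
}
```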

Empirically this lands at ~5K tokens per query on medium-large codebases that would otherwise require ~100K tokens with naive full-context approaches.

Hierarchical fallback for complex queries

For multi-file reasoning tasks (a sketch of the compression step follows the list):

  1. A Mermaid diagram of the full graph serves as a persistent architectural map that stays in context
  2. BM25 node retrieval handles targeted lookup
  3. At 70% context capacity, a fast model compresses the least-relevant nodes before passing them to the primary model
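
In code, the 70% trigger is just a loop over a relevance-sorted list. estimateTokens and summarizeWithFastModel below are hypothetical stand-ins for the tokenizer and the fast-model call:

```typescript
// Stand-in helpers: a rough chars/4 token heuristic, and a placeholder for the
// fast-model summarization call (both hypothetical, named for readability).
const estimateTokens = (s: string): number => Math.ceil(s.length / 4);
async function summarizeWithFastModel(chunk: string): Promise<string> {
  return chunk; // placeholder: in practice a cheap model returns a short summary
}

// nodesByRelevance is assumed sorted most-relevant first (e.g. by BM25 score).
async function fitToBudget(
  nodesByRelevance: string[],
  contextBudget: number,
): Promise<string[]> {
  const out = [...nodesByRelevance];
  const used = () => out.reduce((sum, s) => sum + estimateTokens(s), 0);

  // At 70% of context capacity, compress from the least-relevant end inward.
  for (let i = out.length - 1; i >= 0 && used() >= contextBudget * 0.7; i--) {
    out[i] = await summarizeWithFastModel(out[i]);
  }
  return out;
}
```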

Why BM25 over embeddings here

Code identifiers (function names, type names, module paths) are highly distinctive lexically. BM25 outperforms embedding similarity on exact and near-exact identifier matching, which is the dominant retrieval pattern in code queries. Embeddings would likely help more for natural-language docstring queries; I haven't benchmarked that comparison rigorously yet.
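
One intuition for why this holds: with a code-aware tokenizer that splits camelCase and snake_case, an identifier query shares exactly the rare terms its definition carries, and BM25's IDF weighting rewards those rare terms. A hypothetical tokenizer:

```typescript
// Hypothetical code-aware tokenizer: split on punctuation/paths, then on
// camelCase boundaries, before handing terms to BM25.
function codeTokens(text: string): string[] {
  return text
    .split(/[^A-Za-z0-9]+/)                            // "a/b.ts" -> ["a", "b", "ts"]
    .flatMap((t) => t.split(/(?<=[a-z0-9])(?=[A-Z])/)) // "GraphStore" -> ["Graph", "Store"]
    .filter(Boolean)
    .map((t) => t.toLowerCase());
}

// codeTokens("GraphStore.upsert_node") -> ["graph", "store", "upsert", "node"]
```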

Open questions

Things I'm still thinking about:

  • Better edge-weighting strategies for the graph — currently all edges are unweighted
  • Whether re-ranking with a cross-encoder would meaningfully improve precision over BM25 alone
  • Handling dynamic languages where call graphs can't be fully resolved statically

Has anyone tackled codebase-scale RAG differently? Particularly curious if anyone's compared AST-graph approaches against embedding-based chunk retrieval on real codebases with quantitative benchmarks.

u/Altruistic_Night_327 — 14 days ago

Been building an AI coding tool and kept hitting the same wall: feeding a real codebase to an LLM burns through context fast. A medium-sized production project easily hits ~100K tokens. That's expensive and slow, and the model starts hallucinating file relationships.

Here's the approach I landed on:

Step 1 — Parse into a typed graph

Tree-sitter parses every file into an AST, and a walk over each tree extracts functions, classes, interfaces, imports, exports, and call relationships. The result is stored as a node/edge graph in SQLite. One-time cost, persistent across sessions.

Step 2 — BM25 scoring at query time

Instead of re-reading files, every query scores the graph nodes by relevance using BM25. Only top-scoring nodes go to the LLM; everything else stays in the database.

Step 3 — Hierarchical fallback

For complex queries: a Mermaid diagram acts as a persistent high-level codebase map, BM25 handles targeted retrieval, and at 70% context capacity a fast model compresses the least relevant nodes before passing to the main model.
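
Generating that map is cheap once the graph lives in SQLite. A sketch against an edges table like the one described above (a real map would probably aggregate to module level to stay compact):

```typescript
import Database from "better-sqlite3";

// Mermaid node ids can't contain ':' or '/', so squash them to identifiers.
const mmId = (id: string): string => id.replace(/[^A-Za-z0-9_]/g, "_");

function mermaidMap(dbPath = "codegraph.db"): string {
  const db = new Database(dbPath, { readonly: true });
  const edges = db
    .prepare(`SELECT src, dst, kind FROM edges`)
    .all() as { src: string; dst: string; kind: string }[];
  const lines = edges.map((e) => `  ${mmId(e.src)} -->|${e.kind}| ${mmId(e.dst)}`);
  return ["graph TD", ...lines].join("\n");
}
```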

Result: ~5K tokens per query instead of ~100K. Provider-agnostic — works the same whether you're on GPT-4o, Claude, Gemini, or a local Ollama model.
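
Provider-agnostic in practice just means retrieval and compression sit behind a thin completion interface. A hypothetical shape, with an Ollama adapter as the example (the interface and adapter names are mine, not the tool's actual API):

```typescript
// Hypothetical provider abstraction: each backend (OpenAI, Anthropic, Gemini,
// Ollama) adapts its SDK to this shape, so nothing upstream is provider-aware.
interface ChatProvider {
  complete(prompt: string, opts?: { maxTokens?: number }): Promise<string>;
}

// Example adapter for a local Ollama server via its HTTP generate endpoint.
const ollama = (model: string, base = "http://localhost:11434"): ChatProvider => ({
  async complete(prompt, opts) {
    const res = await fetch(`${base}/api/generate`, {
      method: "POST",
      body: JSON.stringify({
        model,
        prompt,
        stream: false,
        options: { num_predict: opts?.maxTokens },
      }),
    });
    const data = (await res.json()) as { response: string };
    return data.response;
  },
});
```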

Happy to go deeper on any part of this — the BM25 implementation, the graph schema, or the compression layer. Anyone else tackling codebase RAG differently?

u/Altruistic_Night_327 — 14 days ago

Been building Atlarix in Electron + React + TypeScript for about a year. Wanted to share the CI/CD setup since it took a while to get right.

The pipeline handles macOS builds with Apple notarization via xcrun notarytool, hardened-runtime codesigning, Linux packaging across all three formats, and automated release publishing, all triggered on tag push.

The hardened runtime + notarization part specifically was painful. Happy to share the Actions config if anyone's fighting with that.

App itself is an AI coding environment with local model support (Ollama/LM Studio). atlarix.dev if curious, but mainly posting because the Electron + GitHub Actions setup might save someone a headache.

u/Altruistic_Night_327 — 17 days ago