I used Claude Code to build the same web app 3 different ways (cloud Claude, free NVIDIA NIM, local GPU) to see how they compare
TLDR: Local LLMs for agentic coding went from "not a chance" to "actually works" for me once I found MoE models that can offload experts to RAM. Still slower than real Claude, but I was surprised how far it got, and I can now see open-source local LLMs eventually replacing cloud AI.
Background
I use VS Code + Claude Code (paid) at work and wanted to see how close you can get to that experience locally, either for "free as in freedom" reasons or just curiosity about where things actually are.
The test I came up with: I have a real app I built over months (SaltyChart, seasonal anime watchlist/rankings/wheel spinner) and I turned it into a spec file. Then I gave that spec to three different setups and said "build it." Same starting point, same task, see what happens.
Hardware: RTX 3080 10GB VRAM, 96GB DDR4-3400 RAM, Intel(R) Core(TM) i5-12600K, Windows 11
Step 1: Finding an IDE setup that actually works
I tried Cline, Continue, and Roo Code with free LLMs and couldn't get any of them working the way I wanted. Maybe that's on me, but I kept running into config issues or UX that just felt wrong. Cursor was genuinely great... right up until it asked for a subscription when I brought my own backend. Hard pass.
What I actually wanted was just "Claude Code but pointed at a different model." Turns out that's a thing. Claude Code supports a custom ANTHROPIC_BASE_URL, and clawgate handles the translation from Anthropic API format to OpenAI format that your local server expects. free-claude-code does something similar if clawgate doesn't work for you.
Step 2: Testing NVIDIA NIM free tier
build.nvidia.com gives you free API access to some large models. The catch is you have no idea what speed you'll get, and it varies constantly. I built a benchmark tool to check TTFT and tok/s before starting a real session, because at under ~40 tok/s coding gets painful. You're waiting too long between actions and it's hard to catch mistakes before the model goes too far down the wrong path.
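For anyone who wants to do the same check, here is a minimal sketch of the measurement itself, not my actual tool. It assumes you already have some iterator that yields one streamed chunk per token (an approximation, since a chunk is not always exactly one token). The names measure_stream and usable are made up for this example, and the 40 tok/s cutoff is just my own pain threshold from above.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, int]:
    """Return (ttft_seconds, tokens_per_second, n_chunks) for one streamed reply.

    Each yielded chunk is counted as one token, which is only an approximation.
    """
    start = clock()
    first = None
    last = start
    count = 0
    for _ in chunks:
        now = clock()
        if first is None:
            first = now  # arrival of the first chunk defines TTFT
        last = now
        count += 1
    if first is None:
        raise ValueError("stream produced no chunks")
    ttft = first - start
    gen_time = last - first
    # Rate over the generation phase only: the first chunk is covered by TTFT.
    rate = (count - 1) / gen_time if gen_time > 0 else 0.0
    return ttft, rate, count


def usable(tokens_per_second: float, threshold: float = 40.0) -> bool:
    """True if the measured rate is at or above the (assumed) comfort threshold."""
    return tokens_per_second >= threshold
```

Pass it the chunk iterator from whatever streaming client you use. Separating TTFT from the generation rate matters because a long prefill and a slow decode feel very different in an agent loop.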
The large models (Qwen3.5-122B, Mistral Medium 3.5 128B) were usable when they had bandwidth. They made fewer mistakes and could handle planning better. But usually only one model has decent throughput at a time, and it shifts around, so I was spending 15-20 min benchmarking before I could start anything.
The NIM run got through M1-M3 of my spec over a few days. Project is here. In hindsight the results were worse than I thought though. The planning doc the model wrote said M3 was complete, but when I actually looked at the code it was mostly stubs with one big "initial commit." I didn't catch this at the time because I didn't dig in deeply enough. This is a pattern with smaller models: they'll tell you something is done, or write a planning doc describing work as complete, when the actual implementation isn't there. You really do have to go back and verify.
Step 3: Dense models locally
Based on some outdated info I was looking at ~7B dense models as what would fit on 10GB VRAM. I tried using them to build the project planning doc and they just couldn't do it. Got stuck in loops, couldn't hold enough context to make good architectural decisions. They're fine for code completion, not for planning a whole project.
At this point I figured local agentic coding required either a 32GB GPU or a 128GB shared-memory box. Both $2000+.
Step 4: MoE models
Found more current info on Mixture-of-Experts models and specifically on llama.cpp's --n-cpu-moe flag. The idea: MoE models are large in total parameter count but only activate a small fraction per token. For Qwen3.6-35B-A3B-UD-IQ3_XXS that's 35B total but only ~3B active per token (256 experts, ~8 selected per layer). The attention layers and shared weights stay on VRAM, expert layers spill to RAM. On my setup with 24 expert layers offloaded:
- ~50 tok/s generation (warm turns)
- ~12s cold start on large contexts, fast after that
- 9,190 MB peak VRAM, just fits
EvalPlus HumanEval+ score: 92.7% pass@1. That matched the big 122B model I was testing on NIM, but running at 50 tok/s instead of 11-27 tok/s.
Getting --n-cpu-moe right took some work. The VRAM readings you get at idle are meaningless. You need to measure under actual inference load. I wrote a binary search script that loads a real 86K-token Claude Code request and finds the lowest n-cpu-moe that doesn't OOM, i.e. the setting that keeps as many expert layers on the GPU as will actually fit.
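The search itself is simple. Here's a rough sketch rather than my actual script. Since --n-cpu-moe N keeps the expert weights of the first N layers on the CPU, the value worth finding is the smallest N that still fits under load. The fits callback is hypothetical: in a real run it would start llama-server with that N, replay the captured request, and return False on an out-of-memory failure. The sketch also assumes that if N fits, N + 1 fits too.

```python
from typing import Callable


def min_cpu_moe(n_layers: int, fits: Callable[[int], bool]) -> int:
    """Smallest N in [0, n_layers] for which fits(N) is True.

    Assumes monotonicity: moving more expert layers to the CPU never
    increases VRAM use, so if N fits then N + 1 fits as well.
    """
    lo, hi = 0, n_layers
    if not fits(hi):
        raise RuntimeError("does not fit even with all expert layers on the CPU")
    while lo < hi:
        mid = (lo + hi) // 2
        if fits(mid):
            hi = mid       # mid works, try keeping more layers on the GPU
        else:
            lo = mid + 1   # mid runs out of memory, offload more layers
    return lo


if __name__ == "__main__":
    # Stand-in predicate for demonstration only; 40 layers is a made-up count.
    print(min_cpu_moe(40, lambda n: n >= 24))
```

Each probe means a full model load plus a large prefill, so cutting the number of probes from linear to logarithmic is the whole point.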
Step 5: TurboQuant detour
I tried the TurboQuant fork of llama.cpp for its smaller KV-cache quantization, which would let me keep more of the context active. Hit a nasty bug though. Qwen3 uses a hybrid attention architecture combining standard softmax attention and GatedDeltaNet layers. The TurboQuant fork was missing the SWA (Sliding Window Attention) / hybrid attention KV cache fix that mainline llama.cpp already had. Without that fix, the KV cache was getting invalidated on every request, so the model was doing a full context prefill on every single turn instead of only on new tokens. Warm turns that should be 0.1s were taking 12+ seconds. This is tracked in the TurboQuant issues (currently as a Gemma4 request to merge the upstream fix, but it's the same underlying problem).
Switched back to mainline llama.cpp b9143 which had the fix already. Moved a few more expert layers to RAM to fit the KV cache, but the speed difference was massive.
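To make the cost of a lost cache concrete, here's a toy model of prefix reuse. It is a conceptual sketch only, not how llama.cpp implements its cache: it just counts how many tokens of a new request match the cached sequence from the start, since only the remainder needs prefill. The function names are made up for the example.

```python
from typing import Sequence


def reusable_prefix(cached: Sequence[int], new: Sequence[int]) -> int:
    """Number of leading tokens that `new` shares with `cached`."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n


def tokens_to_prefill(cached: Sequence[int], new: Sequence[int]) -> int:
    """Tokens that still need a forward pass when the shared prefix is reused."""
    return len(new) - reusable_prefix(cached, new)


if __name__ == "__main__":
    previous = list(range(1000))                 # tokens of the previous turn
    follow_up = previous + list(range(5000, 5020))
    print(tokens_to_prefill(previous, follow_up))  # only the 20 new tokens

    changed = list(follow_up)
    changed[5] = -1                              # one early token differs
    print(tokens_to_prefill(previous, changed))    # almost everything again
```

A single differing token near the start of the prompt, or a cache that gets thrown away entirely, turns a 20-token prefill into a full one.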
Step 6: Getting Claude Code actually working locally
Even with a fast model there were several Claude Code-specific things to sort out.
The stack:
Claude Code (VS Code) -> rate_proxy (:8083) -> clawgate (:8082) -> llama-server (:8081)
clawgate handles the format translation. I needed an extra proxy layer (rate_proxy.py) for two things:
- Token counting. Claude Code calls /v1/messages/count_tokens to know when to auto-compact the context. If this breaks or returns wrong numbers, auto-compact never fires and you eventually hit the context limit mid-task. llama-server b9143 handles this endpoint natively, so the proxy just passes it through.
- Adaptive thinking injection. Qwen3 supports a thinking mode via /think and /no_think in the system prompt. Thinking costs tokens but helps on hard problems. The proxy injects /no_think on normal turns to save 500-2000 tokens, and removes it on error turns so the model can actually reason through what went wrong. Server runs with --reasoning auto so the model can think when the injection is absent.
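The thinking injection boils down to something like the sketch below. This is not the real rate_proxy.py, just the idea. It assumes the request body is in Anthropic Messages format, where system is either a string or a list of text blocks. It also assumes an "error turn" can be detected as a tool_result block with is_error set in the last message, which is a guess at a reasonable heuristic. The function names are made up.

```python
import copy
from typing import Any, Dict

NO_THINK = "/no_think"


def last_turn_has_error(body: Dict[str, Any]) -> bool:
    """True if the newest message carries a tool_result flagged as an error."""
    messages = body.get("messages") or []
    if not messages:
        return False
    content = messages[-1].get("content")
    if not isinstance(content, list):
        return False  # plain string content has no tool results
    return any(
        isinstance(block, dict)
        and block.get("type") == "tool_result"
        and block.get("is_error") is True
        for block in content
    )


def apply_thinking_policy(body: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the request with /no_think added on normal turns.

    On error turns the request is returned unchanged, so the server is
    free to let the model think.
    """
    out = copy.deepcopy(body)
    if last_turn_has_error(out):
        return out
    system = out.get("system")
    if system is None:
        out["system"] = NO_THINK
    elif isinstance(system, str):
        if NO_THINK not in system:
            out["system"] = system + "\n" + NO_THINK
    elif not any(NO_THINK in block.get("text", "") for block in system):
        system.append({"type": "text", "text": NO_THINK})
    return out
```

Working on a deep copy keeps the original request intact, which makes it easy to log what Claude Code sent versus what was forwarded.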
Claude Code settings that actually mattered:
CLAUDE_CODE_ATTRIBUTION_HEADER=0 is the big one. Claude Code injects a billing header that includes a hash changing every single request. That hash is part of the prefill, so without this flag every turn is a cold start. With it: 0.1s warm turns. Without it: 12s+ every turn. That's a 120x difference on warm turns.
CLAUDE_CODE_AUTO_COMPACT_WINDOW=131072 tells Claude Code the actual context window is 128K instead of whatever the model's nominal spec says. Otherwise auto-compact fires at the wrong threshold or not at all.
CLAUDE_AUTOCOMPACT_PCT_OVERRIDE=85 makes auto-compact fire at 85% of context so there's room for the summary.
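Put together, the environment looks roughly like the sketch below. The variable names and values are the ones from this post. The helper functions are made up, and http://localhost:8083 assumes the rate_proxy from the stack above is listening on localhost. The threshold arithmetic reflects my understanding that the percentage applies to the configured window.

```python
import os
from typing import Dict, Mapping


def claude_code_env(
    base: Mapping[str, str],
    proxy_url: str,
    window: int = 131072,
    compact_pct: int = 85,
) -> Dict[str, str]:
    """Copy of `base` with the settings from this post applied."""
    env = dict(base)
    env.update({
        "ANTHROPIC_BASE_URL": proxy_url,           # point Claude Code at the proxy
        "CLAUDE_CODE_ATTRIBUTION_HEADER": "0",      # keep the prompt prefix stable
        "CLAUDE_CODE_AUTO_COMPACT_WINDOW": str(window),
        "CLAUDE_AUTOCOMPACT_PCT_OVERRIDE": str(compact_pct),
    })
    return env


def compact_threshold(window: int, pct: int) -> int:
    """Token count at which auto-compact would fire, assuming pct of window."""
    return window * pct // 100


if __name__ == "__main__":
    env = claude_code_env(os.environ, "http://localhost:8083")
    threshold = compact_threshold(131072, 85)
    print(threshold, 131072 - threshold)  # trigger point and remaining headroom
```

With a 131072-token window and 85%, compaction would trigger at about 111K tokens, leaving roughly 19K tokens for the summary and the next turn.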
MCP tools used:
- serena-slim for file editing. Better than the default read-the-whole-file-and-rewrite pattern on large files.
- context7 for live library docs. Local models have older training cutoffs and context7 pulls current documentation on demand.
- Playwright is built into Claude Code natively and lets the model spin up a browser, navigate, and verify UI behavior directly.
Results
| | Claude Sonnet 4.6 | NVIDIA NIM (free) | Local Qwen3.6-35B-A3B-UD-IQ3_XXS |
|---|---|---|---|
| Milestones completed | M0-M9 (all 9) | M0-M3 (with gaps) | M0-M3 (solid) |
| Unit tests | 47/47 | 14/14 | 39/39 |
| Deployable? | Yes, fully | Barely | Yes (browse-only) |
| Time | One evening (~5 hours) | A few days | Each milestone took days |
Claude Sonnet 4.6 built all 9 milestones in a single evening. Complete feature set: wheel spinner with confetti and tick sound, side-by-side compare view with PNG export, full watchlist with pre/post-watch rankings. Not pixel-perfect but shippable. Honestly impressive, and it's why I still pay for the subscription.
NVIDIA NIM free got through M1-M3 over a few days. I spent the least time with this one and the results were weaker than I expected when I went back and looked. The planning doc said M3 was done. The actual code was mostly stubs. This is a real problem with smaller/less capable models: they'll claim something is complete when it isn't. You have to keep going back and asking "are you actually sure that's done?" or just checking the code yourself.
Local Qwen3.6-35B also got through M0-M3, at a few days per milestone. The same over-reporting problem applies here, even more so than with the bigger NIM models. It makes mistakes constantly, but it doesn't loop. It'll go down the wrong path, hit a failing test, and eventually self-correct. With unit tests running on every save and some patience to let it run overnight, it does get there. It's just slow and needs more checking.
Conclusion
When I started this I thought local agentic coding on consumer hardware wasn't viable unless you were buying $2000+ of new gear. Dense 7B models confirmed that impression. MoE changed it.
Qwen3.6-35B-A3B on my 10GB VRAM machine hits 92.7% on EvalPlus, runs at 50 tok/s locally, and once all the Claude Code settings are sorted out it functions as a real coding agent. It makes more mistakes than cloud Claude, it's slower, and you need to babysit it more. But it works, it's fully local, and the hardware requirements aren't what I thought they were a year ago.
If you're doing this, the things that bit me hardest: CLAUDE_CODE_ATTRIBUTION_HEADER=0 is the single highest-leverage setting you'll touch. Claude Code injects a per-request billing hash (cch) that changes every turn and becomes part of the prefill, so every request is a cold start unless you disable it. On an 86K context that's 12s TTFT per turn vs 0.1s. One env var. The SWA/hybrid-attention KV cache bug will silently do the same thing if you're on a fork that hasn't picked up the upstream fix. And smaller models will confidently declare something done when it isn't actually built. You have to read the code, not just the summary.
I'd love to know what others are doing with their setups, what I missed, and how to make mine better.
Edit: added CPU and local model.