TLDR: Local LLMs for agentic coding went from "not a chance" to "actually works" for me once I found MoE models that can offload experts to RAM. Still slower than real Claude, but I was surprised how far it got, and could see that opensource local llm can, and will eventually replace cloud ai.

Background

I use VS Code + Claude Code (paid) at work and wanted to see how close you can get to that experience locally, either for "free as in freedom" reasons or just curiosity about where things actually are.

The test I came up with: I have a real app I built over months (SaltyChart, seasonal anime watchlist/rankings/wheel spinner) and I turned it into a spec file. Then I gave that spec to three different setups and said "build it." Same starting point, same task, see what happens.

Hardware: RTX 3080 10GB VRAM, 96GB DDR4-3400 RAM, Intel(R) Core(TM) i5-12600K, Windows 11

Step 1: Finding an IDE setup that actually works

I tried Cline, Continue, and Roo Code with free LLMs and couldn't get any of them working the way I wanted. Maybe that's on me, but I kept running into config issues or UX that just felt wrong. Cursor was genuinely great... right up until it asked for a subscription when I brought my own backend. Hard pass.

What I actually wanted was just "Claude Code but pointed at a different model." Turns out that's a thing. Claude Code supports a custom ANTHROPIC_BASE_URL, and clawgate handles the translation from Anthropic API format to OpenAI format that your local server expects. free-claude-code does something similar if clawgate doesn't work for you.

Step 2: Testing NVIDIA NIM free tier

build.nvidia.com gives you free API access to some large models. The catch is you have no idea what speed you'll get, and it varies constantly. I built a benchmark tool to check TTFT and tok/s before starting a real session, because at under ~40 tok/s coding gets painful. You're waiting too long between actions and it's hard to catch mistakes before the model goes too far down the wrong path.

The large models (Qwen3.5-122B, Mistral Medium 3.5 128B) were usable when they had bandwidth. They made fewer mistakes and could handle planning better. But usually only one model has decent throughput at a time, and it shifts around, so I was spending 15-20 min benchmarking before I could start anything.

The NIM run got through M1-M3 of my spec over a few days. Project is here. In hindsight the results were worse than I thought though. The planning doc the model wrote said M3 was complete, but when I actually looked at the code it was mostly stubs with one big "initial commit." I didn't catch this at the time because I didn't dig in deeply enough. This is a pattern with smaller models: they'll tell you something is done, or write a planning doc describing work as complete, when the actual implementation isn't there. You really do have to go back and verify.

Step 3: Dense models locally

Based on some outdated info I was looking at ~7B dense models as what would fit on 10GB VRAM. I tried using them to build the project planning doc and they just couldn't do it. Got stuck in loops, couldn't hold enough context to make good architectural decisions. They're fine for code completion, not for planning a whole project.

At this point I figured local agentic coding required either a 32GB GPU or a 128GB shared-memory box. Both $2000+.

Step 4: MoE models

Found more current info on Mixture-of-Experts models and specifically on llama.cpp's --n-cpu-moe flag. The idea: MoE models are large in total parameter count but only activate a small fraction per token. For Qwen3.6-35B-A3B-UD-IQ3_XXS that's 35B total but only ~3B active per token (256 experts, ~8 selected per layer). The attention layers and shared weights stay on VRAM, expert layers spill to RAM. On my setup with 24 expert layers offloaded:

~50 tok/s generation (warm turns)
~12s cold start on large contexts, fast after that
9,190 MB peak VRAM, just fits

EvalPlus HumanEval+ score: 92.7% pass@1. That matched the big 122B model I was testing on NIM, but running at 50 tok/s instead of 11-27 tok/s.

Getting --n-cpu-moe right took some work. The VRAM readings you get at idle are meaningless. You need to measure under actual inference load. I wrote a binary search script that loads a real 86K Claude Code request and finds the highest n-cpu-moe that doesn't OOM.

Step 5: TurboQuant detour

I tried the TurboQuant fork of llama.cpp for its smaller KV-cache quantization, which would let me keep more of the context active. Hit a nasty bug though. Qwen3 uses a hybrid attention architecture combining standard softmax attention and GatedDeltaNet layers. The TurboQuant fork was missing the SWA (Sliding Window Attention) / hybrid attention KV cache fix that mainline llama.cpp already had. Without that fix, the KV cache was getting invalidated on every request, so the model was doing a full context prefill on every single turn instead of only on new tokens. Warm turns that should be 0.1s were taking 12+ seconds. This is tracked in the TurboQuant issues (currently as a Gemma4 request to merge the upstream fix, but it's the same underlying problem).

Switched back to mainline llama.cpp b9143 which had the fix already. Moved a few more expert layers to RAM to fit the KV cache, but the speed difference was massive.

Step 6: Getting Claude Code actually working locally

Even with a fast model there were several Claude Code-specific things to sort out.

The stack:

Claude Code (VS Code) -&gt; rate_proxy (:8083) -&gt; clawgate (:8082) -&gt; llama-server (:8081)

clawgate handles the format translation. I needed an extra proxy layer (rate_proxy.py) for two things:

Token counting. Claude Code calls /v1/messages/count_tokens to know when to auto-compact the context. If this breaks or returns wrong numbers, auto-compact never fires and you eventually hit the context limit mid-task. llama-server b9143 handles this endpoint natively, so the proxy just passes it through.
Adaptive thinking injection. Qwen3 supports a thinking mode via /think and /no_think in the system prompt. Thinking costs tokens but helps on hard problems. The proxy injects /no_think on normal turns to save 500-2000 tokens, and removes it on error turns so the model can actually reason through what went wrong. Server runs with --reasoning auto so the model can think when the injection is absent.

Claude Code settings that actually mattered:

CLAUDE_CODE_ATTRIBUTION_HEADER=0 is the big one. Claude Code injects a billing header that includes a hash changing every single request. That hash is part of the prefill, so without this flag every turn is a cold start. With it: 0.1s warm turns. Without it: 12s+ every turn. That's a 120x difference on warm turns.

CLAUDE_CODE_AUTO_COMPACT_WINDOW=131072 tells Claude Code the actual context window is 128K instead of whatever the model's nominal spec says. Otherwise auto-compact fires at the wrong threshold or not at all.

CLAUDE_AUTOCOMPACT_PCT_OVERRIDE=85 makes auto-compact fire at 85% of context so there's room for the summary.

MCP tools used:

serena-slim for file editing. Better than the default read-the-whole-file-and-rewrite pattern on large files.
context7 for live library docs. Local models have older training cutoffs and context7 pulls current documentation on demand.
Playwright is built into Claude Code natively and lets the model spin up a browser, navigate, and verify UI behavior directly.

Results

	Claude Sonnet 4.6	NVIDIA NIM (free)	Local Qwen3.6-35B-A3B-UD-IQ3_XXS
Milestones completed	M0-M9 (all 9)	M0-M3 (with gaps)	M0-M3 (solid)
Unit tests	47/47	14/14	39/39
Deployable?	Yes, fully	Barely	Yes (browse-only)
Time	One evening (~5 hours)	A few days	Each milestone took days

Claude Sonnet 4.6 built all 9 milestones in a single evening. Complete feature set: wheel spinner with confetti and tick sound, side-by-side compare view with PNG export, full watchlist with pre/post-watch rankings. Not pixel-perfect but shippable. Honestly impressive, and it's why I still pay for the subscription.

NVIDIA NIM free got through M1-M3 over a few days. I spent the least time with this one and the results were weaker than I expected when I went back and looked. The planning doc said M3 was done. The actual code was mostly stubs. This is a real problem with smaller/less capable models: they'll claim something is complete when it isn't. You have to keep going back and asking "are you actually sure that's done?" or just checking the code yourself.

Local Qwen3.6-35B also got through M0-M3 over a few days per milestone. Same over-reporting problem applies here too, more so than with the bigger NIM models. It makes mistakes constantly, but it doesn't loop. It'll go down the wrong path, hit a failing test, and eventually self-correct. With unit tests running on every save and some patience to let it run overnight, it does get there. It's just slow and needs more checking.

Conclusion

When I started this I thought local agentic coding on consumer hardware wasn't viable unless you were buying $2000+ of new gear. Dense 7B models confirmed that impression. MoE changed it.

Qwen3.6-35B-A3B on my 10GB VRAM machine hits 92.7% on EvalPlus, runs at 50 tok/s locally, and once all the Claude Code settings are sorted out it functions as a real coding agent. It makes more mistakes than cloud Claude, it's slower, and you need to babysit it more. But it works, it's fully local, and the hardware requirements aren't what I thought they were a year ago.

If you're doing this, the things that bit me hardest: CLAUDE_CODE_ATTRIBUTION_HEADER=0 is the single highest-leverage setting you'll touch. Claude Code injects a per-request billing hash (cch) that changes every turn and becomes part of the prefill, so every request is a cold start unless you disable it. On an 86K context that's 12s TTFT per turn vs 0.1s. One env var. The SWA/hybrid-attention KV cache bug will silently do the same thing if you're on a fork that hasn't picked up the upstream fix. And smaller models will confidently declare something done when it isn't actually built. You have to read the code, not just the summary.

I'd love to know what others are doing with their setup. What I missed. And how to make my setup better.

Edit: add CPU, and Local Model

I have a 3 part question:

What model is currently needed to have a working coding agent that can handle reading a whole project (say a medium/small project like a website with a frontend/backend/sql database, running in python/flask, that would be deployed as a Docker image). Build new features, do optimizations, and debug issues.
What minimum system requirement is needed to run that model?
What would be a best bang for your buck to get those system requirements? Is there a single GPU that can run this? Multiple older GPUs? is it more efficient to get a Mac Mini? Or are the requirements too big and you really need to do some clustering to get this off the floor to be usable? And what are the prices for these setups?

I'm coming from using Claude Code (claude-sonnet-4.6 - 1M context). I know this isn't a fair comparison to what I'm asking, but I'm wonder what it would take to get something even close to working like Claude Code on local hardware right now.

I've personally tried setting something up on my setup, but it was a disaster. Currently I have a RTX 3080 10GB VRAM, 96GB ram, Intel Core i5-12600K. I was able to get Qwen 2.5 coding 7B running on my PC no worries, but it's just way too stupid to be able to hold the context needed to build a website test application in one go. The test I gave it was: Given a init .md file that describes a website with 3 pages + 1 sign-in page, build it using python/flask/build as Docker image. I have a completed version of this website built by Claude Code and compare how close it was able to get. Qwen 2.5 7B was able to read the init .md file, but got stuck in the loop of asking me what I wanted it to do. If it started to build files, it would write 1 of them, then get stuck just printing them out back to me rather than actually writing the files. And get stuck in a loop to do anything else.

I could potentially get Qwen 2.5 coding 14B to fit in my GPU, but no space for a context window (and when I tried it did spill over to system RAM and died at startup).

Yes I know my setup isn't really designed for local LLM. From what I have learned is that at "minimum" model you'd want for this is Qwen 2.5 coding 14B. In practice this won't actually cut it, and the 30B+ models are the better bet. Needing minimum 24GB VRAM, preferably 36GB for cache and context windows. But without being able to test myself it's hard to tell what others have gotten working. Is this correct? Does this feel like it's getting good enough to "replace" Claude Code locally? (I know Qwen 3 and Qwen 3.6 have been released as newer than the 2.5 variant, but they don't have the smaller models I could load onto my GPU).

And I'm pretty behind on what hardware is available to consumers and what their prices are. Other than expensive. So trying to see what to look into, what the ballpark of pricing would be to be able to run this.

u/drohack

I used Claude Code to build the same web app 3 different ways (cloud Claude, free NVIDIA NIM, local GPU) to see how they compare