
I spent a while getting this dialed in and wrote up the full recipe. Short version:
- 35B MoE at TQ3_4S comes to 12.4GB of weights
- KV cache at q8_0/q8_0 with the full 262K context only uses ~2.7GB, because this MoE has just 10 attention layers out of 40 (sanity-check math in the script after this list)
- Total VRAM: ~16GB, leaving ~7GB headroom on a 3090
- ~111 tok/s generation
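
If you want to check the cache number yourself, here's the arithmetic as a quick script. The KV head count and head dim below are my assumptions for illustration; substitute the real values llama.cpp prints when it loads your GGUF:

```python
# Back-of-envelope KV cache sizing. N_KV_HEADS and HEAD_DIM are guesses --
# swap in the values from your model's GGUF metadata.
N_CTX         = 262_144   # context length
N_ATTN_LAYERS = 10        # full attention layers in this model (of 40 total)
N_KV_HEADS    = 4         # assumption: grouped-query attention
HEAD_DIM      = 128       # assumption

# q8_0 packs 32 values into 34 bytes (32 int8 quants + one fp16 scale)
Q8_0_BYTES_PER_VALUE = 34 / 32

def kv_cache_gib(n_layers: int) -> float:
    # K and V each hold n_kv_heads * head_dim values per token per layer
    values = 2 * n_layers * N_KV_HEADS * HEAD_DIM * N_CTX
    return values * Q8_0_BYTES_PER_VALUE / 1024**3

print(f"{kv_cache_gib(N_ATTN_LAYERS):.2f} GiB with 10 attention layers")  # ~2.66
print(f"{kv_cache_gib(40):.2f} GiB if all 40 layers had attention")       # ~10.62
```

Same cache at f16 would be ~5GiB, so q8_0 roughly halves it on top of the layer-count savings.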
The thing that surprised me most was how little the KV cache costs at full context. I kept expecting it to OOM and it just... didn't. Having only 10 attention layers instead of 40 makes a real difference at this context length.
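For reference, this is roughly the shape of my server launch, wrapped in Python so it's copy-pasteable. The flags (-c, -ctk, -ctv, -ngl, -fa, --host) are mainline llama-server spellings and they drift between versions and forks, so check --help on your build; the model path is a placeholder:

```python
# Sketch of the launch under mainline flag spellings -- verify against
# your build. The model path is a placeholder.
import subprocess

subprocess.run([
    "./build/bin/llama-server",
    "-m", "model-tq3_4s.gguf",  # placeholder path to the quantized weights
    "-c", "262144",             # full 262K context
    "-ctk", "q8_0",             # quantized K cache
    "-ctv", "q8_0",             # quantized V cache
    "-fa", "on",                # flash attention (bare -fa on older builds);
                                #   mainline requires it for a quantized V cache
    "-ngl", "99",               # offload every layer to the GPU
    "--host", "0.0.0.0",        # bind all interfaces so the WSL side can connect
], check=True)
```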
The guide covers building llama.cpp-tq3 from source, tuning the KV cache, and wiring it up to OpenCode running in WSL. There's a WSL networking gotcha that cost me an afternoon: host.docker.internal doesn't work. You need the nameserver IP from /etc/resolv.conf.
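If it saves someone else the afternoon: from inside WSL2, the Windows host is normally reachable at the nameserver address WSL auto-generates in /etc/resolv.conf (unless you've disabled generateResolvConf, in which case you'll need the host IP some other way). A tiny hypothetical helper to build the base URL for an OpenAI-compatible client:

```python
# Hypothetical helper: resolve the Windows host from inside WSL2 by reading
# the nameserver that WSL auto-generates in /etc/resolv.conf.
import re

def windows_host_ip(path: str = "/etc/resolv.conf") -> str:
    with open(path) as f:
        for line in f:
            m = re.match(r"\s*nameserver\s+(\S+)", line)
            if m:
                return m.group(1)
    raise RuntimeError(f"no nameserver entry in {path}")

# llama-server speaks an OpenAI-compatible API; 8080 is its default port.
base_url = f"http://{windows_host_ip()}:8080/v1"
print(base_url)  # point OpenCode's provider config at this URL
```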