I tuned llama.cpp on a Windows 11 + WSL Ubuntu laptop and ended up keeping only two models:
- Gemma 4 E4B IT for fast daily use and vision
- Qwen3.6-35B-A3B for bigger text/coding workloads
Hardware
- Quadro RTX 3000 (6 GB VRAM)
- i7-10875H
- 64 GB DDR4 2933 MHz
- Samsung 980 PRO 1 TB
Software
- Windows 11 host
- WSL Ubuntu
- llama.cpp
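For context: a standard CUDA build under WSL looks roughly like this (binaries land in build/bin by default, so adjust the ./llama.cpp/llama-server paths below to your layout):

# clone and build llama.cpp with the CUDA backend (needs the CUDA toolkit inside WSL)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
# parallel release build; the server binary ends up in build/bin/llama-server
cmake --build build --config Release -j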
Gemma 4 E4B IT:
./llama.cpp/llama-server \
  -m $GEMMA_E4B/gemma-4-E4B-it-UD-Q4_K_XL.gguf \
  --mmproj $GEMMA_E4B/mmproj-BF16.gguf \
  --alias "gemma4-e4b-vision-fast" \
  -ngl 99 \
  --flash-attn on \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  --ctx-size 131072 \
  --batch-size 4096 \
  --ubatch-size 2048 \
  --parallel 1 \
  --no-kv-unified \
  --threads 8 \
  --threads-batch 12 \
  --threads-http 2 \
  --jinja \
  --host 127.0.0.1 \
  --port 8080
Result: 49.57 t/s at 128k context, with vision enabled.
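For a quick vision sanity check, something like this against the OpenAI-compatible endpoint should work (test.png is just a placeholder image; the model name matches the --alias above, and llama-server accepts images as base64 data URLs when an mmproj is loaded):

# send one text+image request; base64 -w0 inlines the image as a data URL
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "gemma4-e4b-vision-fast",
        "messages": [{
          "role": "user",
          "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {"type": "image_url", "image_url": {"url": "data:image/png;base64,'"$(base64 -w0 test.png)"'"}}
          ]
        }]
      }'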
Qwen3.6-35B-A3B:
GGML_OP_OFFLOAD_MIN_BATCH=128 ./llama.cpp/llama-server \
  -m $QWEN36_35B/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
  --alias "qwen36-35b-a3b-fast" \
  --fit off \
  -ngl 999 \
  --n-cpu-moe 36 \
  --flash-attn on \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  --ctx-size 65536 \
  --batch-size 4096 \
  --ubatch-size 2048 \
  --parallel 1 \
  --no-kv-unified \
  --threads 8 \
  --threads-batch 10 \
  --threads-http 2 \
  --reasoning off \
  --reasoning-budget 0 \
  --cache-ram 0 \
  --jinja \
  --no-mmap \
  --host 127.0.0.1 \
  --port 8080
Result: 20.3 t/s at 64k context.
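If you want to reproduce the t/s numbers, the server's native /completion endpoint returns a timings block you can pull out with jq (field name from memory, so double-check against your build):

# generate 256 tokens and print the measured generation speed (t/s)
curl -s http://127.0.0.1:8080/completion \
  -d '{"prompt": "Write a Python function that reverses a string.", "n_predict": 256}' \
  | jq '.timings.predicted_per_second'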
Main questions:
- Is there still anything meaningful left to optimize for Qwen3.6 on a 6 GB GPU?
- For coding, is a small reasoning budget worth enabling?
- On Gemma 4 E4B, is there any obvious improvement left without dropping vision or 128k context?