I tuned llama.cpp on a Windows 11 + WSL Ubuntu laptop and ended up keeping only two models:
- Gemma 4 E4B IT for fast daily use and vision
- Qwen3.6-35B-A3B for bigger text/coding workloads
Hardware
- Quadro RTX 3000 (6 GB VRAM)
- i7-10875H
- 64 GB DDR4 2933 MHz
- Samsung 980 PRO 1 TB
Software
- Windows 11 host
- WSL Ubuntu
- llama.cpp
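For context: a standard CUDA build under WSL looks roughly like this (binaries land in build/bin by default, so adjust the ./llama.cpp/llama-server paths below to your layout):

# clone and build llama.cpp with the CUDA backend (needs the CUDA toolkit inside WSL)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
# parallel release build; the server binary ends up in build/bin/llama-server
cmake --build build --config Release -j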
Gemma 4 E4B IT:
./llama.cpp/llama-server \
  -m $GEMMA_E4B/gemma-4-E4B-it-UD-Q4_K_XL.gguf \
  --mmproj $GEMMA_E4B/mmproj-BF16.gguf \
  --alias "gemma4-e4b-vision-fast" \
  -ngl 99 \
  --flash-attn on \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  --ctx-size 131072 \
  --batch-size 4096 \
  --ubatch-size 2048 \
  --parallel 1 \
  --no-kv-unified \
  --threads 8 \
  --threads-batch 12 \
  --threads-http 2 \
  --jinja \
  --host 127.0.0.1 \
  --port 8080
Result: 49.57 t/s at 128k context, with vision enabled.
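For a quick vision sanity check, something like this against the OpenAI-compatible endpoint should work (test.png is just a placeholder image; the model name matches the --alias above, and llama-server accepts images as base64 data URLs when an mmproj is loaded):

# send one text+image request; base64 -w0 inlines the image as a data URL
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "gemma4-e4b-vision-fast",
        "messages": [{
          "role": "user",
          "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {"type": "image_url", "image_url": {"url": "data:image/png;base64,'"$(base64 -w0 test.png)"'"}}
          ]
        }]
      }'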
Qwen3.6-35B-A3B:
GGML_OP_OFFLOAD_MIN_BATCH=128 ./llama.cpp/llama-server \
  -m $QWEN36_35B/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
  --alias "qwen36-35b-a3b-fast" \
  --fit off \
  -ngl 999 \
  --n-cpu-moe 36 \
  --flash-attn on \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  --ctx-size 65536 \
  --batch-size 4096 \
  --ubatch-size 2048 \
  --parallel 1 \
  --no-kv-unified \
  --threads 8 \
  --threads-batch 10 \
  --threads-http 2 \
  --reasoning off \
  --reasoning-budget 0 \
  --cache-ram 0 \
  --jinja \
  --no-mmap \
  --host 127.0.0.1 \
  --port 8080
Result: 20.3 t/s at 64k context.
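If you want to reproduce the t/s numbers, the server's native /completion endpoint returns a timings block you can pull out with jq (field name from memory, so double-check against your build):

# generate 256 tokens and print the measured generation speed (t/s)
curl -s http://127.0.0.1:8080/completion \
  -d '{"prompt": "Write a Python function that reverses a string.", "n_predict": 256}' \
  | jq '.timings.predicted_per_second'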
Main questions:
- Is there still anything meaningful left to optimize for Qwen3.6 on a 6 GB GPU?
- For coding, is a small reasoning budget worth enabling?
- On Gemma 4 E4B, is there any obvious improvement left without dropping vision or 128k context?