u/AIForOver50Plus

Spent two days benchmarking three Qwen3.6 variants against gpt-oss:120b on my dev rig (MacBook Pro, M3 Max) with Ollama. A few findings worth sharing for anyone running Ollama in production-shaped workflows.

Speed (temp 0.2, --think=false, structured-output research-brief workload):

Model                          Time   Size
qwen3.6:35b-a3b-coding-nvfp4    6s   (21 GB)
qwen3.6:35b-a3b-q8_0 (MoE)     22s   (38 GB)
qwen3.6:27b-q8_0 (Dense)       67s   (29 GB)
gpt-oss:120b                   61s   (65 GB)
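For anyone who wants the ratios rather than raw seconds, here is a minimal Python sketch that computes relative speedups from the timings above. The dict and the `speedup` helper are my own illustrative names, not part of the benchmark script, and these are single reported timings, so treat the ratios as rough.

```python
# Timings in seconds, copied from the table above.
timings = {
    "qwen3.6:35b-a3b-coding-nvfp4": 6,
    "qwen3.6:35b-a3b-q8_0": 22,
    "qwen3.6:27b-q8_0": 67,
    "gpt-oss:120b": 61,
}


def speedup(slower: str, faster: str) -> float:
    """How many times faster `faster` finished than `slower`."""
    return timings[slower] / timings[faster]


if __name__ == "__main__":
    baseline = "gpt-oss:120b"
    for name in timings:
        if name != baseline:
            print(f"{name}: {speedup(baseline, name):.2f}x vs {baseline}")
```

A ratio below 1.0 means the model was slower than the baseline, which is the case for the dense 27B here.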

Ollama-specific findings:

--think=false is honored by all three Qwen3.6 variants and silently ignored by gpt-oss. Same flag, same Ollama version, different runtime behavior: gpt-oss still runs full reasoning and dumps it to stdout. If you pipe Ollama output into anything that parses it, you have to engineer around the trace bleed for gpt-oss. Qwen3.6 just works.
Modelfile overlays cost effectively zero disk. I tuned each model with FROM model + PARAMETER temperature 0.2. ollama create reuses content-addressable layers, so only a tiny manifest is new. Confirmed by watching ollama create reuse 50+ existing layer hashes. Disk-free tuning is a real feature.
MoE 35B-A3B is about 3x faster than the dense 27B on the same workload (22s vs 67s). Active-parameter count drives per-token speed once the model fits in memory. On Apple Silicon unified memory, this matters a lot.
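One way to engineer around the gpt-oss trace bleed is to strip the reasoning block before anything downstream parses the output. This is a minimal Python sketch, not what my benchmark script does. It assumes the trace is wrapped in the "Thinking..." and "...done thinking." markers the Ollama CLI prints; check those strings against your Ollama version. `strip_trace` is a hypothetical helper name.

```python
import re

# ASSUMPTION: these are the markers the Ollama CLI prints around a
# reasoning trace. Verify them against your Ollama version.
THINK_START = "Thinking..."
THINK_END = "...done thinking."

# Non-greedy match from the start marker to the first end marker,
# plus any whitespace that follows it.
_TRACE = re.compile(
    re.escape(THINK_START) + r".*?" + re.escape(THINK_END) + r"\s*",
    re.DOTALL,
)


def strip_trace(raw: str) -> str:
    """Return `raw` with any marker-delimited reasoning trace removed."""
    return _TRACE.sub("", raw).strip()
```

Output without the markers passes through unchanged (apart from trimmed whitespace), so the same filter can sit in front of every model in the pipeline.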

Operational gotcha I almost missed:

The text-only coding-NVFP4 will silently hallucinate image descriptions when given an image via the API. It does not error and it does not refuse: you get a fluent, confident, completely fabricated description. Build a routing-layer allowlist for which models can take images input. Do not rely on the model to refuse on its own. It will not.
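A routing-layer allowlist can be very small. The Python sketch below rejects any request that carries images unless the target model is explicitly allowlisted. It assumes request payloads shaped like Ollama's API, with a top-level "images" list or per-message "images" in a "messages" list. `VISION_MODELS`, the placeholder model name, and `check_request` are hypothetical names for illustration.

```python
# Hypothetical allowlist: add only models you have verified accept images.
VISION_MODELS = {"example-vision-model:latest"}


def check_request(model: str, payload: dict) -> dict:
    """Return the payload unchanged, or raise if a model that is not
    allowlisted is being sent image input."""
    # ASSUMPTION: images arrive either as a top-level "images" list
    # or attached to individual chat messages.
    has_images = bool(payload.get("images")) or any(
        message.get("images") for message in payload.get("messages", [])
    )
    if has_images and model not in VISION_MODELS:
        raise ValueError(f"{model} is not allowlisted for image input")
    return payload
```

Failing closed is the point: a model missing from the allowlist gets an error at the router instead of a chance to fabricate a description.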

Full methodology, Bash benchmark script, all model outputs, and chart:

https://www.fabswill.com/blog/replacing-gpt-oss-with-qwen3-6-on-macbook-pro/?utm_source=reddit&utm_medium=social&utm_campaign=qwen-vs-gptoss

Disclosure: my blog. AI-assisted writeup; methodology and findings are mine.
