u/Exotic-Tear593

r/Qwen_AI

TL;DR:

We went from 16 t/s with a 14B dense model to 70 t/s with a 35B MoE model on a Mac Mini M4 Pro 64GB, using Ollama 0.22.1 with NVFP4 MLX acceleration. Here's exactly what we did and what we learned:

Hardware

  • Mac Mini M4 Pro, 64GB unified memory
  • Ollama 0.22.1 (critical — MLX runner was broken in 0.21.x for NVFP4)

Model

  • qwen3.6:35b-a3b-coding-nvfp4

Our Journey: 4-Stage Benchmarking

We ran a systematic benchmark across multiple models and configurations using an identical prompt and 120 output tokens each time. Here are the results:

MODEL         FORMAT        ENGINE       CTX    KV CACHE     THINKING   AVG EVAL RATE
Qwen2.5-14B   GGUF Q8_0     llama.cpp    32k    q8_0         n/a        16.2 t/s
Qwen3.6-35B   GGUF Q5_K_M   llama.cpp    32k    q8_0         ON         27.3 t/s
Qwen3.6-35B   GGUF Q5_K_M   llama.cpp    32k    q8_0         OFF        27.5 t/s
Qwen3.6-35B   GGUF Q5_K_M   llama.cpp    8k     q8_0         OFF        27.4 t/s
Qwen3.6-35B   GGUF Q5_K_M   llama.cpp    32k    f16          OFF        27.8 t/s
Qwen3.6-35B   NVFP4         MLX native   32k    MLX native   OFF        69.8 t/s

Key Findings

1. MoE beats dense models — even at larger scale

The 35B MoE (Mixture-of-Experts) model was 69% faster than the 14B dense model at the same context window. This seems counterintuitive but makes sense once you understand MoE: only ~3.5B parameters activate per token, not all 34.7B. The active parameter count during inference is actually smaller than the 14B dense model, despite the total being much larger.

We dropped the 14B entirely. There was no reason to keep it.
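The speedup claim above can be sanity-checked with a quick back-of-envelope calculation (using the measured rates from our table; the "decode is roughly bandwidth-bound" framing is our working assumption, not a profiled fact):

```python
# Back-of-envelope check of the MoE-vs-dense speedup, using measured numbers.
# Working assumption: decode on Apple Silicon is roughly memory-bandwidth-bound,
# so token rate scales inversely with the weights actually read per token.

dense_tps = 16.2   # Qwen2.5-14B GGUF Q8_0, measured
moe_tps = 27.3     # Qwen3.6-35B GGUF Q5_K_M, thinking ON, measured

speedup = moe_tps / dense_tps
print(f"MoE over 14B dense: {speedup - 1:.0%} faster")  # → ~69% faster

# Direction makes sense: only ~3.5B of 34.7B params activate per token,
# so the MoE reads far fewer weights per token than the 14B dense model.
active_fraction = 3.5 / 34.7
print(f"Active fraction per token: {active_fraction:.1%}")
```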

2. Context window reduction doesn't help for short tasks

We tested 32k vs 8k context windows expecting a speed improvement. There was none: less than 0.1 t/s difference. The quadratic attention cost only matters when you're near the context limit. For agentic tasks under 500 tokens, context window size is irrelevant to throughput.
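The intuition in numbers: attention work depends on tokens actually sitting in the cache, not on the configured window size. A short sketch (500 tokens is our assumed "typical agentic task" size from above):

```python
# The attention bill is paid per token actually in the KV cache, not per slot
# in the configured window. For a ~500-token agentic task, both windows
# process the same 500 tokens — only the cache *fill level* differs.
windows = {"8k": 8_192, "32k": 32_768}
tokens_used = 500

for name, window in windows.items():
    fill = tokens_used / window
    print(f"{name} window: cache only {fill:.1%} full; attention work identical")
```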

3. KV cache type (f16 vs q8_0) makes negligible difference

Switching `OLLAMA_KV_CACHE_TYPE` from `q8_0` to `f16` yielded +0.3 t/s — well within noise margin. f16 does cost more RAM for the cache, but we kept it for the marginal accuracy benefit; just don't expect a speed win from it.
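For a feel of the RAM trade-off, here's a back-of-envelope KV-cache estimate. The layer/head/dim numbers below are illustrative assumptions, not published specs for this model, and q8_0's per-element overhead is approximated away:

```python
# Rough KV-cache size: f16 vs q8_0 at a full 32k window.
# ASSUMED architecture numbers for illustration only — not real specs
# for qwen3.6:35b-a3b.
layers, kv_heads, head_dim, ctx = 48, 8, 128, 32_768

def kv_cache_bytes(bytes_per_elt: float) -> float:
    # 2 tensors (K and V) per layer, each shaped [ctx, kv_heads, head_dim]
    return 2 * layers * kv_heads * head_dim * ctx * bytes_per_elt

f16_gb = kv_cache_bytes(2.0) / 1024**3   # f16 = 2 bytes/element
q8_gb = kv_cache_bytes(1.0) / 1024**3    # q8_0 ≈ 1 byte/element (ignoring scales)
print(f"f16: {f16_gb:.1f} GB, q8_0: {q8_gb:.1f} GB")  # → f16: 6.0 GB, q8_0: 3.0 GB
```

A few GB either way on a 64GB machine, which is why we were free to pick f16 on accuracy grounds alone.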

4. Thinking suppression saves token budget, not wall-clock time

Qwen3.6 is a thinking model. With thinking ON, all 120 eval tokens went to the `<think>` reasoning trace — zero useful output. Speed was identical with thinking OFF. The gain is purely token efficiency, not throughput.

For NVFP4 specifically, thinking tokens appear in a separate `thinking` field — so the `response` field is always clean regardless. Pass `think: false` at the API call level when you don't need reasoning.
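Handling that split in client code looks roughly like this — the `reply` dict below is a mocked response shape for illustration, not a live API call:

```python
# Sketch of consuming a /api/generate reply where reasoning lands in a
# separate `thinking` field. `reply` is MOCKED here, not fetched from Ollama.
reply = {
    "response": "MoE routes each token through a few experts...",
    "thinking": "The user wants a short architectural comparison...",
    "eval_count": 120,
}

answer = reply["response"]           # clean output, no <think> tags to strip
trace = reply.get("thinking", "")    # present only when thinking is enabled
print(answer)
```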

Results

NVFP4 is the real breakthrough — 2.5× faster than GGUF Q5

The NVFP4 model (`qwen3.6:35b-a3b-coding-nvfp4` from the Ollama library) uses NVIDIA's floating-point 4-bit quantisation format, which Ollama's MLX engine accelerates natively on Apple Silicon.

  • 69.8 t/s vs 27.8 t/s for GGUF Q5_K_M
  • 21GB RAM vs 26GB for GGUF Q5_K_M (saves 5GB)
  • Comparable quality: FP4 preserves dynamic range better than INT4

This is not theoretical. It just works, out of the box, with `ollama pull`.

Critical Setup Notes

Ollama version matters — 0.21.x will crash with NVFP4

mlx runner failed: golang.org/x/sync@v0.17.0/errgroup/errgroup.go:78

If you see this error, upgrade to 0.22.1. The MLX runner had a critical bug in 0.21.x for NVFP4 models.

brew upgrade ollama

After upgrading, force restart the server

`brew upgrade` installs the new binary, but the old server process keeps running. Kill it explicitly:

pkill -f "ollama serve"

launchd will re-spawn the new version automatically.

Protect your custom launchd plist

If you've added custom environment variables to `~/Library/LaunchAgents/homebrew.mxcl.ollama.plist`, brew upgrade can overwrite them.

Our production launchd environment variables

<key>OLLAMA_FLASH_ATTENTION</key><string>1</string>
<key>OLLAMA_KV_CACHE_TYPE</key><string>f16</string>
<key>OLLAMA_HOST</key><string>0.0.0.0:11434</string>
<key>OLLAMA_NUM_PARALLEL</key><string>1</string>
<key>OLLAMA_MAX_LOADED_MODELS</key><string>1</string>
<key>OLLAMA_KEEP_ALIVE</key><string>30m</string>

`NUM_PARALLEL=1` is intentional — for sequential agentic workloads, splitting GPU bandwidth across parallel requests hurts per-request latency. One fast request beats two slow ones.
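The latency arithmetic behind that choice, in an idealized sketch (assumes GPU bandwidth splits evenly across concurrent requests, which is a simplification):

```python
# Why NUM_PARALLEL=1 for sequential agent loops: with a fixed GPU budget,
# two concurrent requests each run at ~half speed, so per-request latency
# roughly doubles while total throughput stays flat. Idealized model —
# assumes an even bandwidth split.
tps = 69.8          # measured single-request rate (t/s)
tokens = 120        # tokens per request

sequential = tokens / tps              # seconds per request, one at a time
parallel = tokens / (tps / 2)          # seconds per request, two at once

print(f"sequential: {sequential:.2f}s/request, parallel: {parallel:.2f}s/request")
```

Same aggregate throughput either way, but every individual agent step waits twice as long — exactly what you don't want in a sequential loop.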

How to Replicate

ollama pull qwen3.6:35b-a3b-coding-nvfp4

# Run a benchmark

curl -s http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.6:35b-a3b-coding-nvfp4",
    "prompt": "Explain the difference between dense transformer and MoE model architectures in 3 bullet points.",
    "stream": false,
    "think": false,
    "options": { "num_predict": 256 }
  }' | python3 -m json.tool
  • Check `eval_count` and `eval_duration` in the response to compute throughput.
  • tokens/second = eval_count / (eval_duration / 1e9), since `eval_duration` is reported in nanoseconds.
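As a small helper, the formula above in Python (the example duration is illustrative, chosen to match our measured rate):

```python
# Compute tokens/second from the /api/generate response fields.
# eval_duration is reported in nanoseconds.
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    return eval_count / (eval_duration_ns / 1e9)

# e.g. 120 tokens in 1.72e9 ns (illustrative numbers):
print(round(tokens_per_second(120, 1_720_000_000), 1))  # → 69.8
```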

Tell us what you achieve.

Memory Budget          64GB     32GB
macOS baseline:        ~5GB     ~5GB
Qwen3.6 NVFP4:         ~21GB    ~21GB
─────────────────────────────────────
Total:                 ~26GB    ~26GB
Headroom:              ~38GB    ~6GB   Comfortable!
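The headroom arithmetic, for anyone sizing a different machine:

```python
# Headroom = total unified memory minus (macOS baseline + model RAM),
# using the approximate figures from the budget above.
macos_baseline_gb = 5
model_gb = 21
used = macos_baseline_gb + model_gb   # ~26GB

for total in (64, 32):
    print(f"{total}GB machine: {total - used}GB headroom")  # → 38GB / 6GB
```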

Closing Notes

On 32GB you'd be tighter but it should still work fine.

The frugal routing principle: local models handle lightweight, frequent ops at zero marginal cost; cloud handles the work that justifies the spend. Happy inferencing 😄

u/Exotic-Tear593 — 11 days ago