TL;DR:
We went from 16 t/s with a 14B dense model to 70 t/s with a 35B MoE model on a Mac Mini M4 Pro 64GB, using Ollama 0.22.1 with NVFP4 MLX acceleration. Here's exactly what we did and what we learned:
Hardware & Software
- Mac Mini M4 Pro, 64GB unified memory
- Ollama 0.22.1 (critical — MLX runner was broken in 0.21.x for NVFP4)
Model
- qwen3.6:35b-a3b-coding-nvfp4
Our Journey: 4-Stage Benchmarking
We ran a systematic benchmark across multiple models and configurations, using the same prompt and a fixed 120-token output budget each time. Here are the results:
| MODEL | FORMAT | ENGINE | CTX | KV CACHE | THINKING | AVG EVAL RATE |
|---|---|---|---|---|---|---|
| Qwen2.5-14B | GGUF Q8_0 | llama.cpp | 32k | q8_0 | — | 16.2 t/s |
| Qwen3.6-35B | GGUF Q5_K_M | llama.cpp | 32k | q8_0 | ON | 27.3 t/s |
| Qwen3.6-35B | GGUF Q5_K_M | llama.cpp | 32k | q8_0 | OFF | 27.5 t/s |
| Qwen3.6-35B | GGUF Q5_K_M | llama.cpp | 8k | q8_0 | OFF | 27.4 t/s |
| Qwen3.6-35B | GGUF Q5_K_M | llama.cpp | 32k | f16 | OFF | 27.8 t/s |
| Qwen3.6-35B | NVFP4 | MLX native | 32k | MLX native | OFF | 69.8 t/s |
Key Findings
1. MoE beats dense models — even at larger scale
The 35B MoE (Mixture-of-Experts) model was 69% faster than the 14B dense model at the same context window. This seems counterintuitive but makes sense once you understand MoE: only ~3.5B parameters activate per token, not all 34.7B. The active parameter count during inference is actually smaller than the 14B dense model, despite the total being much larger.
We dropped the 14B entirely. There was no reason to keep it.
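If you want to sanity-check those numbers locally, `ollama show` prints the model metadata, including total parameter count, quantisation, and context length (the exact fields vary a little between Ollama versions):

```
ollama show qwen3.6:35b-a3b-coding-nvfp4
```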
2. Context window reduction doesn't help for short tasks
We tested 32k vs 8k context windows expecting a speed improvement. There was none: less than 0.1 t/s difference. Attention cost scales with the tokens actually in context, not with the configured window size, so it only bites as a sequence approaches the limit. For agentic tasks under 500 tokens, context window size is irrelevant to throughput.
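Here's a minimal sketch of one way to A/B context sizes from the API. `num_ctx` and `num_predict` are standard Ollama request options; the prompt is a placeholder, and the model tag assumes the NVFP4 build from later in this post (swap in whichever variant you're testing):

```
for ctx in 8192 32768; do
  curl -s http://localhost:11434/api/generate \
    -d "{\"model\": \"qwen3.6:35b-a3b-coding-nvfp4\", \"prompt\": \"Summarise MoE routing in two sentences.\", \"stream\": false, \"think\": false, \"options\": {\"num_ctx\": $ctx, \"num_predict\": 120}}" \
    | python3 -c "import json,sys; r=json.load(sys.stdin); print('num_ctx', $ctx, '->', round(r['eval_count']/(r['eval_duration']/1e9),1), 't/s')"
done
```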
3. KV cache type (f16 vs q8_0) makes negligible difference
Switching `OLLAMA_KV_CACHE_TYPE` from `q8_0` to `f16` yielded +0.3 t/s, well within the noise margin. f16 does cost extra RAM at long contexts, so it's not a free upgrade at scale; we kept it for the marginal accuracy benefit, but don't expect a speed win from it.
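Both knobs are server-level environment variables, and Ollama only quantises the KV cache when flash attention is enabled, so flipping them means restarting the server. A foreground test run looks like:

```
# KV cache quantisation requires flash attention; both vars are read at startup
OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve
```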
4. Thinking suppression saves token budget, not wall-clock time
Qwen3.6 is a thinking model. With thinking ON, all 120 eval tokens went to the `<think>` reasoning trace — zero useful output. Speed was identical with thinking OFF. The gain is purely token efficiency, not throughput.
For NVFP4 specifically, thinking tokens appear in a separate `thinking` field — so the `response` field is always clean regardless. Pass `think: false` at the API call level when you don't need reasoning.
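A minimal sketch of both fields in action, assuming the server from the setup below is on the default localhost:11434 and your Ollama version supports the `think` parameter (the prompt is arbitrary):

```
# Thinking ON: reasoning lands in "thinking", the answer stays in "response"
curl -s http://localhost:11434/api/generate \
  -d '{"model": "qwen3.6:35b-a3b-coding-nvfp4", "prompt": "Is 7919 prime?", "stream": false, "think": true}' \
  | python3 -c "import json,sys; r=json.load(sys.stdin); print('thinking:', r.get('thinking','')[:120]); print('response:', r['response'][:120])"
```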
Results
NVFP4 is the real breakthrough — 2.5× faster than GGUF Q5
The NVFP4 model (`qwen3.6:35b-a3b-coding-nvfp4` from the Ollama library) uses NVIDIA's floating-point 4-bit quantisation format, which Ollama's MLX engine accelerates natively on Apple Silicon.
- 69.8 t/s vs 27.8 t/s for GGUF Q5_K_M
- 21GB RAM vs 26GB for GGUF Q5_K_M (saves 5GB)
- Comparable quality: FP4's floating-point format preserves dynamic range better than INT4
This is not theoretical. It just works, out of the box, with `ollama pull`.
Critical Setup Notes
Ollama version matters — 0.21.x will crash with NVFP4
```
mlx runner failed: golang.org/x/sync@v0.17.0/errgroup/errgroup.go:78
```
If you see this error, upgrade to 0.22.1. The MLX runner had a critical bug in 0.21.x for NVFP4 models.
```
brew upgrade ollama
```
After upgrading, force restart the server
Brew upgrade installs the new binary but the old server process keeps running. Kill it explicitly:
```
pkill -f "ollama serve"
```
launchd will re-spawn the new version automatically.
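Then confirm the respawned server actually picked up the new binary:

```
# Client and server should both report 0.22.1
ollama -v
curl -s http://localhost:11434/api/version
```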
Protect your custom launchd plist
If you've added custom environment variables to `~/Library/LaunchAgents/homebrew.mxcl.ollama.plist`, brew upgrade can overwrite them.
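A simple guard is to snapshot the plist before upgrading and diff it afterwards (the backup path here is just an example):

```
cp ~/Library/LaunchAgents/homebrew.mxcl.ollama.plist /tmp/ollama.plist.bak
brew upgrade ollama
diff /tmp/ollama.plist.bak ~/Library/LaunchAgents/homebrew.mxcl.ollama.plist \
  || echo "plist changed: re-apply your env vars"
```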
Our production launchd environment variables
```xml
<key>OLLAMA_FLASH_ATTENTION</key><string>1</string>
<key>OLLAMA_KV_CACHE_TYPE</key><string>f16</string>
<key>OLLAMA_HOST</key><string>0.0.0.0:11434</string>
<key>OLLAMA_NUM_PARALLEL</key><string>1</string>
<key>OLLAMA_MAX_LOADED_MODELS</key><string>1</string>
<key>OLLAMA_KEEP_ALIVE</key><string>30m</string>
```
`NUM_PARALLEL=1` is intentional — for sequential agentic workloads, splitting GPU bandwidth across parallel requests hurts per-request latency. One fast request beats two slow ones.
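After editing the plist, reload the launch agent so the new environment takes effect:

```
launchctl unload ~/Library/LaunchAgents/homebrew.mxcl.ollama.plist
launchctl load ~/Library/LaunchAgents/homebrew.mxcl.ollama.plist
```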
How to Replicate
```
ollama pull qwen3.6:35b-a3b-coding-nvfp4
```

```
# Run a benchmark
curl -s http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.6:35b-a3b-coding-nvfp4",
    "prompt": "Explain the difference between dense transformer and MoE model architectures in 3 bullet points.",
    "stream": false,
    "think": false,
    "options": { "num_predict": 256 }
  }' | python3 -m json.tool
```
- Check `eval_count` and `eval_duration` in the response to compute t/s (`eval_duration` is reported in nanoseconds):
- tokens/second = eval_count / (eval_duration / 1e9)
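Or do it in one shot; this sketch reuses the same endpoint and computes the rate directly (the prompt is arbitrary):

```
curl -s http://localhost:11434/api/generate \
  -d '{"model": "qwen3.6:35b-a3b-coding-nvfp4", "prompt": "Count to ten.", "stream": false, "think": false, "options": {"num_predict": 120}}' \
  | python3 -c "import json,sys; r=json.load(sys.stdin); print(round(r['eval_count']/(r['eval_duration']/1e9),1), 't/s')"
```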
Tell us what you achieve.
Memory Budget (64GB / 32GB machines)
macOS baseline: ~5GB
Qwen3.6 NVFP4: ~21GB
──────────────────────────
Total: ~26GB
Headroom: 38GB on a 64GB machine (comfortable) / 6GB on a 32GB machine (tight)
Closing Notes
On 32GB you'd be tighter, but it should still work fine.
The frugal routing principle: local models handle lightweight, frequent ops at zero marginal cost; the cloud handles the work that justifies the spend. Happy inferencing! 😄