u/Exotic-Tear593

r/Qwen_AI

TL;DR:

We went from 16 t/s with a 14B dense model to 70 t/s with a 35B MoE model on a Mac Mini M4 Pro 64GB, using Ollama 0.22.1 with NVFP4 MLX acceleration. Here's exactly what we did and what we learned:

Hardware

  • Mac Mini M4 Pro, 64GB unified memory
  • Ollama 0.22.1 (critical — MLX runner was broken in 0.21.x for NVFP4)

Model

  • qwen3.6:35b-a3b-coding-nvfp4

Our Journey: 4-Stage Benchmarking

We ran a systematic benchmark across multiple models and configurations using an identical prompt and 120 output tokens each time. Here are the results:

MODEL         FORMAT        ENGINE       CTX    KV CACHE     THINKING   AVG EVAL RATE
Qwen2.5-14B   GGUF Q8_0     llama.cpp    32k    q8_0         n/a        16.2 t/s
Qwen3.6-35B   GGUF Q5_K_M   llama.cpp    32k    q8_0         ON         27.3 t/s
Qwen3.6-35B   GGUF Q5_K_M   llama.cpp    32k    q8_0         OFF        27.5 t/s
Qwen3.6-35B   GGUF Q5_K_M   llama.cpp    8k     q8_0         OFF        27.4 t/s
Qwen3.6-35B   GGUF Q5_K_M   llama.cpp    32k    f16          OFF        27.8 t/s
Qwen3.6-35B   NVFP4         MLX native   32k    MLX native   OFF        69.8 t/s

Key Findings

1. MoE beats dense models — even at larger scale

The 35B MoE (Mixture-of-Experts) model was 69% faster than the 14B dense model at the same context window. This seems counterintuitive but makes sense once you understand MoE: only ~3.5B parameters activate per token, not all 34.7B. The active parameter count during inference is actually smaller than the 14B dense model, despite the total being much larger.

We dropped the 14B entirely. There was no reason to keep it.
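The speedup claim above can be sanity-checked with a quick back-of-envelope calculation (using the measured rates from our table; the "decode is roughly bandwidth-bound" framing is our working assumption, not a profiled fact):

```python
# Back-of-envelope check of the MoE-vs-dense speedup, using measured numbers.
# Working assumption: decode on Apple Silicon is roughly memory-bandwidth-bound,
# so token rate scales inversely with the weights actually read per token.

dense_tps = 16.2   # Qwen2.5-14B GGUF Q8_0, measured
moe_tps = 27.3     # Qwen3.6-35B GGUF Q5_K_M, thinking ON, measured

speedup = moe_tps / dense_tps
print(f"MoE over 14B dense: {speedup - 1:.0%} faster")  # → ~69% faster

# Direction makes sense: only ~3.5B of 34.7B params activate per token,
# so the MoE reads far fewer weights per token than the 14B dense model.
active_fraction = 3.5 / 34.7
print(f"Active fraction per token: {active_fraction:.1%}")
```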

2. Context window reduction doesn't help for short tasks

We tested 32k vs 8k context windows expecting a speed improvement. There was none: less than 0.1 t/s difference. The quadratic attention cost only matters when you're near the context limit. For agentic tasks under 500 tokens, context window size is irrelevant to throughput.
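The intuition in numbers: attention work depends on tokens actually sitting in the cache, not on the configured window size. A short sketch (500 tokens is our assumed "typical agentic task" size from above):

```python
# The attention bill is paid per token actually in the KV cache, not per slot
# in the configured window. For a ~500-token agentic task, both windows
# process the same 500 tokens — only the cache *fill level* differs.
windows = {"8k": 8_192, "32k": 32_768}
tokens_used = 500

for name, window in windows.items():
    fill = tokens_used / window
    print(f"{name} window: cache only {fill:.1%} full; attention work identical")
```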

3. KV cache type (f16 vs q8_0) makes negligible difference

Switching `OLLAMA_KV_CACHE_TYPE` from `q8_0` to `f16` yielded +0.3 t/s — well within noise margin. f16 does cost more RAM for the cache, but we kept it for the marginal accuracy benefit; just don't expect a speed win from it.
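For a feel of the RAM trade-off, here's a back-of-envelope KV-cache estimate. The layer/head/dim numbers below are illustrative assumptions, not published specs for this model, and q8_0's per-element overhead is approximated away:

```python
# Rough KV-cache size: f16 vs q8_0 at a full 32k window.
# ASSUMED architecture numbers for illustration only — not real specs
# for qwen3.6:35b-a3b.
layers, kv_heads, head_dim, ctx = 48, 8, 128, 32_768

def kv_cache_bytes(bytes_per_elt: float) -> float:
    # 2 tensors (K and V) per layer, each shaped [ctx, kv_heads, head_dim]
    return 2 * layers * kv_heads * head_dim * ctx * bytes_per_elt

f16_gb = kv_cache_bytes(2.0) / 1024**3   # f16 = 2 bytes/element
q8_gb = kv_cache_bytes(1.0) / 1024**3    # q8_0 ≈ 1 byte/element (ignoring scales)
print(f"f16: {f16_gb:.1f} GB, q8_0: {q8_gb:.1f} GB")  # → f16: 6.0 GB, q8_0: 3.0 GB
```

A few GB either way on a 64GB machine, which is why we were free to pick f16 on accuracy grounds alone.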

4. Thinking suppression saves token budget, not wall-clock time

Qwen3.6 is a thinking model. With thinking ON, all 120 eval tokens went to the `<think>` reasoning trace — zero useful output. Speed was identical with thinking OFF. The gain is purely token efficiency, not throughput.

For NVFP4 specifically, thinking tokens appear in a separate `thinking` field — so the `response` field is always clean regardless. Pass `think: false` at the API call level when you don't need reasoning.
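Handling that split in client code looks roughly like this — the `reply` dict below is a mocked response shape for illustration, not a live API call:

```python
# Sketch of consuming a /api/generate reply where reasoning lands in a
# separate `thinking` field. `reply` is MOCKED here, not fetched from Ollama.
reply = {
    "response": "MoE routes each token through a few experts...",
    "thinking": "The user wants a short architectural comparison...",
    "eval_count": 120,
}

answer = reply["response"]           # clean output, no <think> tags to strip
trace = reply.get("thinking", "")    # present only when thinking is enabled
print(answer)
```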

Results

NVFP4 is the real breakthrough — 2.5× faster than GGUF Q5

The NVFP4 model (`qwen3.6:35b-a3b-coding-nvfp4` from the Ollama library) uses NVIDIA's floating-point 4-bit quantisation format, which Ollama's MLX engine accelerates natively on Apple Silicon.

  • 69.8 t/s vs 27.8 t/s for GGUF Q5_K_M
  • 21GB RAM vs 26GB for GGUF Q5_K_M (saves 5GB)
  • Comparable quality: FP4 preserves dynamic range better than INT4

This is not theoretical. It just works, out of the box, with `ollama pull`.

Critical Setup Notes

Ollama version matters — 0.21.x will crash with NVFP4

mlx runner failed: golang.org/x/sync@v0.17.0/errgroup/errgroup.go:78

If you see this error, upgrade to 0.22.1. The MLX runner had a critical bug in 0.21.x for NVFP4 models.

brew upgrade ollama

After upgrading, force restart the server

`brew upgrade` installs the new binary, but the old server process keeps running. Kill it explicitly:

pkill -f "ollama serve"

launchd will re-spawn the new version automatically.

Protect your custom launchd plist

If you've added custom environment variables to `~/Library/LaunchAgents/homebrew.mxcl.ollama.plist`, brew upgrade can overwrite them.

Our production launchd environment variables

<key>OLLAMA_FLASH_ATTENTION</key><string>1</string>
<key>OLLAMA_KV_CACHE_TYPE</key><string>f16</string>
<key>OLLAMA_HOST</key><string>0.0.0.0:11434</string>
<key>OLLAMA_NUM_PARALLEL</key><string>1</string>
<key>OLLAMA_MAX_LOADED_MODELS</key><string>1</string>
<key>OLLAMA_KEEP_ALIVE</key><string>30m</string>

`NUM_PARALLEL=1` is intentional — for sequential agentic workloads, splitting GPU bandwidth across parallel requests hurts per-request latency. One fast request beats two slow ones.
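The latency arithmetic behind that choice, in an idealized sketch (assumes GPU bandwidth splits evenly across concurrent requests, which is a simplification):

```python
# Why NUM_PARALLEL=1 for sequential agent loops: with a fixed GPU budget,
# two concurrent requests each run at ~half speed, so per-request latency
# roughly doubles while total throughput stays flat. Idealized model —
# assumes an even bandwidth split.
tps = 69.8          # measured single-request rate (t/s)
tokens = 120        # tokens per request

sequential = tokens / tps              # seconds per request, one at a time
parallel = tokens / (tps / 2)          # seconds per request, two at once

print(f"sequential: {sequential:.2f}s/request, parallel: {parallel:.2f}s/request")
```

Same aggregate throughput either way, but every individual agent step waits twice as long — exactly what you don't want in a sequential loop.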

How to Replicate

ollama pull qwen3.6:35b-a3b-coding-nvfp4

# Run a benchmark

curl -s http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.6:35b-a3b-coding-nvfp4",
    "prompt": "Explain the difference between dense transformer and MoE model architectures in 3 bullet points.",
    "stream": false,
    "think": false,
    "options": { "num_predict": 256 }
  }' | python3 -m json.tool
  • Check `eval_count` and `eval_duration` in the response to compute throughput.
  • tokens/second = eval_count / (eval_duration / 1e9), since `eval_duration` is reported in nanoseconds.
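As a small helper, the formula above in Python (the example duration is illustrative, chosen to match our measured rate):

```python
# Compute tokens/second from the /api/generate response fields.
# eval_duration is reported in nanoseconds.
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    return eval_count / (eval_duration_ns / 1e9)

# e.g. 120 tokens in 1.72e9 ns (illustrative numbers):
print(round(tokens_per_second(120, 1_720_000_000), 1))  # → 69.8
```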

Tell us what you achieve.

Memory Budget          64GB     32GB
macOS baseline:        ~5GB     ~5GB
Qwen3.6 NVFP4:         ~21GB    ~21GB
─────────────────────────────────────
Total:                 ~26GB    ~26GB
Headroom:              ~38GB    ~6GB   Comfortable!
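The headroom arithmetic, for anyone sizing a different machine:

```python
# Headroom = total unified memory minus (macOS baseline + model RAM),
# using the approximate figures from the budget above.
macos_baseline_gb = 5
model_gb = 21
used = macos_baseline_gb + model_gb   # ~26GB

for total in (64, 32):
    print(f"{total}GB machine: {total - used}GB headroom")  # → 38GB / 6GB
```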

Closing Notes

On 32GB you'd be tighter but it should still work fine.

The frugal routing principle: local models handle lightweight, frequent ops at zero marginal cost; cloud handles the work that justifies the spend. Happy inferencing 😄

u/Exotic-Tear593 — 11 days ago