u/Main_Brush_5086
Running 28B LLMs locally on a ~$550 mini PC (no discrete GPU)
Dense vs MoE models on iGPU — same 28B, ~3.5x the speed
The formula for memory-bound inference is just:
tok/s ≈ bandwidth ÷ bytes read per token
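As a one-liner sketch (the ~78 GB/s figure below is an assumed effective bandwidth, not a spec number):

```python
# Memory-bound decode: every generated token must stream the active
# weights from RAM, so memory bandwidth sets the ceiling on speed.
def tokens_per_sec(bandwidth_gb_s: float, bytes_read_gb: float) -> float:
    """Upper-bound decode speed for a memory-bound model."""
    return bandwidth_gb_s / bytes_read_gb

# Illustrative: 14 GB of weights over an assumed ~78 GB/s bus
print(f"{tokens_per_sec(78, 14):.1f} tok/s")
```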
Running on a Radeon 780M with shared DDR5 RAM (~75-80 GB/s effective bandwidth):
Qwen3-27B at Q4_K — dense, every token reads all 14 GB of weights: 14 GB ÷ 78 GB/s → ~5.7 tok/s (measured: 5.8)
Gemma 4 28B — MoE, each token only activates ~4-5B params out of 28B: 4 GB ÷ 78 GB/s → ~20 tok/s (measured: 19.5)
Same stated size, ~3.5x faster (19.5 vs 5.8 tok/s). Because each token reads ~3.5x less data (4 GB vs 14 GB).
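Plugging both read sizes into the formula shows why the bandwidth itself cancels out of the speedup (numbers are the post's; the 78 GB/s is assumed):

```python
# Dense vs MoE on the same memory bus: only the bytes read per token
# differ, so the speedup is just the ratio of bytes read.
bandwidth = 78.0    # GB/s, assumed effective DDR5 shared-memory bandwidth
dense_read = 14.0   # GB per token: all Q4_K weights
moe_read = 4.0      # GB per token: active experts only

speedup = (bandwidth / moe_read) / (bandwidth / dense_read)
print(f"{speedup:.2f}x")  # prints "3.50x" — same as dense_read / moe_read
```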
The inactive experts aren't wasted — the router picks the best-matched experts for each token, and that selection is where the quality comes from. You just don't pay the bandwidth cost for the experts that aren't selected.
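A toy top-k router makes the mechanism concrete — the shapes and expert count below are made up for illustration, not taken from any real model:

```python
import numpy as np

# Toy MoE layer: a router scores all experts, but only the top-k
# experts' weight matrices are ever read for a given token. That is
# why bandwidth cost scales with active params, not total params.
rng = np.random.default_rng(0)
d, n_experts, k = 64, 8, 2
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
router_w = rng.standard_normal((d, n_experts))

def moe_forward(x):
    logits = x @ router_w
    top = np.argsort(logits)[-k:]                # k best-matched experts
    gates = np.exp(logits[top])
    gates /= gates.sum()                         # softmax over selected
    # Only k of the n_experts matrices are touched in memory here:
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

y = moe_forward(rng.standard_normal(d))
```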
Pattern holds at 32B too: dense Qwen3-32B at Q8 hits 2.8 tok/s, the MoE variant (A3B, ~3B active) hits 20.8 tok/s on the same box.
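A back-of-envelope check of that 32B pair, assuming ~1 byte/param at Q8 and the same ~78 GB/s — real runs drift from the estimate because of KV-cache reads, router overhead, and quantization details:

```python
# Rough bandwidth-bound estimates vs the post's measured numbers.
BW = 78.0  # GB/s, assumed effective bandwidth
cases = [
    ("dense 32B @ Q8", 32.0, 2.8),            # (name, GB read/token, measured tok/s)
    ("MoE A3B (~3B active) @ Q8", 3.0, 20.8),
]
for name, gb_read, measured in cases:
    est = BW / gb_read
    print(f"{name}: estimated {est:.1f} tok/s, measured {measured}")
```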
If you're running local models on integrated graphics, MoE is worth understanding.