u/Main_Brush_5086
Running 28B LLMs locally on a ~$550 mini PC (no discrete GPU)
Dense vs MoE models on iGPU — same 28B, ~3.5x the speed
The formula for memory-bound inference is just:
tok/s ≈ bandwidth ÷ bytes read per token
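As a one-liner sketch (the ~78 GB/s figure below is an assumed effective bandwidth, not a spec number):

```python
# Memory-bound decode: every generated token must stream the active
# weights from RAM, so memory bandwidth sets the ceiling on speed.
def tokens_per_sec(bandwidth_gb_s: float, bytes_read_gb: float) -> float:
    """Upper-bound decode speed for a memory-bound model."""
    return bandwidth_gb_s / bytes_read_gb

# Illustrative: 14 GB of weights over an assumed ~78 GB/s bus
print(f"{tokens_per_sec(78, 14):.1f} tok/s")
```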
Running on a Radeon 780M with shared DDR5 RAM (~75-80 GB/s effective bandwidth):
Qwen3-27B at Q4_K — dense, every token reads all 14 GB of weights: 14 GB ÷ 78 GB/s → ~5.7 tok/s (measured: 5.8)
Gemma 4 28B — MoE, each token only activates ~4-5B params out of 28B: 4 GB ÷ 78 GB/s → ~20 tok/s (measured: 19.5)
Same stated size, ~3.5x faster (19.5 vs 5.8 tok/s). Because each token reads ~3.5x less data (4 GB vs 14 GB).
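Plugging both read sizes into the formula shows why the bandwidth itself cancels out of the speedup (numbers are the post's; the 78 GB/s is assumed):

```python
# Dense vs MoE on the same memory bus: only the bytes read per token
# differ, so the speedup is just the ratio of bytes read.
bandwidth = 78.0    # GB/s, assumed effective DDR5 shared-memory bandwidth
dense_read = 14.0   # GB per token: all Q4_K weights
moe_read = 4.0      # GB per token: active experts only

speedup = (bandwidth / moe_read) / (bandwidth / dense_read)
print(f"{speedup:.2f}x")  # prints "3.50x" — same as dense_read / moe_read
```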
The inactive experts aren't wasted — the router picks the best-matched experts for each token, and that selection is where the quality comes from. You just don't pay the bandwidth cost for the experts that aren't selected.
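A toy top-k router makes the mechanism concrete — the shapes and expert count below are made up for illustration, not taken from any real model:

```python
import numpy as np

# Toy MoE layer: a router scores all experts, but only the top-k
# experts' weight matrices are ever read for a given token. That is
# why bandwidth cost scales with active params, not total params.
rng = np.random.default_rng(0)
d, n_experts, k = 64, 8, 2
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
router_w = rng.standard_normal((d, n_experts))

def moe_forward(x):
    logits = x @ router_w
    top = np.argsort(logits)[-k:]                # k best-matched experts
    gates = np.exp(logits[top])
    gates /= gates.sum()                         # softmax over selected
    # Only k of the n_experts matrices are touched in memory here:
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

y = moe_forward(rng.standard_normal(d))
```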
Pattern holds at 32B too: dense Qwen3-32B at Q8 hits 2.8 tok/s, the MoE variant (A3B, ~3B active) hits 20.8 tok/s on the same box.
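A back-of-envelope check of that 32B pair, assuming ~1 byte/param at Q8 and the same ~78 GB/s — real runs drift from the estimate because of KV-cache reads, router overhead, and quantization details:

```python
# Rough bandwidth-bound estimates vs the post's measured numbers.
BW = 78.0  # GB/s, assumed effective bandwidth
cases = [
    ("dense 32B @ Q8", 32.0, 2.8),            # (name, GB read/token, measured tok/s)
    ("MoE A3B (~3B active) @ Q8", 3.0, 20.8),
]
for name, gb_read, measured in cases:
    est = BW / gb_read
    print(f"{name}: estimated {est:.1f} tok/s, measured {measured}")
```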
If you're running local models on integrated graphics, MoE is worth understanding.