
5 open-source models that fit in 32GB at Q4 quantization, all available via Ollama or LM Studio. Tested on both NVIDIA (RTX 5090) and Apple Silicon (M4 Max).
The lineup:
ollama run qwen3:32b          # winner for general use
ollama run qwen2.5-coder:32b  # matches GPT-4o on HumanEval, runs offline
ollama run deepseek-r1:32b    # best for chain-of-thought reasoning
ollama run gemma3:27b         # Google's open weights, native vision
ollama run mistral-small:24b  # fastest tokens/sec, best for agent loops
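If you want to sanity-check that a given model actually fits at Q4 on your machine, loading it once and then checking ollama ps is enough — it reports the resident size and whether the model is fully on GPU. A minimal check (the prompt is just a throwaway):

ollama run qwen3:32b "hello"   # load the model once with a one-shot prompt
ollama ps                      # shows resident size and the CPU/GPU split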
Which model wins which workload: https://vist.ly/426dd
Two practical notes:
- On M4 Max → use the MLX path for ~30% faster inference than the default Ollama backend (rough sketch after these notes)
- On RTX 5090 → a straight ollama run is fine, no tuning needed for any of these
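For the MLX path, one option is the mlx-lm package instead of Ollama's backend (LM Studio also ships an MLX runtime on Apple Silicon). A minimal sketch — the exact model repo name below is an assumption, check the mlx-community page on Hugging Face for the current 4-bit quant:

pip install mlx-lm
# model repo name is an assumption; swap in whichever 4-bit MLX quant you actually use
mlx_lm.generate --model mlx-community/Qwen2.5-Coder-32B-Instruct-4bit \
  --prompt "Write a binary search in Python" --max-tokens 256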
I understand the M4 Max can run far more powerful models than these, but even with multiple tabs, apps, Cursor, Claude Code, etc. open, all of these ran perfectly.
Curious what everyone here defaults to — is anyone running multiple models in parallel for different tasks, or am I just being indecisive?
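For what it's worth, Ollama can keep more than one model resident at a time. A rough sketch of what I mean by parallel models for different tasks, using the OLLAMA_MAX_LOADED_MODELS setting and the REST API — the prompts are just placeholders:

# allow two models to stay loaded at once (set before starting the server)
OLLAMA_MAX_LOADED_MODELS=2 ollama serve

# hit each model for a different task via the REST API
curl http://localhost:11434/api/generate -d '{"model": "qwen2.5-coder:32b", "prompt": "Refactor this function", "stream": false}'
curl http://localhost:11434/api/generate -d '{"model": "qwen3:32b", "prompt": "Summarize these notes", "stream": false}'

Whether both actually fit depends on total memory, so on a 32GB box this is more realistic with two smaller models than with two of the 32B quants.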