u/ElectrifiedThor

I’m setting up a small homelab for local LLM inference (coding assistants and local knowledge tools), mostly targeting ~20B–40B models like Qwen and Gemma using INT4/FP4 quantization.

I’m trying to understand the real-world tradeoff between running dual 3090s (48 GB of total VRAM) and moving to a single 50-series card like a 5070 Ti or 5080, which has much higher low-precision throughput but significantly less VRAM (16 GB).
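For context on why VRAM is the first thing I'm checking, here's the rough footprint math I've been using. The config numbers are assumptions for an illustrative 32B-class dense model (roughly Qwen-32B-shaped); the real model cards may differ:

```python
# Rough VRAM estimate for a ~30B dense model at 4-bit weights plus KV cache.
# All numbers are illustrative assumptions, not measured values.

def weights_gb(params_b: float, bits_per_weight: float = 4.5) -> float:
    """Quantized weight footprint in GB (4 bits plus overhead for scales/zero-points)."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context: int, bytes_per_elem: int = 2) -> float:
    """KV cache in GB for one sequence at FP16 (2 bytes/element), K and V."""
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem / 1e9

# Hypothetical 32B-class config; check the actual model card before relying on this.
weights = weights_gb(32)                                                 # ~18 GB
kv = kv_cache_gb(layers=64, kv_heads=8, head_dim=128, context=32_768)   # ~8.6 GB

print(f"weights ≈ {weights:.1f} GB, KV ≈ {kv:.1f} GB, total ≈ {weights + kv:.1f} GB")
```

By that estimate a 32B model at 4-bit with a long context sits in the mid-20s of GB, which is comfortable on 48 GB but not on a single 16 GB card.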

For those with hands-on experience, what tends to become the bottleneck around ~30B models in practice: VRAM capacity or compute throughput? And how meaningful is the actual speed gain from INT4/FP4 on newer architectures compared to 3090-class cards? Will that gap widen as software support for the latest tensor-core generation matures? Any concrete tokens/sec comparisons or observations would be really helpful.
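My own back-of-envelope for single-stream decode assumes it's memory-bandwidth bound (every generated token streams all the weight bytes), which would make raw INT4/FP4 compute matter more for prefill and batching than for decode. The bandwidth figures are approximate spec-sheet numbers and the efficiency factor is a guess, so treat this as a sketch rather than a benchmark:

```python
# Bandwidth-bound estimate of batch-1 decode speed: tokens/s ≈ effective
# memory bandwidth / bytes streamed per token (≈ quantized weight size).

def decode_tok_per_s(bandwidth_gbps: float, weight_gb: float,
                     efficiency: float = 0.6) -> float:
    """Rough upper bound on decode tokens/s; efficiency is an assumed derating."""
    return bandwidth_gbps * efficiency / weight_gb

weight_gb = 18.0  # ~32B model at 4-bit, from the estimate above
for name, bw in [("RTX 3090 (~936 GB/s)", 936),
                 ("RTX 5070 Ti (~896 GB/s)", 896),
                 ("RTX 5080 (~960 GB/s)", 960)]:
    print(f"{name}: ~{decode_tok_per_s(bw, weight_gb):.0f} tok/s (if the weights fit)")
```

If that framing is right, the cards land within a few tok/s of each other for decode, and the question becomes whether prefill/batch throughput and future FP4 kernel support change the picture. Happy to be corrected by real numbers.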

Not looking for a generic recommendation, just trying to better understand how these tradeoffs play out in real workloads.

Context: I already have two 12GB 3060s lying around.
