u/BenEsq

I'm running a server in my office with a 7900 XTX. Performance is great, and the 24 GB of VRAM is enough for GPT-OSS-20B. I use the server to run tools for my team, including chat. I run a law firm and want to keep client data local.

I've been experimenting with Qwen 3.6-27B Q6 MLX and Qwen 3.6-35B Q6 MLX on my MacBook Pro (M5 Pro, 64 GB). The output is impressive. The 35B gets 65+ t/s. I can feed it voice notes after a call and it will create note summaries with action items and issue flags. It's honestly as close to a SOTA cloud model as I've seen. In my opinion, it outputs better quality than bigger models.
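
For anyone curious, the voice-note workflow is basically this. A minimal sketch, assuming the model sits behind an OpenAI-compatible endpoint (which llama.cpp's server and most local stacks expose); the URL, model name, and prompt wording are just my placeholders:

```python
# Sketch: send a call transcript to a locally served model and get back
# a summary with action items and issue flags. Endpoint URL and model
# name are placeholders for whatever your local server exposes.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

def summarize_call(transcript: str) -> str:
    """Turn a raw call transcript into structured notes."""
    response = client.chat.completions.create(
        model="local-model",  # placeholder; many local servers ignore this
        messages=[
            {"role": "system", "content": (
                "You summarize call notes for a law firm. Output a short "
                "summary, a bulleted list of action items, and any issues "
                "that should be flagged for follow-up."
            )},
            {"role": "user", "content": transcript},
        ],
        temperature=0.2,  # keep summaries consistent rather than creative
    )
    return response.choices[0].message.content
```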

I'd like to run it on my server, but 24 GB of VRAM isn't enough. I'm looking at adding a second 7900 XTX, which now runs $1,200 to $1,300. A 7900 XT with 20 GB of VRAM is $700. My question: are the extra memory bandwidth, 4 GB of VRAM, and compute performance worth the $500 difference? I assume the PCIe link becomes the true bottleneck when splitting a model across cards, so memory bandwidth probably matters less. No one has a crystal ball, but I'm worried that the extra 4 GB could be very helpful if future models get bigger context windows or more parameters.
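
For anyone checking my math, here's the back-of-envelope I'm working from. It assumes roughly 6.5 effective bits per weight for a Q6-style quant; these are approximations, not measurements:

```python
# Rough estimate of why 24 GB isn't enough and how much headroom 44 vs 48 GB buys.
params_b = 35          # parameters, in billions
bits_per_weight = 6.5  # approximate effective size of a Q6-style quant
weights_gb = params_b * bits_per_weight / 8
print(f"~{weights_gb:.1f} GB for weights alone")  # ~28.4 GB, over one 24 GB card

# KV cache is allocated on top of the weights and grows with context length,
# so the extra 4 GB mostly buys longer context rather than a bigger model.
```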

As a secondary question, does anyone have experience with ROCm on Linux running a single model across two cards? I know this would be simple with CUDA, but I'm not willing to pay Nvidia prices for 44 to 48 GB of VRAM.
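
To be concrete about what I mean by one model on two cards: here's roughly what the split would look like in llama-cpp-python. This is a sketch only, since I haven't tried it on ROCm myself; the model path and split ratio are placeholders, and it assumes llama.cpp built with HIP support:

```python
# Sketch: split one GGUF model across two GPUs by layers.
# With the default layer split, only activations cross PCIe at layer
# boundaries, which is why I assume the link matters less than raw
# memory bandwidth for this workload.
from llama_cpp import Llama

llm = Llama(
    model_path="/models/model-q6_k.gguf",  # placeholder path
    n_gpu_layers=-1,            # offload all layers to GPU
    tensor_split=[0.55, 0.45],  # per-card share, e.g. 24 GB card vs 20 GB card
    n_ctx=32768,
)
```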
