Hi guys, I want to run a strong local model as the AI agent for OpenCode (and possibly Hermes in the future). Maybe I can find someone here with a bit more knowledge and experience than me :)
I am running Windows 11 with the following setup:
RTX 5090 (32GB VRAM, Blackwell architecture)
Intel Core Ultra 9 285K (24 cores / 24 threads, up to 5.7 GHz)
64GB DDR5 RAM
Previously I had only used Ollama for other models, until I read that vLLM is faster and more efficient, and that FP8 in particular is said to be close to FP16 in quality.
I initially considered the FP8 version of Qwen 3.6 27B, since the weights alone should fit into roughly 27GB of VRAM. However, I assume the KV cache for the context window, plus runtime overhead, makes it difficult to actually run reliably on a 32GB GPU.
So I am trying to figure out the best alternative:
Q8 in Ollama
Q6 as a possible sweet spot
rotation-based or other improved quantization variants (if relevant)
or MLC-LLM since it is Windows-native in some setups
My starting preference is the 27B model, but if hardware constraints force a compromise, maybe the 35B model at a lower quantization would be the better trade-off.
The question is also which combination makes the most sense:
27B vs 35B
and which quantization (Q6 / Q8 / FP8 / other)
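To compare those combinations, here is a quick weight-only footprint table. The bits-per-weight figures are approximate GGUF numbers (Q8_0 = 8.5 bpw including scales, Q6_K ≈ 6.56 bpw), and KV cache plus overhead come on top of these:

```python
# Approximate weight-only size for each model/quant combination.
# bpw values are approximate GGUF figures; KV cache/overhead not included.
GB = 1024**3
bpw = {"Q6_K": 6.56, "Q8_0": 8.5, "FP8": 8.0}

for params_b in (27, 35):
    for name, bits in bpw.items():
        size_gb = params_b * 1e9 * bits / 8 / GB
        print(f"{params_b}B @ {name}: {size_gb:5.1f} GB")
```

On these numbers, 35B at Q8_0 or FP8 already exceeds 32 GB before any KV cache, so the realistic pairings look like 27B at Q8/FP8 with a short context, or either model at Q6 with room for a larger context.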
If anyone has tested these models on a similar setup (single RTX 5090), I would really appreciate recommendations for the best possible configuration. Even if you don't have the exact same setup, any relevant experience or suggestions are welcome.
Thank you! :D