
LM Studio Performance Test: Qwen 3.6 27B vs 35B-A3B on RTX 5070 Ti (32 GB RAM)
I did some extensive testing in LM Studio (v0.4.12) to figure out the best settings for the Qwen 3.6 models (27B vs. 35B-A3B) on my rig (RTX 5070 Ti, 7800X3D, 32 GB RAM, Windows, CUDA).
You can check out the full raw data of my test runs (Context Length, GPU Offload, KV-Cache Quantization) in my spreadsheet here: https://docs.google.com/spreadsheets/d/1Ksqlme6OzRyD0K7lRZUkItA1hUjDO5WDCuqJWraXC-U/edit?usp=sharing
Here is a summary of my main takeaways:
1. 35B-A3B (MoE) clearly beats the 27B model
Even though the 35B is nominally larger, its MoE architecture (fewer active parameters per token) makes it run much more efficiently locally. The 27B model hits brutal VRAM cliffs (dropping from 13 to 0.7 tok/s just by increasing GPU offload slightly).
2. Expert Offloading & KV-Cache are game changers for Long Context
Initially, my performance at 262k context was terrible (~4 tok/s). The breakthrough came with these two tweaks (see the sketch after takeaway 3 for the equivalent llama.cpp settings):
- Number of layers to force Experts in CPU: 2
- KV Cache Quantization: Q8_0/Q8_0
This instantly boosted my speed to almost 40 tok/s on short prompts!
3. Short Prompts vs. Real-World Tests
Synthetic "Hello" prompts give you great numbers (~40 tok/s). However, when testing a real task using my master's thesis (around 33k tokens), the model settled at a very solid 17 to 21 tok/s.
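For anyone who wants to replicate the tweaks from takeaway 2 outside the LM Studio UI: LM Studio runs llama.cpp under the hood, and the same knobs exist there. Below is a minimal sketch using llama-cpp-python; the model path is a placeholder and the exact kwargs depend on your installed version, so treat it as an illustration rather than a drop-in config.

```python
# Minimal sketch: KV-cache quantization via llama-cpp-python.
# Assumptions: placeholder model path; kwargs current as of recent versions.
from llama_cpp import Llama, GGML_TYPE_Q8_0

llm = Llama(
    model_path="Qwen-35B-A3B-Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=25,        # GPU Offload from the 64k sweet spot below
    n_ctx=65536,            # 64k context
    flash_attn=True,        # llama.cpp needs flash attention for a quantized V cache
    type_k=GGML_TYPE_Q8_0,  # KV Cache Quantization: Q8_0 for keys...
    type_v=GGML_TYPE_Q8_0,  # ...and Q8_0 for values
)

# "Number of layers to force Experts in CPU: 2" has no kwarg here that I'm
# sure of; with the llama.cpp CLI the equivalent is --n-cpu-moe 2 (or an
# --override-tensor regex pinning the first two blocks' expert tensors to CPU).
```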
My Sweet Spots (35B-A3B Q4_K_M):
- For general use (64k Context): GPU Offload 25, KV-Cache Q8_0, Experts forced to CPU 2, Max Concurrent 1. (Result: ~21 tok/s in real-world test)
- For max context (262k Context): GPU Offload 21, KV-Cache Q8_0, Experts forced to CPU 2, Max Concurrent 1. (Result: ~17 tok/s in real-world test)
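In case you want to verify these numbers on your own hardware: LM Studio exposes an OpenAI-compatible server (default http://localhost:1234/v1), so you can time a long prompt and compute tok/s from the returned usage stats. A rough sketch, assuming the server is running with the model loaded; the model identifier and "thesis.txt" are placeholders:

```python
# Rough throughput check against LM Studio's OpenAI-compatible local server.
import time
import requests

URL = "http://localhost:1234/v1/chat/completions"

with open("thesis.txt", encoding="utf-8") as f:
    long_prompt = f.read()  # stand-in for the ~33k-token real-world input

payload = {
    "model": "qwen-35b-a3b",  # use the identifier LM Studio shows for your model
    "messages": [{"role": "user", "content": f"Summarize this:\n\n{long_prompt}"}],
    "max_tokens": 512,
    "stream": False,
}

start = time.time()
resp = requests.post(URL, json=payload, timeout=3600).json()
elapsed = time.time() - start

completion_tokens = resp["usage"]["completion_tokens"]
# Note: this lumps prompt processing in with generation, so it will read
# lower than LM Studio's generation-only tok/s display.
print(f"{completion_tokens} tokens in {elapsed:.1f}s -> {completion_tokens / elapsed:.1f} tok/s")
```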
Conclusion: Pushing GPU offload to the maximum isn't always best. The sweet spot is right before the VRAM cliff. Once Windows starts using shared GPU memory, performance tanks entirely.
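A practical way to find that cliff before it bites: watch dedicated VRAM while you raise the offload layer count. Here's a quick polling sketch around nvidia-smi (the flags are standard; the parsing assumes the usual CSV output):

```python
# Quick VRAM watcher: when memory.used sits near memory.total, the next
# offload increase will likely spill into shared memory and tank speed.
# Caveat: nvidia-smi reports dedicated VRAM only; shared-memory spill
# shows up in the Windows Task Manager instead.
import subprocess
import time

while True:
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        text=True,
    ).strip()
    used, total = (int(x) for x in out.split(","))
    print(f"VRAM: {used}/{total} MiB ({100 * used / total:.0f}%)")
    time.sleep(2)
```

If usage is already pinned near 100% before generation starts, dialing the offload back a layer or two usually lands you right at the sweet spot described above.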
Flo