Qwen3.6-35B Q5_K_XL vs Qwen3.6-27B Q3_K_M on 16 GB VRAM
Hello
I currently use Qwen3.6-35B Q5_K_XL without MTP on a 4070 Ti Super 16GB, in a system with 32GB of DDR5 and a 7800X3D CPU.
I get it to fit by offloading some of the experts to the CPU.
I get 60 t/s for generation. My K/V cache is quantized to q8 and I use a 128k context size. If I try a 256k context, I drop to 50 t/s.
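For a rough idea of what the context size costs in memory, here is a back-of-envelope sketch of the K/V cache size. The layer count, KV head count and head dimension below are placeholders I made up for illustration, not the real Qwen3.6 values, and it assumes every layer uses standard attention. The 34/32 bytes per element is the q8_0 block layout (32 int8 values plus one fp16 scale).

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """Estimate K/V cache size: K and V each hold
    n_kv_heads * head_dim values per layer per token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem

# q8_0 stores 32 int8 values plus a 2-byte fp16 scale per block.
Q8_0_BYTES_PER_ELEM = 34 / 32

# HYPOTHETICAL architecture values, for illustration only.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 48, 4, 128

for n_ctx in (128 * 1024, 256 * 1024):
    size = kv_cache_bytes(n_ctx, N_LAYERS, N_KV_HEADS, HEAD_DIM, Q8_0_BYTES_PER_ELEM)
    print(f"{n_ctx // 1024}k context: {size / 2**30:.2f} GiB")
```

Whatever the real numbers are, the cache grows linearly with context, so going from 128k to 256k doubles it, which probably forces more experts off the GPU.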
But I sometimes find the model dumb, maybe because the active experts are not the best ones. For example, it cannot add a field on the frontend (Angular) and bind it to the backend (C#) in one prompt. I tried Qwen3.6 27B-Q4; that model can do it, but it is very slow (about 5x longer).
So I tried Qwen3.6-27B Q3_K_M. It can do the Angular + C# task, but I noticed some syntax errors, although it fixes them itself after linting.
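To put numbers on the two 27B quants, here is a quick estimate of the weight size alone. The bits-per-weight figures are approximate values I am assuming for Q3_K_M and a Q4_K_M-style Q4 (real files vary by model and quant recipe), and the parameter count is just taken from the model name.

```python
def weight_gib(n_params, bits_per_weight):
    """Approximate size of the quantized weights in GiB."""
    return n_params * bits_per_weight / 8 / 2**30

N_PARAMS = 27e9  # taken from the "27B" in the model name

# ASSUMED average bits per weight; actual GGUF files differ somewhat.
APPROX_BPW = {"Q3_K_M": 3.9, "Q4_K_M": 4.85}

for name, bpw in APPROX_BPW.items():
    print(f"{name}: ~{weight_gib(N_PARAMS, bpw):.1f} GiB of weights")
```

If those assumptions are close, the Q4 weights alone nearly fill 16 GB before any K/V cache, which might explain why it was so slow for me, while Q3_K_M leaves a few GB free.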
Is the quantization the problem? Is Q3 too low?
Or is there a way to tell the model, through the prompt, to reset the active experts between the backend and the frontend?
Thanks