u/One_Key_8127

🔥 Hot ▲ 212 r/LocalLLaMA

Gemma 4 is good

Waiting for artificialanalysis to produce an intelligence index, but I can already see it's good. Gemma 26b a4b runs at the same speed on a Mac Studio M1 Ultra as Qwen3.5 35b a3b (~1000 pp, ~60 tg at 20k context length, llama.cpp). And in my short test, it behaves way, way better than Qwen, not even close. Chain of thought on Gemma is concise, helpful, and coherent, while Qwen does a lot of inner-gaslighting and also loops a lot on default settings. Visual understanding is very good, and multilingual seems good as well. Tested Q4_K_XL on both.

I wonder if mlx-vlm properly handles prompt caching for Gemma (it doesn't work for Qwen 3.5).

Too bad its KV cache is gonna be monstrous, as it did not implement any tricks to reduce that; hopefully TurboQuant will help with that soon. [edit] SWA gives some benefits, so the KV cache is not as bad as I thought: people report that a full 260K tokens @ fp16 is like 22GB of VRAM (for the KV cache alone; the quantized model is another ~18GB @ Q4_K_XL). It is much less compact than in Qwen3.5 or Nemotron, but I can't say they did nothing to reduce the KV cache footprint.
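For anyone who wants to sanity-check numbers like that "260K tokens ≈ 22GB" report, here's a back-of-envelope sketch of how KV cache size scales, including the SWA effect. All the architecture numbers in the example (layer count, KV heads, head dim, window size) are made up for illustration, not Gemma's actual config:

```python
# Sketch: estimating KV-cache size for a transformer that mixes
# full-attention and sliding-window-attention (SWA) layers.
# Architecture numbers below are hypothetical, not any real model's config.

def kv_cache_bytes(tokens, n_layers, n_kv_heads, head_dim,
                   bytes_per_elem=2, swa_layers=0, window=0):
    """Total bytes for the K and V caches across all layers.

    Full-attention layers cache every token; SWA layers cache at
    most `window` tokens, which is where the savings come from.
    """
    full_layers = n_layers - swa_layers
    per_token_per_layer = 2 * n_kv_heads * head_dim * bytes_per_elem  # K + V
    full_bytes = full_layers * per_token_per_layer * tokens
    swa_bytes = swa_layers * per_token_per_layer * min(tokens, window)
    return full_bytes + swa_bytes

# Hypothetical: 48 layers, 8 KV heads of dim 128, fp16 cache,
# 40 of the layers using a 4K sliding window.
gb = kv_cache_bytes(260_000, 48, 8, 128, 2, swa_layers=40, window=4096) / 1e9
print(f"~{gb:.1f} GB")
```

The point is just that swapping most layers to SWA cuts the long-context cache by an order of magnitude versus all-full-attention, which matches the "not as bad as I thought" edit above.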

I expect censorship to be dogshit; I saw that e4b loves to refuse any and all medical advice. Maybe good prompting will mitigate that, as "heretic" and "abliterated" versions seem to damage performance in many cases.

No formatting because this is handwritten by a human for a change.

[edit] Worth noting that Google's AI Studio version of Gemma 26b a4b is very bad. It underperforms my GGUF with tokenizer issues :)

u/One_Key_8127 — 16 hours ago
▲ 15 r/unsloth

Qwen3.5 27b UD_IQ2_XXS & UD_IQ3_XXS behave very poorly, or is it just me?

I downloaded unsloth studio and tried Qwen3.5 27b UD_IQ2_XXS & UD_IQ3_XXS. For starters, I gave them a hard captcha image to solve to see how they reason and whether they can interpret the image well. They both keep looping for thousands of tokens and don't produce correct results. On top of that, IQ3_K_XXS takes over 16.5GB of VRAM when loaded on my laptop (via task manager / llama-server GPU memory use), making it not fit on my 16GB card, even though it's supposed to be 11GB.
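On the "11GB file takes 16.5GB" point: loaded VRAM is roughly weights + KV cache at whatever context the server allocates + compute/scratch buffers, so the gap isn't necessarily a bug. A rough sketch, where every number is an illustrative guess rather than a measurement of this model:

```python
# Rough sketch of why llama-server VRAM use can exceed the GGUF file size:
# weights + preallocated KV cache + compute buffers.
# All numbers here are illustrative guesses, not measured values.

def vram_estimate_gb(weights_gb, ctx_tokens, kv_bytes_per_token, overhead_gb=1.0):
    """Crude total-VRAM estimate for a loaded model."""
    kv_gb = ctx_tokens * kv_bytes_per_token / 1e9
    return weights_gb + kv_gb + overhead_gb

# e.g. an 11 GB quant with a 40K-token context at ~100 KB/token of KV cache
# plus ~1.5 GB of compute buffers is already at 16.5 GB:
print(vram_estimate_gb(11, 40_000, 100_000, overhead_gb=1.5))  # 16.5
```

So before blaming the quant, it's worth checking what context size llama-server allocated by default and whether the KV cache is fp16; shrinking `--ctx-size` or quantizing the cache might get it back under 16GB.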

Qwen3.5 9b UD-Q4_K_XL, on the other hand, reasons correctly and handles the task very well. Does anyone have similar observations? I had high hopes for low quants of Qwen3.5, but from my tests it looks like they degrade heavily, and they're not a viable way of using these models. Can you share your observations of how well these quants perform for you?

u/One_Key_8127 — 4 days ago