u/Cyber_Ghost

Evaluating Gemma 4 vs Quentin 3.6/3.5 models on my hardware

As I mentioned in a few comments around the subreddit, I spent the last few days running a homegrown eval across four models on my 2x Intel Arc Battlemage rig, with Claude's assistance.

Claude wrote the methodology and prompts and served as judge. The full numbers and writeup are available here: https://github.com/pelegw/llm-eval

This is not a leaderboard-style benchmark. It's a small eval for things I actually care about and that I feel Claude can grade objectively: reasoning, coding, code quality (correctness + robustness + ruff/ast static analysis), instruction following, long-context retrieval, writing (rubric-scored), and synthetic single-step tool calling. There are two tiers per capability: a base "sanity floor" and a "hard" set built to actually discriminate between strong models. Every prompt runs twice, once with thinking on and once with thinking off.
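
The ruff/ast half of the code-quality grade is the only fully mechanical part: parse the model's snippet with ast, lint it with ruff, turn findings into a score. Here's a minimal sketch of that idea; the function name and the 0.1 penalty weight are made up for illustration, not what the repo actually does:

```python
import ast
import os
import subprocess
import tempfile

def static_quality_score(code: str) -> float:
    """0.0 if the snippet doesn't parse, else 1.0 minus a lint penalty."""
    try:
        ast.parse(code)  # hard gate: syntactically invalid code scores zero
    except SyntaxError:
        return 0.0
    fd, path = tempfile.mkstemp(suffix=".py")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(code)
        # ruff prints one "path:line:col: CODE message" finding per line
        result = subprocess.run(
            ["ruff", "check", path], capture_output=True, text=True
        )
        findings = [ln for ln in result.stdout.splitlines() if ln.startswith(path)]
    finally:
        os.unlink(path)
    # illustrative penalty weight; the real rubric also weighs correctness/robustness
    return max(0.0, 1.0 - 0.1 * len(findings))
```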

Models:

* gemma-4-26b-a4b (MoE ~4B active) at Q8

* gemma-4-31b (dense) at Q5

* qwen3.6-35b-a3b (MoE ~3B active) at Q8

* qwen3.5-122b-a10b (MoE ~10B active) at Q3_K_XL (the big quant asterisk)

With the two Gemma models leading, it seems that 26b-a4b sometimes overthinks itself into a loop and never returns an answer at all, while 31B is slower but more robust. How much of that shows up under real-world use remains to be seen in actual work. With the results this close between Gemma and Qwen, I can see how variations in output might sway people toward one or the other.

Despite being the biggest model here, qwen3.5 really seems to have suffered from the small quant I used; I may rerun it at a higher quant if I get access to more VRAM.

Some caveats on the evaluation:

  1. Sampling follows each vendor's recommendation (Gemma: temp 1.0 / top_p 0.95 / top_k 64; both Qwens: temp 0.7 / top_p 0.8 / top_k 20 / presence_penalty 1.5), so the cross-comparison isn't sampling-identical. A sampling-matched rerun would tighten the rankings. The exact settings are spelled out as request kwargs after this list.

  2. The hard tier should probably be harder for frontier-class models; it's calibrated for the local cohort. Claude built it initially with this understanding in mind, and I did not want to modify it mid-run.

  3. The eval doesn't test long-horizon agentic loops or multi-step tool chains, just the single-step "given a tool spec, call it right" pattern; a minimal example of that pattern is sketched below.
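
To make caveat 1 concrete, here are those settings as extra request kwargs for an OpenAI-compatible local endpoint (llama.cpp-style servers accept `top_k` as an extension; it's not part of the strict OpenAI schema). The dict shape is just how I'd pass them, not anything the repo mandates:

```python
# Per-vendor sampler settings used in the runs, straight from the
# vendor recommendations quoted in caveat 1.
GEMMA_SAMPLING = {"temperature": 1.0, "top_p": 0.95, "top_k": 64}
QWEN_SAMPLING = {"temperature": 0.7, "top_p": 0.8, "top_k": 20,
                 "presence_penalty": 1.5}

SAMPLING = {
    "gemma-4-26b-a4b": GEMMA_SAMPLING,
    "gemma-4-31b": GEMMA_SAMPLING,
    "qwen3.6-35b-a3b": QWEN_SAMPLING,
    "qwen3.5-122b-a10b": QWEN_SAMPLING,
}
```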

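And here's the shape of the single-step tool-calling test from caveat 3: one tool spec, one prompt, pass/fail on the emitted call. The weather tool and grader below are illustrative stand-ins for the synthetic specs in the repo, using the standard OpenAI tool-call response format:

```python
import json

# A single illustrative tool spec in OpenAI function-calling format.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

def grade_tool_call(response_message: dict) -> bool:
    """Pass iff the model emitted exactly one get_weather call with a usable city arg."""
    calls = response_message.get("tool_calls") or []
    if len(calls) != 1:
        return False
    call = calls[0]["function"]
    if call["name"] != "get_weather":
        return False
    try:
        args = json.loads(call["arguments"])  # arguments arrive as a JSON string
    except json.JSONDecodeError:
        return False
    return isinstance(args.get("city"), str) and bool(args["city"].strip())
```
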
So that's where it stands for me right now: I'll probably keep Gemma 31B as my daily driver, especially since with MTP coming it should get snappier and even more useful.
