u/ex-arman68 — reddlx

The pacman benchmark: finally a viable local agentic coding agent with Qwen 3.6 27b

One way I like to test new models is by one-shotting (with a good prompt) a single-webpage clone of the classic arcade game pacman. I usually do 3 attempts and keep the best one. So far all of them, including Anthropic, ChatGPT and Google models, have failed, most of them miserably. The best one until now was GLM 5.1.

That was until I tried it with Qwen 3.6 27b F16. Out of 3 attempts, 2 were the best by far, with the top result having only minor errors! However, as soon as I dropped to 8-bit quantisation, I could not replicate those good results even after trying 5+ times. This goes to show what I have been saying for a long time, based on my experience: there is a world of difference between a 16-bit and an 8-bit quant, despite most people claiming 8-bit is lossless, or nearly lossless.

The results were so good that I decided to go further. I happened to be testing the llama.cpp MTP speculative decoding PR (not yet merged at that time) with my own quants, and developing my own fixed jinja chat template for Qwen 3.5/3.6, so I thought: why not try to push Qwen 3.6 27b F16 through a proper agentic coding workflow? I think the results were brilliant, and they speak for themselves. You can try the full single page game here:

https://guigand.com/pacman

Lessons learned and observations:

* A good chat template is critical. The official chat template was unusable because it only targets vLLM, and is therefore full of errors in other tools. I started with community templates, which were improvements, but still had many quirks. This is why I started fixing the bugs one by one in the official template, and slowly improving it. The beginning of the agentic sessions was painful due to many quirks and errors. But slowly it improved, and once I got the template well tuned, it felt like I had unlocked a new level of intelligence in the model.

* MTP speculative decoding does not accelerate all tasks identically. Basically it is most efficient at deterministic tasks like coding, and least efficient at creative tasks like brainstorming. I wrote about it here: https://www.reddit.com/r/LocalLLaMA/comments/1t9gcar/mtp_benchmark_results_the_nature_of_the/ - For this pacman development, my generation speed varied between 8 tok/s and 18 tok/s depending on the task. For reference, without MTP, I get 6.6 tok/s with the same model and quant.

* Not all harnesses are equal, both in terms of code quality and in terms of impact on speed. Most of us already know that the coding harness has a huge impact on quality, with Claude Code being considered the gold standard; this is what I use for normal daily coding. In this case I started with Qwen CLI, mostly because of the chat template problems, on the principle that if there was one harness more likely to handle Qwen LLM specifics well, it would be their own harness. I was actually pleasantly surprised, and Qwen CLI delivered far beyond what I was expecting! In the later stages, I switched back to Claude Code, mostly to verify that the final chat template was working properly there too. I did not notice any improvement in process or code quality. What I did notice, though, is that developing in Claude Code was a lot slower than in Qwen CLI! This is due to all the extra prompts built into Claude Code. With a local model that has such a slow tok/s, it can make the difference between being usable and being borderline hair-pulling...

* Context management and caching are super efficient in this model. Do not interfere with them. They work great, so let them do their thing. Do not use any skill, plugin, etc, that manipulates the cache or context. Doing so confuses the model and makes it a lot dumber and more error-prone.

* Tool calls, context compaction, shell usage, subagents, and parallel subagents all work flawlessly. Initially they did not, though, and it took me a long time and lots of work to get it right through chat template fixes and improvements. I actually only used context compaction for testing, and it was fine, as usual in Claude Code.

* High context is usable without too much degradation. Maximum context size is 256k tokens I believe. Most of the time I planned the tasks to stay below 100k, but there were a few times I pushed it slightly over 150k. I did notice slightly reduced capabilities, but nothing major. I tried to keep it low mainly to get the best reasoning capabilities, as with all other models, but also because speed started to decrease as the context usage grew.

* Apart from Gemini, this is the first model that impressed me with its audio knowledge. As a composer, musician, psychoacoustic scientist, and audio engineer, I pay a lot of attention to good audio. In this case, I tasked it to do some advanced audio manipulation and creation. All the audio in the game comes from Qwen having programmed the web audio synthesizer in a highly advanced and complex way. This is not MIDI, not simple wavetables, not samples. It takes into account psychoacoustic properties tuned to human hearing, with the use of harmonics, distortion, layers, and various effects. Truly impressive work. The only exception is the waka-waka sound, for which I had to make it use a sample (the same method was used in the original arcade game).

* I can live with slow token generation speed. I used to think that I needed a minimum of 70 to 80 tok/s for viable development. But this was usable, gave me time to do other things in parallel, and also to better reflect on the agentic tasks. I would probably not use it for large projects with my current hardware, but for small to medium projects, it is definitely acceptable.

If you have read this far, let me know what you think, and I hope you enjoy the game.

Dev environment: macOS, Apple Silicon M2 Max, 96GB RAM, llama.cpp server with OpenAI and Anthropic API endpoints.

>Edit: Qwen Code has a default timeout of 8 mins, and a default maximum response size of 8000 tokens. With a slower model like this one, I was getting frequent timeouts initially. And with large planning/brainstorming/coding sessions, I was occasionally getting the response truncated, which required reprocessing. I solved it by making the following changes to my ~/.qwen/settings.json file:

  "modelProviders": {
    "openai": [
      {
        ...
        "generationConfig": {
          ...
          "timeout": 1800000,
          "maxRetries": -1,
          "samplingParams": {
            "max_tokens": 32768
          }
        }
      }
    ]
  },
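A rough way to sanity-check these values is to estimate how long a response takes at the generation speeds quoted above (6.6 tok/s without MTP, 8 to 18 tok/s with it). This is only a sketch: it assumes `timeout` is expressed in milliseconds and covers the whole response, and it ignores prompt processing time. The function names are hypothetical, made up for illustration.

```python
def generation_seconds(tokens: int, tok_per_s: float) -> float:
    """Time needed to generate `tokens` tokens at a steady decode speed."""
    return tokens / tok_per_s


def fits_in_timeout(tokens: int, tok_per_s: float, timeout_ms: int) -> bool:
    """True if the generation alone finishes before the timeout expires."""
    return generation_seconds(tokens, tok_per_s) * 1000 <= timeout_ms


DEFAULT_TIMEOUT_MS = 8 * 60 * 1000   # 8 minutes, the default mentioned in the post
NEW_TIMEOUT_MS = 1_800_000           # value from settings.json (assumed to be milliseconds)

for speed in (6.6, 8.0, 18.0):
    for tokens, timeout_ms in ((8000, DEFAULT_TIMEOUT_MS), (32_768, NEW_TIMEOUT_MS)):
        minutes = generation_seconds(tokens, speed) / 60
        verdict = "fits" if fits_in_timeout(tokens, speed, timeout_ms) else "times out"
        print(f"{tokens} tokens at {speed} tok/s: {minutes:.1f} min, "
              f"{verdict} with a {timeout_ms // 60000} min timeout")
```

Under these assumptions, an 8000-token response at 6.6 tok/s takes about 20 minutes, which is consistent with the timeouts seen at the default settings. A full 32768-token response would still run past 30 minutes even at 18 tok/s, so the timeout may need raising further if responses ever get close to that size.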

u/ex-arman68 — 17 hours ago

▲ 27 r/Qwen_AI+1 crossposts

▲ 125 r/LocalLLaMA

MTP benchmark results: the nature of the generative task dictates whether you will benefit (coding) or get slower inference (creative) from speculative inference. No other factor comes close.

I recently published MTP quants of Qwen 3.6 27B and I was surprised by the reports here on Reddit, and on HF, of users who were experiencing worse speed with speculative inference than without. This did not match what I was seeing, but when I tried to reproduce their exact usage, it confirmed what they were experiencing.

I tried to analyse the problem, made a few conjectures which later turned out to be false, and started a full-blown systematic analysis, running 300+ tests and benchmarks, collecting and comparing the results of changing various parameters. This is what I found:

>F16 + MTP nearly triples coding task speed. Q4_K_M + MTP slows down creative writing. Same feature, same model, same settings, opposite results.

I did not test all quant sizes, otherwise I would still be at it a few days from now, but restricted myself to 5 significant ones. The other parameters I varied were task type (4 types), temperature (0.0, 0.3, 0.7), and quantisation of the MTP layer (q8, or matching the model quant). Temp and MTP quant have very little impact on the outcome.

Cumulative average decode speeds with MTP compared to the baseline without MTP, varying the model quant and task type:

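| quant | base tok/s | code | factual | analysis | creative |
|---|---|---|---|---|---|
| Q4_K_M | 15.1 | 19.7 | 17.5 | 14.9 | 13.7 |
| Q5_K_M | 13.1 | 19.2 | 16.5 | 14.7 | 12.6 |
| Q6_K | 13.4 | 20.1 | 17.6 | 15.2 | 13.4 |
| Q8_0 | 11.4 | 25.4 | 21.7 | 18.6 | 16.9 |
| F16 | 6.6 | 17.9 | 14.9 | 12.6 | 11.0 |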

The memory bandwidth dictates how much the model can benefit from speculative decoding. F16 at 51GB crawls at 6.6 tok/s because every token means dragging the full model through memory. Accepted MTP drafts skip that pass. Q4_K_M at 16GB is already fast enough that the draft overhead is barely worth it on anything less predictable than code.
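The bandwidth argument can be checked with back-of-the-envelope arithmetic: if every generated token requires reading the full set of weights once, decode speed cannot exceed memory bandwidth divided by model size. The sketch below assumes 400 GB/s, which is Apple's published peak figure for the M2 Max and not a number I measured; the function name is hypothetical.

```python
# Assumption: Apple's published peak memory bandwidth for the M2 Max.
M2_MAX_BANDWIDTH_GB_S = 400.0


def bandwidth_bound_tok_s(model_size_gb: float,
                          bandwidth_gb_s: float = M2_MAX_BANDWIDTH_GB_S) -> float:
    """Upper bound on decode speed if each token reads all weights once."""
    return bandwidth_gb_s / model_size_gb


# Model sizes and measured baseline speeds are the ones quoted in the post.
for name, size_gb, measured in (("F16", 51, 6.6), ("Q4_K_M", 16, 15.1)):
    bound = bandwidth_bound_tok_s(size_gb)
    print(f"{name}: bound {bound:.1f} tok/s, measured {measured} tok/s "
          f"({measured / bound:.0%} of the bound)")
```

Under that assumption F16 runs at roughly 84% of its bandwidth bound while Q4_K_M reaches only about 60% of its own, which fits the picture of F16 being almost purely memory-bound and therefore gaining the most from every accepted draft token.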

What controls the draft token acceptance rate:

| task | acceptance | examples |
|---|---|---|
| code | 79-89% | writing functions, debugging, refactoring |
| factual | 62-70% | definitions, translation, math proofs |
| analysis | 48-56% | tradeoff breakdowns, technical comparisons |
| creative | 39-48% | stories, poetry, brainstorming, roleplay |

That is a gap of about 40 points from code to creative. I tried three temperatures and five quants. The numbers barely changed. 4 out of 5 draft tokens are correct on coding tasks; not even 1 out of 2 on creative tasks. Nothing else comes close to mattering as much as what you're generating.

I also tested the optimal number of draft tokens for this model in all the above scenarios. 3 is the sweet spot for draft tokens. Go higher and acceptance falls faster than the extra drafts compensate. F16 is the exception: N=4 beats N=3 (17.9 vs 16.2) because at 6.6 tok/s every surviving draft token is worth the lower hit rate.
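To get an intuition for why the acceptance rate matters so much, here is a minimal sketch of the textbook speculative decoding estimate. It assumes every draft token is accepted independently with the same probability, and that a draft is only kept if all drafts before it were kept. This is a simplified model, not what llama.cpp computes internally, and the function name is hypothetical.

```python
def expected_tokens_per_pass(acceptance: float, n_draft: int) -> float:
    """Expected tokens produced per verification pass of the main model.

    The k=0 term is the one token the main model always contributes;
    the k-th term is the probability that the first k drafts all survive.
    """
    return sum(acceptance ** k for k in range(n_draft + 1))


for label, acceptance in (("coding", 0.80), ("creative", 0.45)):
    for n_draft in (2, 3, 4):
        tokens = expected_tokens_per_pass(acceptance, n_draft)
        print(f"{label}: acceptance {acceptance:.0%}, N={n_draft} "
              f"-> {tokens:.2f} tokens per pass")
```

With 80% acceptance and N=3, this gives close to 3 tokens per pass, against about 1.7 at 45%. The model ignores the cost of producing the drafts and the fact that acceptance drops for later draft positions, so it overstates the real gain; it only illustrates why the task type dominates.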

| use case | Q4_K_M | Q5_K_M | Q6_K | Q8_0 | F16 |
|---|---|---|---|---|---|
| coding | 🟢 +31% | 🟢 +47% | 🟢 +50% | 🟢 +123% | 🟢 +171% |
| factual QA | 🟡 +16% | 🟢 +26% | 🟢 +31% | 🟢 +90% | 🟢 +125% |
| analysis | 🔴 -1% | 🟡 +12% | 🟡 +13% | 🟢 +64% | 🟢 +91% |
| creative | 🔴 -9% | 🔴 -4% | 🔴 -1% | 🟢 +48% | 🟢 +67% |

🟢 speeds up, 🟡 marginal gain, 🔴 slowdown.
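The percentages are simply the MTP decode speed relative to the baseline for the same quant. A short sketch that recomputes them from the rounded tok/s figures in the decode speed table (the dictionary and function names are hypothetical):

```python
# Rounded decode speeds (tok/s) copied from the table earlier in the post.
BASE = {"Q4_K_M": 15.1, "Q5_K_M": 13.1, "Q6_K": 13.4, "Q8_0": 11.4, "F16": 6.6}
WITH_MTP = {
    "Q4_K_M": {"code": 19.7, "factual": 17.5, "analysis": 14.9, "creative": 13.7},
    "Q5_K_M": {"code": 19.2, "factual": 16.5, "analysis": 14.7, "creative": 12.6},
    "Q6_K":   {"code": 20.1, "factual": 17.6, "analysis": 15.2, "creative": 13.4},
    "Q8_0":   {"code": 25.4, "factual": 21.7, "analysis": 18.6, "creative": 16.9},
    "F16":    {"code": 17.9, "factual": 14.9, "analysis": 12.6, "creative": 11.0},
}


def speedup_percent(quant: str, task: str) -> int:
    """Relative change in decode speed with MTP, as a whole percentage."""
    return round((WITH_MTP[quant][task] / BASE[quant] - 1) * 100)


for quant in BASE:
    row = ", ".join(f"{task} {speedup_percent(quant, task):+d}%"
                    for task in WITH_MTP[quant])
    print(f"{quant}: {row}")
```

Because the inputs here are already rounded to one decimal, three cells come out one point away from the table (Q6_K creative, Q8_0 analysis, F16 factual); the table itself was computed from the unrounded measurements.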

Q8_0 and F16: always use speculative decoding with the MTP layer.
Coding tasks at any quant: keep it on.
Creative tasks at Q4_K_M (and below): keep it off.

One last observation: with thinking mode turned on for coding tasks, Q8_0 draft token acceptance drops from 87% to 73%. That is still a +94% speedup, just not the full +123%.

Test environment: Apple Silicon M2 Max 96GB, llama.cpp manual build with the MTP PR, Qwen3.6-27B with MTP layers preserved.

u/ex-arman68 — 9 days ago

▲ 40 r/ZaiGLM

I received the following email from z.ai - glad to see they were actually listening and working on fixing the problem. Better communication as they were doing so would have been good PR, but in the end, the result is there.

Hi developers,

Some of you flagged occasional garbled outputs and unexpected behavior when building with the GLM-5 series, especially under heavy workloads. We heard you, reproduced the issues, and the fixes are now live.

What looked like model degradation turned out to be an infrastructure issue. It's now fully resolved.

You may have noticed:
- Abnormal outputs reduced to near-zero levels.
- Faster TTFT and more reliable serving during peak concurrency.

For those interested in the technical details, we wrote up the full story here: z.ai/blog/scaling-pain. We've also contributed one of the fixes back to the SGLang community.

Thank you for building with us, and for flagging these.

The Z.ai Team

u/ex-arman68 — 18 days ago

▲ 1.1k r/LocalLLaMA

>This model seems utterly broken for now. I do not recommend downloading or using it, unless you are planning to help troubleshoot it. This is not a problem with the conversion, but with the model itself.

I converted Mistral medium 3.5 128B to MLX 4bit. Eagle model for speculative decoding is not yet supported by MLX.

Vision encoder included (full BF16, unquantized). Thinking mode works (reasoning_effort="high" gives you the [THINK]...[/THINK] chain), tool calling works, 256K context.

There was a bug in mlx-vlm's mistral3 sanitize function: it wasn't stripping the `model.` prefix from vision tower and projector keys. This caused 438 parameters to be skipped. I patched it locally before converting. Details in the HF readme.
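To illustrate the kind of fix involved, here is a minimal sketch of stripping a `model.` prefix from selected weight keys. This is not the actual mlx-vlm code or my exact patch: the function name, the component names and the example keys are all hypothetical, chosen only to show the idea.

```python
def strip_model_prefix(weights: dict,
                       components=("vision_tower", "multi_modal_projector")) -> dict:
    """Return a copy of `weights` where keys such as 'model.vision_tower.x'
    become 'vision_tower.x'. Keys for other components are left untouched."""
    sanitized = {}
    for key, value in weights.items():
        if any(key.startswith(f"model.{name}.") for name in components):
            key = key[len("model."):]
        sanitized[key] = value
    return sanitized


# Hypothetical checkpoint keys, with placeholder values instead of tensors.
example = {
    "model.vision_tower.patch_conv.weight": 1,
    "model.multi_modal_projector.linear_1.weight": 2,
    "model.language_model.layers.0.mlp.weight": 3,
}
print(strip_model_prefix(example))
```

If the loader expects the unprefixed names, any key that keeps the prefix never matches a parameter and is skipped without an obvious error, which is how a few hundred parameters can go missing.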

I am getting ~5 tok/s on a 96 GB M2 Max. For sampling I recommend using temp 0.7 / top_p 0.95 / top_k 20 in reasoning mode, or temp 0.0–0.7 / top_p 0.8 for quick replies. Mistral recommends leaving repeat penalty disabled, but I am getting too many loops; I am not sure what the best value should be.

u/ex-arman68 — 19 days ago