u/MistingFidgets

5060 Ti 16GB benchmark data: +43–55% on Qwen3.6-35B-A3B with llama.cpp's ngram-mod (149 tok/s at 16k context depth)

Here's an optimization in llama.cpp that gives meaningful decode speedup on long-context workloads. Sharing the result + config.

Model: Qwen3.6-35B-A3B Opus-Distill (UD-IQ2_M quant, ~14 GB)

Hardware: RTX 5060 Ti 16GB (Blackwell)

Method: 256-token natural summarization output at each context depth, averaged over 2 runs after 1 warmup run.

Results:

| Depth | Baseline | + ngram-mod | Speedup | Wall saved/response |
|---|---|---|---|---|
| 0 (cold) | 107 t/s | 123 t/s | 1.15x | ~0.3s |
| 16K | 96 t/s | 149 t/s | 1.55x | ~0.9s |
| 32K | 88 t/s | 137 t/s | 1.55x | ~1.0s |
| 65K | 76 t/s | 108 t/s | 1.43x | ~1.0s |

At deep context, every response shaves about a full second off the wait time. Cold-cache depth=0 sees only modest gain — the n-gram cache hasn't accumulated enough patterns to draft from on the very first request. Speedup grows once the conversation has context to mine.

Why ngram-mod specifically:

llama.cpp has four n-gram speculative decoding modes (`--spec-type ngram-simple`, `ngram-map-k`, `ngram-map-k4v`, `ngram-mod`). I tested all four. The first three lost to baseline on this model: their ~12% acceptance rate doesn't overcome the speculation overhead. Only ngram-mod wins, because it uses a shared hash pool (~16 MB) that persists across requests and accumulates patterns over time. Acceptance rate at depth ranges from 35–90% depending on how repetitive the output is (tool calls, JSON, and restated values benefit most).
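To make the mechanism concrete, here's a minimal sketch of the idea in Python (my own illustration with made-up names and sizes, not llama.cpp's actual code): a fixed-size hash table keyed on the last few tokens of context, storing whatever followed them the last time they were seen.

```python
# Hypothetical sketch of a cross-request n-gram draft pool (not llama.cpp's
# implementation): a fixed-size table maps a hash of the last few tokens to
# the tokens that followed them last time, giving cheap draft proposals.
from typing import List, Optional, Tuple

class NgramPool:
    def __init__(self, num_slots: int = 1 << 16, match_len: int = 4, draft_len: int = 32):
        self.match_len = match_len  # n-gram length used as the lookup key
        self.draft_len = draft_len  # max tokens proposed per draft
        self.num_slots = num_slots
        self.keys: List[Optional[Tuple[int, ...]]] = [None] * num_slots
        self.vals: List[List[int]] = [[] for _ in range(num_slots)]

    def update(self, tokens: List[int]) -> None:
        """Index every n-gram in a finished sequence; the pool outlives requests."""
        n = self.match_len
        for i in range(len(tokens) - n):
            key = tuple(tokens[i:i + n])
            slot = hash(key) % self.num_slots
            self.keys[slot] = key  # overwrite on collision (bounded memory)
            self.vals[slot] = tokens[i + n : i + n + self.draft_len]

    def draft(self, tokens: List[int]) -> List[int]:
        """Propose a continuation for the current context tail, if seen before."""
        key = tuple(tokens[-self.match_len:])
        slot = hash(key) % self.num_slots
        return self.vals[slot] if self.keys[slot] == key else []
```

Because the table outlives individual requests, later turns in a conversation can draft from patterns mined earlier; a cold first request has nothing to look up, which matches the modest depth-0 gain in the table above.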

Zero quality risk: the main model verifies every drafted token and keeps only the ones it would have produced itself, so with greedy decoding the output is bit-identical to baseline (and with sampling, the output distribution is unchanged). Worst case if patterns don't repeat: ~1–2% slowdown from speculation overhead. Cold-cache requests run at ~baseline speed.
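The verification step, again as a toy sketch rather than the real implementation: the main model checks each drafted position and only the prefix it agrees with survives. In practice all draft positions are scored in one batched forward pass, so a failed draft costs roughly one normal decode step, which is where the ~1–2% worst case comes from.

```python
# Toy greedy verification (my own sketch): keep the longest draft prefix the
# main model agrees with, plus the model's one "free" token from the same pass.
from typing import Callable, List

def verify_draft(
    context: List[int],
    draft: List[int],
    argmax_next: Callable[[List[int]], int],  # stand-in for one batched model pass
) -> List[int]:
    accepted: List[int] = []
    for tok in draft:
        expected = argmax_next(context + accepted)  # what the model would emit here
        if tok != expected:
            accepted.append(expected)  # first mismatch: take the model's own token
            return accepted
        accepted.append(tok)           # match: drafted token verified for free
    # Whole draft accepted: the same pass also yields one bonus token.
    accepted.append(argmax_next(context + accepted))
    return accepted
```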

The config (5 flags, append to your llama-server args before --port):

```
--spec-type ngram-mod \
--spec-draft-n-max 32 \
--spec-ngram-mod-n-match 24 \
--spec-ngram-mod-n-min 48 \
--spec-ngram-mod-n-max 64
```

Methodology note: My initial bench showed >4x speedups but I caught a measurement artifact — the bench harness used `ignore_eos=True` which forced the model to keep generating past natural stopping, falling into deterministic loops that ngram-mod could draft at near-100% acceptance. Real-world generation (where EOS is honored and content is non-degenerate) gives the more modest 1.4-1.55x above. If you bench speculation, don't use ignore_eos.
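If you want to reproduce the numbers, here's a minimal harness in the spirit of the method above (my sketch, not the author's harness; it assumes a llama-server instance on localhost:8080 and its /completion endpoint, so double-check field names against your build). Crucially, it honors EOS rather than setting ignore_eos:

```python
# Minimal decode-speed harness (my sketch). Assumes llama.cpp's llama-server
# on localhost:8080; /completion and its "timings" field exist in llama-server,
# but verify against your build. No ignore_eos: the model stops naturally, so
# speculation can't farm degenerate loops.
import time
import requests

URL = "http://localhost:8080/completion"  # adjust host/port to your setup

def bench(prompt: str, n_predict: int = 256, warmup: int = 1, runs: int = 2) -> float:
    rates = []
    for i in range(warmup + runs):
        t0 = time.time()
        r = requests.post(URL, json={"prompt": prompt, "n_predict": n_predict})
        r.raise_for_status()
        body = r.json()
        # Prefer the server's own decode timing; fall back to wall clock.
        rate = body.get("timings", {}).get("predicted_per_second") or (
            body.get("tokens_predicted", n_predict) / (time.time() - t0)
        )
        if i >= warmup:  # discard warmup run(s)
            rates.append(rate)
    return sum(rates) / len(rates)

if __name__ == "__main__":
    doc = open("context_16k.txt").read()  # hypothetical ~16K-token document
    print(f"{bench(doc + chr(10) + 'Summarize the above.'):.1f} tok/s")
```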

TL;DR: Five flags, 1.4-1.55x decode speedup at deep context on a 35B MoE. No new hardware, no quality tradeoff. Bigger gains on workloads with repetition (tool calls, code, reasoning).

u/MistingFidgets — 5 days ago

Created an automated benchmarking suite that uses real-world examples from my OpenClaw bot history to benchmark models on 6 different categories of agentic tasks. The coding test is currently too easy; I'll work on that. These are the best models I've been able to run reliably on an RTX 5060 Ti 16GB for my desired use case: running my OpenClaw bots fully local with a good user experience and a 128k context window. The 2-bit quants are surprisingly good at the agentic work. I suspect they'll show their weaknesses on deeper coding tasks and on precise, complex math, but for tool calling and other general agent tasks they seem to handle everything well enough. Qwen3.6-35B-A3B Opus-Distill is the winner so far. It's been a noticeable improvement over even a Q5 or Q6 4-9B model while running even faster due to the low quantization.
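I don't have the suite itself to share here, but for a sense of the shape such a harness could take (all names, file formats, and the pass check below are my hypothetical stand-ins, not the actual suite): replay category-tagged tasks against a local OpenAI-compatible endpoint and tally per-category pass rates.

```python
# Hypothetical harness shape (not the actual suite): replay category-tagged
# tasks from a JSONL file against a local OpenAI-compatible endpoint.
import json
from collections import defaultdict
import requests

ENDPOINT = "http://localhost:8080/v1/chat/completions"  # local server, adjust

def run_suite(tasks_path: str = "tasks.jsonl") -> None:
    scores = defaultdict(lambda: [0, 0])  # category -> [passed, total]
    with open(tasks_path) as f:
        for line in f:
            task = json.loads(line)  # {"category": ..., "prompt": ..., "expect": ...}
            r = requests.post(ENDPOINT, json={
                "messages": [{"role": "user", "content": task["prompt"]}],
            })
            reply = r.json()["choices"][0]["message"]["content"]
            passed = task["expect"] in reply  # naive substring check as a stand-in
            scores[task["category"]][0] += int(passed)
            scores[task["category"]][1] += 1
    for cat, (p, t) in sorted(scores.items()):
        print(f"{cat:20s} {p}/{t}")
```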

Models tested so far:

Qwen3.6-35B Opus-Distill UD-IQ2_M

Qwen3.6-35B-A3B UD-IQ2_M

Qwen3.6-27B UD-IQ2_M

Qwen3.6-27B UD-IQ3_XXS

Qwen3.5-9B NVFP4

Qwen3.5-4B NVFP4

GPT-OSS 20B Q3_K_M

u/MistingFidgets — 12 days ago

After weeks of pain with NVFP4 on consumer Blackwell (CUDA 13 JIT storms, tool_choice:auto failures, multilingual bleed, etc.), I finally have a stable, fast setup to run my OpenClaw and Hermes bots fully local.

**Winner: cosmicproc/Qwen3.5-4B-NVFP4 (W4A16 Marlin)**

- ~81–97 tok/s single-stream, up to 598 tok/s aggregate

- Excellent tool calling and agent performance

- Running full OpenClaw + Hermes bot locally

- 262K context support with ~120K FP8 KV pool

The infographic walks through the 5 phases it took to get here, all the critical flags, and the hard lessons (including why most other small NVFP4 models still fail at autonomous tool use).

Key fixes included:

- Switching to Marlin backend to dodge flashinfer JIT crashes

- Proper thinkingFormat compat settings for reliable tool_choice:auto

- generation_config.json to kill random Chinese/Cyrillic output (example sketch after this list)

- Prefix caching for blazing TTFT on long system prompts
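On the generation_config.json fix: the exact settings weren't posted, so here's only the shape of such a file, written via Python for illustration. Every value below is a placeholder assumption of mine (conservative sampling defaults of the kind often recommended for Qwen models), not the author's config.

```python
# Illustrative only: write a generation_config.json with pinned sampling
# defaults. The values are placeholder assumptions, not the author's settings.
import json

generation_config = {
    "do_sample": True,
    "temperature": 0.7,         # placeholder: tame sampling to curb language bleed
    "top_p": 0.8,               # placeholder
    "top_k": 20,                # placeholder
    "repetition_penalty": 1.05, # placeholder: discourages degenerate runs
}

with open("generation_config.json", "w") as f:
    json.dump(generation_config, f, indent=2)
```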

Still bleeding edge, but now genuinely usable for local agents.

Drop your own 5060 Ti NVFP4 results or questions below — happy to help others skip the headaches!

u/MistingFidgets — 19 days ago