
RTX 5060 Ti 16GB benchmark data: +43-55% on Qwen3.6-35B-A3B with llama.cpp's ngram-mod (149 tok/s at 16K context depth)
Here's an optimization in llama.cpp that gives a meaningful decode speedup on long-context workloads. Sharing the results + config.
Model: Qwen3.6-35B-A3B Opus-Distill (UD-IQ2_M quant, ~14 GB)
Hardware: RTX 5060 Ti 16GB (Blackwell)
Method: 256-token summarization outputs with natural stopping (EOS honored), averaged over 2 runs after 1 warmup.
Results:
Depth      Baseline   +ngram-mod   Speedup   Wall saved/response
────────────────────────────────────────────────────────────────
0 (cold)   107 t/s    123 t/s      1.15x     ~0.3 s
16K         96 t/s    149 t/s      1.55x     ~0.9 s
32K         88 t/s    137 t/s      1.55x     ~1.0 s
65K         76 t/s    108 t/s      1.43x     ~1.0 s
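Sanity check on the wall-saved column: at 16K depth a 256-token response takes 256 / 96 ≈ 2.67 s at baseline vs 256 / 149 ≈ 1.72 s with ngram-mod, so about 0.95 s saved, matching the ~0.9 s figure above.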
At deep context, every response shaves about a full second off the wait. Cold-cache depth=0 sees only a modest gain because the n-gram cache hasn't yet accumulated patterns to draft from on the very first request; the speedup grows once the conversation has context to mine.
Why ngram-mod specifically:
llama.cpp has four n-gram speculative decoding modes (--spec-type ngram-simple, ngram-map-k, ngram-map-k4v, ngram-mod). I tested all four. The first three lost to baseline on this model: their ~12% acceptance rate doesn't overcome the speculation overhead. Only ngram-mod wins, because it uses a shared hash pool (~16 MB) that persists across requests and accumulates patterns over time. Acceptance rate at depth ranged 35-90% depending on how repetitive the output is; tool calls, JSON, and restated values benefit most.
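To make the mechanism concrete, here's a minimal sketch of the idea behind a cross-request n-gram draft cache. This is NOT llama.cpp's actual implementation; the class name, key length, and cap are made up for illustration:

class NgramDraftCache:
    """Hash table from the last few tokens to the continuation last seen
    after them. Because it outlives any single request, patterns from
    earlier turns (tool calls, JSON keys, restated values) become
    draftable in later ones."""

    def __init__(self, key_len: int = 4, max_draft: int = 32):
        self.key_len = key_len
        self.max_draft = max_draft
        self.table: dict[tuple[int, ...], list[int]] = {}

    def update(self, tokens: list[int]) -> None:
        # Record the continuation observed after every n-gram in the stream.
        for i in range(len(tokens) - self.key_len):
            key = tuple(tokens[i : i + self.key_len])
            self.table[key] = tokens[i + self.key_len : i + self.key_len + self.max_draft]

    def draft(self, tokens: list[int]) -> list[int]:
        # Propose a draft only if the most recent n-gram has been seen before;
        # otherwise return nothing and decoding proceeds at baseline speed.
        if len(tokens) < self.key_len:
            return []
        return self.table.get(tuple(tokens[-self.key_len :]), [])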
Zero quality risk: speculation produces output identical to baseline by construction. The main model verifies every proposed token; only tokens that match what it would have generated anyway are kept. Worst case, when patterns don't repeat, is a ~1-2% slowdown from speculation overhead, and cold-cache requests run at ~baseline speed.
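A minimal sketch of why verification is lossless, assuming greedy decoding. greedy_preds stands in for one forward pass of the main model and is a placeholder, not llama.cpp's API:

from typing import Callable, List

def verify_draft(
    ctx: List[int],
    draft: List[int],
    greedy_preds: Callable[[List[int]], List[int]],
) -> List[int]:
    # One batched pass scores every drafted position at once; that batching,
    # not the draft itself, is where the speedup comes from.
    preds = greedy_preds(ctx + draft)  # preds[j] = greedy token after prefix of length j+1
    accepted: List[int] = []
    for i, tok in enumerate(draft):
        model_tok = preds[len(ctx) + i - 1]  # what baseline would emit given ctx + draft[:i]
        accepted.append(model_tok)           # always keep the main model's own token
        if model_tok != tok:
            break  # first mismatch: discard the rest of the draft
    return accepted  # token-for-token identical to baseline greedy output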
The config (5 flags, append to your llama-server args before --port):
--spec-type ngram-mod \
--spec-draft-n-max 32 \
--spec-ngram-mod-n-match 24 \
--spec-ngram-mod-n-min 48 \
--spec-ngram-mod-n-max 64
Methodology note: my initial bench showed >4x speedups, but I caught a measurement artifact. The bench harness set `ignore_eos=True`, which forced the model to keep generating past its natural stopping point, falling into degenerate loops that ngram-mod could draft at near-100% acceptance. Real-world generation, where EOS is honored and content is non-degenerate, gives the more modest 1.4-1.55x above. If you bench speculation, don't set ignore_eos; a minimal harness is sketched below.
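Here's how I'd reproduce the measurement without the ignore_eos trap: hit llama-server's standard OpenAI-compatible endpoint and let the model stop naturally. The prompt, port, and helper name are illustrative; also note the wall-clock tok/s below includes prompt processing, so for pure decode speed read the eval timings llama-server logs per request.

import time
import requests

def bench_once(prompt: str, port: int = 8080) -> float:
    t0 = time.time()
    r = requests.post(
        f"http://localhost:{port}/v1/chat/completions",
        json={
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 256,   # a cap, not a floor: EOS is honored
            "temperature": 0.0,  # greedy, so speculation is exactly lossless
        },
        timeout=600,
    )
    r.raise_for_status()
    n = r.json()["usage"]["completion_tokens"]
    return n / (time.time() - t0)

prompt = "Summarize the following document:\n..."  # fill in a real long document
bench_once(prompt)  # 1 warmup
print(f"{sum(bench_once(prompt) for _ in range(2)) / 2:.1f} tok/s over 2 runs")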
TL;DR: Five flags, 1.4-1.55x decode speedup at deep context on a 35B MoE. No new hardware, no quality tradeoff. Bigger gains on workloads with repetition (tool calls, code, reasoning).