r/Olares

Qwen3.6-27B DFlash on a 24GB RTX 5090 Laptop (sm_120) — 80 t/s avg via spiritbuun's buun-llama-cpp + Q8_0 GGUF drafter

Following the recent flurry of DFlash work (z-lab paper, Lucebox port, spiritbuun fork), I tried to reproduce it on consumer Blackwell mobile — a small home box with an RTX 5090 Laptop GPU (24GB GDDR7, 896 GB/s, sm_120).

TL;DR: 73.94 / 80.31 / 85.06 t/s on three Space Invaders generations (max_tokens=800), ~80 t/s avg. That's going from a catastrophic 0.97 t/s to 80 t/s in one week, thanks to spiritbuun's fix for my issue #35.

The journey (with timestamps)

  • 2026-04-28 — I publish a blog post titled "Why DFlash on Qwen3.6-27B doesn't fit on 24GB single GPU". Argument: the z-lab drafter is 6 GiB in BF16 and doesn't fit alongside the target model.
  • 2026-04-30 — spiritbuun/Qwen3.6-27B-DFlash-GGUF lands on HF. Q8_0 drafter at 1.75 GB. VRAM math suddenly works.
  • 2026-04-30 — I build spiritbuun/buun-llama-cpp for sm_120 (CUDA 13.1 + -DGGML_CUDA_NO_VMM=ON + -DCMAKE_CUDA_ARCHITECTURES=120 + libcuda.so.1 stub link). First bench: 3.4 → 1.5 → 0.97 t/s, degrading run over run. I file issue #35.
  • 2026-05-01 — spiritbuun replies: "I think this may be fixed now - can you repull and give it another try?"
  • 2026-05-04 — I rebuild with HEAD aecbbd5d (8 commits past my v0.1.0 build, notably cab1fb597 "dflash: add p_min confidence threshold + adaptive draft length"). Re-bench: 80 t/s avg.

Bench numbers

Run 1: 800 tok in 10.82s = 73.94 t/s
Run 2: 800 tok in  9.96s = 80.31 t/s
Run 3: 800 tok in  9.41s = 85.06 t/s

Comparison on the same hardware

Backend | Stack | t/s avg
llama.cpp standard | UD-Q4_K_XL, no spec | 33-36
vLLM Turbo | v0.20.0 + Sandermage Genesis + TurboQuant K8V4 + MTP n=3 | 88
buun-llama-cpp DFlash | HEAD aecbbd5d + Q8_0 GGUF drafter | 80
vLLM vanilla (different setup) | 0.19.1 + AutoRound INT4 + MTP n=3 | 99 (peak)

For context: Lucebox already published DFlash on RTX 3090 24GB at 78 t/s HumanEval / 70 t/s Math500 (sm_86 Ampere) using their custom engine + BF16 z-lab drafter. Today's Lucebox PR #86 reports 218 t/s on RTX 5090 desktop 32GB. So our 80 t/s on RTX 5090M 24GB sits right between Lucebox 3090 and Lucebox 5090 desktop, on a different stack (buun fork instead of Lucebox custom).

What's actually new

  • First public DFlash result via buun-llama-cpp on sm_120 mobile (Lucebox path uses their own engine; Lucebox 5090 desktop on PR #86 used a custom build, not buun)
  • First reproduction confirming the cab1fb597 perf fix on real 24GB consumer hardware (was untested before)
  • Stack uses Q8_0 quantized drafter (not BF16) — frees enough VRAM that the math just works, no compromises elsewhere

The recipe

Image: built from spiritbuun/buun-llama-cpp master HEAD with:

cmake -B build \
  -DGGML_CUDA=ON -DGGML_CUDA_NO_VMM=ON \
  -DCMAKE_CUDA_ARCHITECTURES=120 \
  -DCMAKE_EXE_LINKER_FLAGS="-Wl,-rpath-link,/usr/local/cuda/lib64/stubs" \
  -DCMAKE_SHARED_LINKER_FLAGS="-Wl,-rpath-link,/usr/local/cuda/lib64/stubs" \
  -DCMAKE_BUILD_TYPE=Release && \
cmake --build build --target llama-server -j$(nproc)
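
The "libcuda.so.1 stub link" mentioned in the journey above covers the case where the build box (or container) has the CUDA toolkit but no NVIDIA driver installed. A minimal sketch of one way to do it, assuming the default toolkit path; the real driver library is picked up normally at runtime on the target machine:

# assumption: CUDA toolkit at /usr/local/cuda, no NVIDIA driver in the build env
# the toolkit ships a driver stub only as libcuda.so; some link steps ask for libcuda.so.1
ln -s /usr/local/cuda/lib64/stubs/libcuda.so /usr/local/cuda/lib64/stubs/libcuda.so.1
# the -rpath-link stub flags in the cmake invocation above then resolve it at link time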

llama-server args:

--model unsloth/Qwen3.6-27B-Q4_K_M.gguf
--model-draft spiritbuun/dflash-draft-3.6-q8_0.gguf
--spec-type dflash
--n-gpu-layers 99 --n-gpu-layers-draft 99
--ctx-size 32000 --ctx-size-draft 256
--batch-size 256 --ubatch-size 64
--parallel 1 --flash-attn on --jinja
--chat-template-kwargs '{"enable_thinking": false}'

Important: disable thinking (enable_thinking: false). spiritbuun's README notes the drafter wasn't trained on the think-wrapped distribution — leaving thinking on collapses the acceptance rate and costs roughly 1.8× in throughput.
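
Quick way to sanity-check that the server is up and that thinking is really off, assuming the default llama-server port (8080) and its OpenAI-compatible endpoint; per-run t/s can then be read from the timings the server prints after each generation:

# hypothetical smoke test; adjust host/port if you changed --host/--port
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Write a Space Invaders clone in one HTML file."}],
       "max_tokens": 800, "temperature": 0.6}' \
  | grep -c "<think>"   # prints 0 when enable_thinking is actually off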

Things I haven't tried yet that should push this past 100 t/s

  • DDTree budget tuning (Lucebox uses 22 for 218 t/s on desktop 5090, default in buun likely sub-optimal)
  • --no-fused-gdn ON vs OFF — recent buun commit 905483277 added this debug flag
  • p_min adaptive draft length sweep
  • Pushing context to 64-80K (32K is conservative)

Bonus: PFlash also lands today

While I was writing this up, u/sandropuppo posted PFlash — speculative prefill, complementary to DFlash decode. 10× faster TTFT at 128K on RTX 3090. The pflash/ dir was merged into Lucebox-hub main today. Combining DFlash decode (this post) + PFlash prefill on consumer 24GB Blackwell would close the long-context UX gap completely. Next bench session.

Worth noting: llama.cpp MTP also entered beta today

Same day, u/ilintar posted that llama.cpp MTP is in beta thanks to am17an's PR #22673, tested on Qwen3.6 27B + Qwen3.6 35B-A3B with 75% acceptance at 3 draft tokens and a 2× speedup over baseline. It depends on the partial seq_rm for GDN work (PR #22400) that we needed for hybrid spec decoding. So llama.cpp now has BOTH MTP (PR #22673) AND DFlash (this post, via the buun fork) paths — feature parity with vLLM is closing fast.

Credits

  • spiritbuun for the fork + the Q8_0 drafter + the 24h fix turnaround
  • z-lab/dflash for the block-diffusion method
  • Lucebox for proving the 24GB consumer DFlash path on RTX 3090 first
  • unsloth for the Qwen3.6-27B Q4_K_M GGUF target

Full write-up with timestamps and all the iteration mistakes: https://airelien.dev/en/posts/dflash-27b-24gb-debloque/ (EN, FR also at /fr/posts/).

If anyone with a 5090M / 4080M / 3090 24GB wants to reproduce this, I'd love to see your numbers.

u/aurelienams — 10 days ago

First sm_120 BeeLlama.cpp benchmark on consumer Blackwell mobile: 107 t/s at FULL 262K context on Qwen3.6 27B (+48% vs MTP, +22% vs vLLM Genesis)

Saw the BeeLlama.cpp post here last week claiming 135 t/s on Qwen3.6 27B Q5 + vision + 200K context on a single RTX 3090. Sounded too good. My best Qwen3.6 27B path on Olares One (RTX 5090M Laptop, 24GB GDDR7, 896 GB/s, sm_120 Blackwell consumer mobile, Core Ultra 9 275HX, 96GB DDR5) was 88 t/s on vLLM + Genesis 28-patch + MTP n=3, or 72.75 t/s at FULL 262K on llama.cpp + MTP.

Built BeeLlama from source for sm_120, tested it. The post wasn't cherry-picked.

TL;DR: 107.54 t/s AVG (10 clean runs, range 101.70-119.38) at FULL 262K context. Zero CUDA OOM. Zero degradation cycle. New strict best Qwen3.6 27B path on consumer Blackwell — fastest AND longest context in one stack.

Stack

  • Custom image: aamsellem/beellama-cpp:0.1.1 (amd64 + CUDA 13 + sm_120, built from Anbeeld/beellama.cpp v0.1.1)
  • Target: unsloth/Qwen3.6-27B-GGUF, UD-Q3_K_XL (14.5 GB, NOT the MTP-baked variant — BeeLlama uses DFlash spec decoding, not MTP)
  • Drafter: spiritbuun/Qwen3.6-27B-DFlash-GGUF, dflash-draft-3.6-q8_0.gguf (1.85 GB)
  • KV cache: turbo3 (3-bit Walsh-Hadamard rotated, ~25% smaller than q4_0)
  • Spec: --spec-type dflash --spec-dflash-cross-ctx 1024
  • Batch: 2048, ubatch: 256, flash-attn on, mlock, no-mmap
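
Pulling those bullets into a single launch line, roughly what I run (a sketch, not a verbatim copy of my Helm values: model filenames are shortened, full GPU offload is assumed, and I'm assuming turbo3 is selected through the usual --cache-type-k/--cache-type-v flags in this fork):

llama-server \
  --model Qwen3.6-27B-UD-Q3_K_XL.gguf \
  --model-draft dflash-draft-3.6-q8_0.gguf \
  --spec-type dflash --spec-dflash-cross-ctx 1024 \
  --ctx-size 262144 \
  --cache-type-k turbo3 --cache-type-v turbo3 \
  --batch-size 2048 --ubatch-size 256 \
  --n-gpu-layers 99 --n-gpu-layers-draft 99 \
  --flash-attn on --mlock --no-mmap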

Methodology

Space Invaders HTML prompt, 2000 tokens, temp 0.6 / top_k 20 / min_p 0.0. 2 warmups + 10 measured runs at each context size.
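
The harness is nothing fancy; a minimal sketch of the measurement loop, assuming the fork keeps upstream llama-server's /completion endpoint and its timings field (predicted_per_second is what I report as t/s), plus a hypothetical prompt file:

# 2 warmups + 10 measured runs; needs curl and jq
PROMPT=$(cat space_invaders_prompt.txt)
for i in $(seq 1 12); do
  TPS=$(curl -s http://127.0.0.1:8080/completion \
    -H "Content-Type: application/json" \
    -d "$(jq -n --arg p "$PROMPT" \
          '{prompt:$p, n_predict:2000, temperature:0.6, top_k:20, min_p:0.0}')" \
    | jq '.timings.predicted_per_second')
  if [ "$i" -le 2 ]; then echo "warmup $i: $TPS t/s"; else echo "run $((i-2)): $TPS t/s"; fi
done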

Context sweep on RTX 5090M

Context | Runs | AVG t/s | Range | KV cache (turbo3)
96K | 10 | 106.67 | 97.84-115.36 | ~3 GB
128K | 5 | 116.0 | 107.12-127.32 | ~4 GB
200K | 5 | 108.5 | 100.51-122.82 | ~6 GB
262K (full native) | 10 | 107.54 | 101.70-119.38 | ~8 GB

Perf is essentially flat across context sizes. turbo3 KV scales gracefully — even at 262K full native the stack fits in 24 GB with headroom. No 5-fast/4-slow degradation cycle like the one I posted about with Gemma 4 DFlash on vLLM last week.

The 128K sweet spot is real and reproducible. Best guess is cudagraph capture sizes aligning with prefill chunks at exactly that range.

Comparison vs my other Qwen3.6 27B paths on the same hardware

Path | Context | t/s | Stack
BeeLlama (this) | 262K FULL | 107.54 | llama.cpp fork + DFlash + turbo3 KV
vLLM Genesis Turbo | 88K | 88 | vLLM + 28 patches + MTP n=3 + TurboQuant K8V4
buun-DFlash | 96K | 76 | llama.cpp + DFlash (no MTP claim, no CopySpec)
llama.cpp MTP | 262K FULL | 72.75 | am17an MTP branch + unsloth UD-Q3_K_XL + q4_0 KV

  • +48% vs MTP at the same full 262K context and target quant
  • +22% vs vLLM Genesis Turbo, which ran at 1/3 the context
  • +40% vs buun-DFlash, which also ran at a smaller context

Fork chain (for context)

ggml-org/llama.cpp → TheTom/llama-cpp-turboquant (turbo2/3/4 KV) → spiritbuun/buun-llama-cpp (DFlash for Qwen 3.6) → Anbeeld/beellama.cpp (MTP claim, CopySpec, reasoning-loop protection)

None of these forks publish a Linux Docker image for sm_120. The build via docker buildx --platform linux/amd64 --build-arg CUDA_DOCKER_ARCH=120 from an M-series Mac took ~50 min through qemu emulation. Image is 2.67 GB, on Docker Hub as aamsellem/beellama-cpp:0.1.1.
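
For reference, the cross-build from the Mac boils down to this (a sketch: the Dockerfile path is an assumption based on beellama.cpp keeping llama.cpp's .devops layout, and the tag is mine):

# cross-build the sm_120 image from arm64 via qemu; took ~50 min here
docker buildx build \
  --platform linux/amd64 \
  --build-arg CUDA_DOCKER_ARCH=120 \
  -f .devops/cuda.Dockerfile \
  -t aamsellem/beellama-cpp:0.1.1 \
  --push .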

Why it wins over MTP at the same 262K (analysis, not certainty)

Three combined factors:

  1. DFlash drafter vs MTP head: spiritbuun's q8_0 DFlash drafter for Qwen 3.6 was specifically tuned by z-lab on Qwen 3.6's output distribution. Higher acceptance rate than the MTP head baked into havenoammo's GGUF.
  2. turbo3 vs q4_0 KV: ~25% smaller → more compute buffer headroom → bigger batch.
  3. batch 2048 / ubatch 256 vs 512/512: more prefill packing per scheduler cycle.

I haven't isolated which of the three contributes the most yet — that's the next bench.

Gotchas

  • If you have a havenoammo/Qwen3.6-27B-MTP-UD-GGUF cached, BeeLlama refuses to load it: done_getting_tensors: wrong number of tensors; expected 866, got 862. The MTP head bakes in 4 extra tensors that BeeLlama's loader doesn't recognize. Use the non-MTP unsloth variant.
  • Multi-GPU broken in this fork (issue #7). Single-GPU only.
  • BeeLlama hasn't synced with upstream master since April 23, so it won't pick up new llama.cpp builds (b9130+) until Anbeeld rebases.
  • No Genesis 28-patch maintenance burden, but you do depend on Anbeeld maintaining the fork.

Reproducible

Helm chart, exact image tag, all flags, bench harness: https://github.com/aamsellem/olares-one-market/tree/main/llamacppqwen36beellamaone (v1.0.1).
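
If you just want it running, the chart installs straight from a clone of the repo (sketch; the release name is a placeholder and I'm assuming the chart's default values target a single 24GB sm_120 GPU):

git clone https://github.com/aamsellem/olares-one-market.git
helm install beellama-qwen36 ./olares-one-market/llamacppqwen36beellamaone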

If you run a different sm_120 card (5070 Ti, 5080, 5090 desktop, 5090M), the aamsellem/beellama-cpp:0.1.1 image should work as-is. 5090 desktop with 32GB and 1.79 TB/s should land around 150-180 t/s if my mobile-to-desktop bandwidth scaling holds — let me know your numbers.


Hardware: Olares One (RTX 5090M Laptop, 24GB, sm_120 Blackwell mobile)
Image: aamsellem/beellama-cpp:0.1.1 (custom build, source: https://github.com/Anbeeld/beellama.cpp v0.1.1)
Helm chart: https://github.com/aamsellem/olares-one-market/tree/main/llamacppqwen36beellamaone
Full blog writeup: https://airelien.dev/en/posts/beellama-cpp-262k-blackwell-mobile/

u/aurelienams — 13 hours ago