r/Olares

Qwen3.6-27B DFlash on a 24GB RTX 5090 Laptop (sm_120) — 80 t/s avg via spiritbuun's buun-llama-cpp + Q8_0 GGUF drafter

Following the recent flurry of DFlash work (z-lab paper, Lucebox port, spiritbuun fork), I tried to reproduce it on consumer Blackwell mobile — a small home box with an RTX 5090 Laptop GPU (24GB GDDR7, 896 GB/s, sm_120).

TL;DR: 73.94 / 80.31 / 85.06 t/s on three Space Invaders generations (max_tokens=800), ~80 t/s avg. That's going from a catastrophic 0.97 t/s to 80 t/s in one week, thanks to spiritbuun's fix for my issue #35.

The journey (with timestamps)

  • 2026-04-28 — I publish a blog post titled "Why DFlash on Qwen3.6-27B doesn't fit on 24GB single GPU". Argument: the z-lab drafter is 6 GiB in BF16 and doesn't fit alongside the target model.
  • 2026-04-30 — spiritbuun/Qwen3.6-27B-DFlash-GGUF lands on HF. Q8_0 drafter at 1.75 GB. VRAM math suddenly works.
  • 2026-04-30 — I build spiritbuun/buun-llama-cpp for sm_120 (CUDA 13.1 + -DGGML_CUDA_NO_VMM=ON + -DCMAKE_CUDA_ARCHITECTURES=120 + libcuda.so.1 stub link). First bench: 3.4 → 1.5 → 0.97 t/s, degrading run over run. I file issue #35.
  • 2026-05-01 — spiritbuun replies: "I think this may be fixed now - can you repull and give it another try?"
  • 2026-05-04 — I rebuild with HEAD aecbbd5d (8 commits past my v0.1.0 build, notably cab1fb597 "dflash: add p_min confidence threshold + adaptive draft length"). Re-bench: 80 t/s avg.

Bench numbers

Run 1: 800 tok in 10.82s = 73.94 t/s
Run 2: 800 tok in  9.96s = 80.31 t/s
Run 3: 800 tok in  9.41s = 85.06 t/s

Comparison on the same hardware

Backend | Stack | t/s avg
llama.cpp standard | UD-Q4_K_XL, no spec | 33-36
vLLM Turbo | v0.20.0 + Sandermage Genesis + TurboQuant K8V4 + MTP n=3 | 88
buun-llama-cpp DFlash | HEAD aecbbd5d + Q8_0 GGUF drafter | 80
vLLM vanilla (different setup) | 0.19.1 + AutoRound INT4 + MTP n=3 | 99 (peak)

For context: Lucebox already published DFlash on RTX 3090 24GB at 78 t/s HumanEval / 70 t/s Math500 (sm_86 Ampere) using their custom engine + BF16 z-lab drafter. Today's Lucebox PR #86 reports 218 t/s on RTX 5090 desktop 32GB. So our 80 t/s on RTX 5090M 24GB sits right between Lucebox 3090 and Lucebox 5090 desktop, on a different stack (buun fork instead of Lucebox custom).

What's actually new

  • First public DFlash result via buun-llama-cpp on sm_120 mobile (Lucebox path uses their own engine; Lucebox 5090 desktop on PR #86 used a custom build, not buun)
  • First reproduction confirming the cab1fb597 perf fix on real 24GB consumer hardware (was untested before)
  • Stack uses Q8_0 quantized drafter (not BF16) — frees enough VRAM that the math just works, no compromises elsewhere

The recipe

Image: built from spiritbuun/buun-llama-cpp master HEAD with:

cmake -B build \
  -DGGML_CUDA=ON -DGGML_CUDA_NO_VMM=ON \
  -DCMAKE_CUDA_ARCHITECTURES=120 \
  -DCMAKE_EXE_LINKER_FLAGS="-Wl,-rpath-link,/usr/local/cuda/lib64/stubs" \
  -DCMAKE_SHARED_LINKER_FLAGS="-Wl,-rpath-link,/usr/local/cuda/lib64/stubs" \
  -DCMAKE_BUILD_TYPE=Release && \
cmake --build build --target llama-server -j$(nproc)
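
The "libcuda.so.1 stub link" mentioned in the journey above covers the case where the build box (or container) has the CUDA toolkit but no NVIDIA driver installed. A minimal sketch of one way to do it, assuming the default toolkit path; the real driver library is picked up normally at runtime on the target machine:

# assumption: CUDA toolkit at /usr/local/cuda, no NVIDIA driver in the build env
# the toolkit ships a driver stub only as libcuda.so; some link steps ask for libcuda.so.1
ln -s /usr/local/cuda/lib64/stubs/libcuda.so /usr/local/cuda/lib64/stubs/libcuda.so.1
# the -rpath-link stub flags in the cmake invocation above then resolve it at link time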

llama-server args:

--model unsloth/Qwen3.6-27B-Q4_K_M.gguf
--model-draft spiritbuun/dflash-draft-3.6-q8_0.gguf
--spec-type dflash
--n-gpu-layers 99 --n-gpu-layers-draft 99
--ctx-size 32000 --ctx-size-draft 256
--batch-size 256 --ubatch-size 64
--parallel 1 --flash-attn on --jinja
--chat-template-kwargs '{"enable_thinking": false}'

Important: disable thinking (enable_thinking: false). spiritbuun's README notes the drafter wasn't trained on the think-wrapped distribution — leaving thinking on collapses the acceptance rate and costs roughly 1.8× in throughput.
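
Quick way to sanity-check that the server is up and that thinking is really off, assuming the default llama-server port (8080) and its OpenAI-compatible endpoint; per-run t/s can then be read from the timings the server prints after each generation:

# hypothetical smoke test; adjust host/port if you changed --host/--port
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Write a Space Invaders clone in one HTML file."}],
       "max_tokens": 800, "temperature": 0.6}' \
  | grep -c "<think>"   # prints 0 when enable_thinking is actually off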

Things I haven't tried yet that should push this past 100 t/s

  • DDTree budget tuning (Lucebox uses 22 for 218 t/s on desktop 5090, default in buun likely sub-optimal)
  • --no-fused-gdn ON vs OFF — recent buun commit 905483277 added this debug flag
  • p_min adaptive draft length sweep
  • Pushing context to 64-80K (32K is conservative)

Bonus: PFlash also lands today

While I was writing this up, u/sandropuppo posted PFlash — speculative prefill, complementary to DFlash decode. 10× faster TTFT at 128K on RTX 3090. The pflash/ dir was merged into Lucebox-hub main today. Combining DFlash decode (this post) + PFlash prefill on consumer 24GB Blackwell would close the long-context UX gap completely. Next bench session.

Worth noting: llama.cpp MTP also entered beta today

Same day, u/ilintar posted that llama.cpp MTP is in beta thanks to am17an's PR #22673, tested on Qwen3.6 27B + Qwen3.6 35B-A3B with 75% acceptance at 3 draft tokens and a 2× speedup over baseline. It depends on the partial seq_rm for GDN work (PR #22400) that we needed for hybrid spec decoding. So llama.cpp now has BOTH MTP (PR #22673) AND DFlash (this post, via the buun fork) paths — feature parity with vLLM is closing fast.

Credits

  • spiritbuun for the fork + the Q8_0 drafter + the 24h fix turnaround
  • z-lab/dflash for the block-diffusion method
  • Lucebox for proving the 24GB consumer DFlash path on RTX 3090 first
  • unsloth for the Qwen3.6-27B Q4_K_M GGUF target

Full write-up with timestamps and all the iteration mistakes: https://airelien.dev/en/posts/dflash-27b-24gb-debloque/ (EN, FR also at /fr/posts/).

If anyone with a 5090M / 4080M / 3090 24GB wants to reproduce this, I'd love to see your numbers.

u/aurelienams — 10 days ago

First sm_120 BeeLlama.cpp benchmark on consumer Blackwell mobile: 107 t/s at FULL 262K context on Qwen3.6 27B (+48% vs MTP, +22% vs vLLM Genesis)

Saw the BeeLlama.cpp post here last week claiming 135 t/s on Qwen3.6 27B Q5 + vision + 200K context on a single RTX 3090. Sounded too good. My best Qwen3.6 27B path on Olares One (RTX 5090M Laptop, 24GB GDDR7, 896 GB/s, sm_120 Blackwell consumer mobile, Core Ultra 9 275HX, 96GB DDR5) was 88 t/s on vLLM + Genesis 28-patch + MTP n=3, or 72.75 t/s at FULL 262K on llama.cpp + MTP.

Built BeeLlama from source for sm_120, tested it. The post wasn't cherry-picked.

TL;DR: 107.54 t/s AVG (10 clean runs, range 101.70-119.38) at FULL 262K context. Zero CUDA OOM. Zero degradation cycle. New strict best Qwen3.6 27B path on consumer Blackwell — fastest AND longest context in one stack.

Stack

  • Custom image: aamsellem/beellama-cpp:0.1.1 (amd64 + CUDA 13 + sm_120, built from Anbeeld/beellama.cpp v0.1.1)
  • Target: unsloth/Qwen3.6-27B-GGUF, UD-Q3_K_XL (14.5 GB, NOT the MTP-baked variant — BeeLlama uses DFlash spec decoding, not MTP)
  • Drafter: spiritbuun/Qwen3.6-27B-DFlash-GGUF, dflash-draft-3.6-q8_0.gguf (1.85 GB)
  • KV cache: turbo3 (3-bit Walsh-Hadamard rotated, ~25% smaller than q4_0)
  • Spec: --spec-type dflash --spec-dflash-cross-ctx 1024
  • Batch: 2048, ubatch: 256, flash-attn on, mlock, no-mmap
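
Pulling those bullets into a single launch line, roughly what I run (a sketch, not a verbatim copy of my Helm values: model filenames are shortened, full GPU offload is assumed, and I'm assuming turbo3 is selected through the usual --cache-type-k/--cache-type-v flags in this fork):

llama-server \
  --model Qwen3.6-27B-UD-Q3_K_XL.gguf \
  --model-draft dflash-draft-3.6-q8_0.gguf \
  --spec-type dflash --spec-dflash-cross-ctx 1024 \
  --ctx-size 262144 \
  --cache-type-k turbo3 --cache-type-v turbo3 \
  --batch-size 2048 --ubatch-size 256 \
  --n-gpu-layers 99 --n-gpu-layers-draft 99 \
  --flash-attn on --mlock --no-mmap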

Methodology

Space Invaders HTML prompt, 2000 tokens, temp 0.6 / top_k 20 / min_p 0.0. 2 warmups + 10 measured runs at each context size.
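
The harness is nothing fancy; a minimal sketch of the measurement loop, assuming the fork keeps upstream llama-server's /completion endpoint and its timings field (predicted_per_second is what I report as t/s), plus a hypothetical prompt file:

# 2 warmups + 10 measured runs; needs curl and jq
PROMPT=$(cat space_invaders_prompt.txt)
for i in $(seq 1 12); do
  TPS=$(curl -s http://127.0.0.1:8080/completion \
    -H "Content-Type: application/json" \
    -d "$(jq -n --arg p "$PROMPT" \
          '{prompt:$p, n_predict:2000, temperature:0.6, top_k:20, min_p:0.0}')" \
    | jq '.timings.predicted_per_second')
  if [ "$i" -le 2 ]; then echo "warmup $i: $TPS t/s"; else echo "run $((i-2)): $TPS t/s"; fi
done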

Context sweep on RTX 5090M

Context | Runs | AVG t/s | Range | KV cache (turbo3)
96K | 10 | 106.67 | 97.84-115.36 | ~3 GB
128K | 5 | 116.0 | 107.12-127.32 | ~4 GB
200K | 5 | 108.5 | 100.51-122.82 | ~6 GB
262K (full native) | 10 | 107.54 | 101.70-119.38 | ~8 GB

Perf is essentially flat across context sizes. turbo3 KV scales gracefully — even at 262K full native the stack fits in 24 GB with headroom. No 5-fast/4-slow degradation cycle like the one I posted about with Gemma 4 DFlash on vLLM last week.

The 128K sweet spot is real and reproducible. Best guess is cudagraph capture sizes aligning with prefill chunks at exactly that range.

Comparison vs my other Qwen3.6 27B paths on the same hardware

Path | Context | t/s | Stack
BeeLlama (this) | 262K FULL | 107.54 | llama.cpp fork + DFlash + turbo3 KV
vLLM Genesis Turbo | 88K | 88 | vLLM + 28 patches + MTP n=3 + TurboQuant K8V4
buun-DFlash | 96K | 76 | llama.cpp + DFlash (no MTP claim, no CopySpec)
llama.cpp MTP | 262K FULL | 72.75 | am17an MTP branch + unsloth UD-Q3_K_XL + q4_0 KV

  • +48% vs MTP at the same full 262K context and target quant
  • +22% vs vLLM Genesis Turbo, which ran at 1/3 the context
  • +40% vs buun-DFlash, which also ran at a smaller context

Fork chain (for context)

ggml-org/llama.cpp → TheTom/llama-cpp-turboquant (turbo2/3/4 KV) → spiritbuun/buun-llama-cpp (DFlash for Qwen 3.6) → Anbeeld/beellama.cpp (MTP claim, CopySpec, reasoning-loop protection)

None of these forks publish a Linux Docker image for sm_120. The build via docker buildx --platform linux/amd64 --build-arg CUDA_DOCKER_ARCH=120 from an M-series Mac took ~50 min through qemu emulation. Image is 2.67 GB, on Docker Hub as aamsellem/beellama-cpp:0.1.1.
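
For reference, the cross-build from the Mac boils down to this (a sketch: the Dockerfile path is an assumption based on beellama.cpp keeping llama.cpp's .devops layout, and the tag is mine):

# cross-build the sm_120 image from arm64 via qemu; took ~50 min here
docker buildx build \
  --platform linux/amd64 \
  --build-arg CUDA_DOCKER_ARCH=120 \
  -f .devops/cuda.Dockerfile \
  -t aamsellem/beellama-cpp:0.1.1 \
  --push .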

Why it wins over MTP at the same 262K (analysis, not certainty)

Three combined factors:

  1. DFlash drafter vs MTP head: spiritbuun's q8_0 DFlash drafter for Qwen 3.6 was specifically tuned by z-lab on Qwen 3.6's output distribution. Higher acceptance rate than the MTP head baked into havenoammo's GGUF.
  2. turbo3 vs q4_0 KV: ~25% smaller → more compute buffer headroom → bigger batch.
  3. batch 2048 / ubatch 256 vs 512/512: more prefill packing per scheduler cycle.

I haven't isolated which of the three contributes the most yet — that's the next bench.

Gotchas

  • If you have a havenoammo/Qwen3.6-27B-MTP-UD-GGUF cached, BeeLlama refuses to load it: done_getting_tensors: wrong number of tensors; expected 866, got 862. The MTP head bakes in 4 extra tensors that BeeLlama's loader doesn't recognize. Use the non-MTP unsloth variant.
  • Multi-GPU broken in this fork (issue #7). Single-GPU only.
  • BeeLlama hasn't synced with upstream master since April 23, so it won't pick up new llama.cpp builds (b9130+) until Anbeeld rebases.
  • No Genesis 28-patch maintenance burden, but you do depend on Anbeeld maintaining the fork.

Reproducible

Helm chart, exact image tag, all flags, bench harness: https://github.com/aamsellem/olares-one-market/tree/main/llamacppqwen36beellamaone (v1.0.1).
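
If you just want it running, the chart installs straight from a clone of the repo (sketch; the release name is a placeholder and I'm assuming the chart's default values target a single 24GB sm_120 GPU):

git clone https://github.com/aamsellem/olares-one-market.git
helm install beellama-qwen36 ./olares-one-market/llamacppqwen36beellamaone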

If you run a different sm_120 card (5070 Ti, 5080, 5090 desktop, 5090M), the aamsellem/beellama-cpp:0.1.1 image should work as-is. 5090 desktop with 32GB and 1.79 TB/s should land around 150-180 t/s if my mobile-to-desktop bandwidth scaling holds — let me know your numbers.


Hardware: Olares One (RTX 5090M Laptop, 24GB, sm_120 Blackwell mobile)
Image: aamsellem/beellama-cpp:0.1.1 (custom build, source: https://github.com/Anbeeld/beellama.cpp v0.1.1)
Helm chart: https://github.com/aamsellem/olares-one-market/tree/main/llamacppqwen36beellamaone
Full blog writeup: https://airelien.dev/en/posts/beellama-cpp-262k-blackwell-mobile/

u/aurelienams — 13 hours ago