u/sandropuppo

Luce DFlash + PFlash on AMD Strix Halo: Qwen3.6-27B at 2.23x decode and 3.05x prefill vs llama.cpp HIP

Hey fellow Llamas, keeping it short.

We just shipped DFlash and PFlash support for the AMD Ryzen AI MAX+ 395 iGPU (gfx1151, Strix Halo, 128 GiB unified memory). Same Luce DFlash stack from the RTX 3090 post a couple weeks back, now running on the consumer AMD APU class.

Repo: https://github.com/Luce-Org/lucebox-hub (MIT)

TL;DR

End-to-end on Qwen3.6-27B Q4_K_M with the Luce Q8_0 DFlash drafter: 26.85 tok/s decode and 20.2 s prefill at 16K context.

That is 2.23x faster decode and 3.05x faster prefill than llama.cpp HIP on the same silicon. At a 16K prompt + 1K generation workload, total wall clock drops from 147 s to 58 s, 2.5x faster end to end.
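
As a sanity check, the wall-clock figures follow almost exactly from the per-stage numbers in the tables below. A back-of-the-envelope reconstruction in Python (illustrative arithmetic only; it ignores sampling and scheduling overhead):

python

# Rough wall-clock reconstruction from the per-stage numbers reported below.
# Ignores sampling/scheduling overhead, so expect a few seconds of slack.
prefill_s    = {"llama.cpp HIP": 61.69, "Luce PFlash": 20.2}    # 16K-token prompt
decode_tok_s = {"llama.cpp HIP": 12.02, "Luce DFlash": 26.85}   # sustained decode
n_gen = 1000

llama_total = prefill_s["llama.cpp HIP"] + n_gen / decode_tok_s["llama.cpp HIP"]
luce_total  = prefill_s["Luce PFlash"]   + n_gen / decode_tok_s["Luce DFlash"]
print(f"llama.cpp ~{llama_total:.0f} s, Luce ~{luce_total:.0f} s, "
      f"{llama_total / luce_total:.1f}x")   # ~145 s vs ~57 s, 2.5x (reported: 147 s vs 58 s)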

The same 128 GiB box hosts checkpoints up to ~100 GiB, a class of models a 24 GiB consumer GPU cannot touch (Qwen3.5-122B-A10B, MiniMax-M2.7-REAP 139B-A10B, full BF16 27B).

The numbers

Hardware: Ryzen AI MAX+ 395, Radeon 8060S iGPU (gfx1151), 128 GiB LPDDR5X-8000, ROCm 7.2.2
Target: Qwen3.6-27B Q4_K_M (15.65 GiB)
Drafter: Lucebox/Qwen3.6-27B-DFlash-GGUF Q8_0 with DFLASH27B_DRAFT_SWA=2048
Bench: 10-prompt HumanEval-style, --n-gen 128 --ddtree-budget 22 --fast-rollback

Decode (Qwen3.6-27B Q4_K_M, tok/s):

Engine                  tok/s   vs AR
llama.cpp HIP AR        12.02   1.00x
llama.cpp Vulkan AR     12.45   1.04x
Luce DFlash (this PR)   26.85   2.23x

Prefill (Qwen3.6-27B, 16K tokens):

Engine             TTFT      vs AR
llama.cpp HIP AR   61.69 s   1.00x
Luce PFlash        20.2 s    3.05x

Speedup grows with context: PFlash compress is O(S), AR prefill is O(S^2). NIAH retrieval still passes at 16K.

Tuning note: --ddtree-budget=22 is the gfx1151 optimum. Higher budgets accept more tokens per step, but each step gets more expensive on LPDDR5X; bandwidth caps the benefit before tile utilization pays off. Contrast with gfx1100 (7900 XTX, GDDR6 at 936 GB/s), where budget=8 wins because tile waste matters more than launch amortization. The shipped default is arch-aware.
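
For what "arch-aware" means in practice, a minimal sketch (the dict and function below are illustrative only, not the actual lucebox-hub API; the fallback value is a guess):

python

# Hypothetical sketch of an arch-aware --ddtree-budget default.
DEFAULT_DDTREE_BUDGET = {
    "gfx1151": 22,  # Strix Halo: LPDDR5X bandwidth-bound, larger tree amortizes launches
    "gfx1100": 8,   # 7900 XTX: 936 GB/s GDDR6, tile waste dominates, keep the tree small
}

def pick_ddtree_budget(gfx_arch, override=None):
    """Return the tree budget for the detected GPU unless the user overrides it."""
    if override is not None:
        return override
    return DEFAULT_DDTREE_BUDGET.get(gfx_arch, 16)  # fallback value is a placeholder

print(pick_ddtree_budget("gfx1151"))  # 22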

Reproduce

bash

# 1. Build PR #119 for gfx1151
git clone https://github.com/Luce-Org/lucebox-hub.git
cd lucebox-hub
git fetch origin pull/119/head:pr119 && git checkout pr119
git submodule update --init --recursive
cd dflash
cmake -B build -S . \
  -DCMAKE_BUILD_TYPE=Release \
  -DDFLASH27B_GPU_BACKEND=hip \
  -DDFLASH27B_HIP_ARCHITECTURES=gfx1151 \
  -DDFLASH27B_HIP_SM80_EQUIV=ON
cmake --build build --target test_dflash -j

# 2. Models: Qwen3.6-27B target + Lucebox Q8_0 DFlash drafter
mkdir -p models/draft
hf download unsloth/Qwen3.6-27B-GGUF Qwen3.6-27B-Q4_K_M.gguf --local-dir models/
hf download Lucebox/Qwen3.6-27B-DFlash-GGUF dflash-draft-3.6-q8_0.gguf --local-dir models/draft/

# 3. Bench (DFlash decode + PFlash long-context prefill)
LD_LIBRARY_PATH=/opt/rocm/lib:$LD_LIBRARY_PATH \
DFLASH_BIN=$PWD/build/test_dflash \
DFLASH_TARGET=$PWD/models/Qwen3.6-27B-Q4_K_M.gguf \
DFLASH_DRAFT=$PWD/models/draft/dflash-draft-3.6-q8_0.gguf \
DFLASH27B_DRAFT_SWA=2048 \
DFLASH27B_PREFILL_UBATCH=512 \
  python3 scripts/bench_he.py --n-gen 128 --ddtree-budget 22

DFLASH27B_PREFILL_UBATCH=512 applies the PR #159 fix on top of PR #119. Once #159 merges, this is the daemon default.

What is still missing

  • BSA scoring kernel on HIP. The drafter compress-score path uses BSA (block-sparse attention) on CUDA. PR #119 disables it on HIP and falls back to ggml's flash_attn_ext, which the daemon's own warning flags as ~3.4x slower. A rocWMMA-native sparse-FA kernel closes the gap. After it lands, PFlash TTFT at 16K drops from 27.6 s to roughly 8 s. At 128K, projected 7-10x over llama.cpp AR.
  • Multi-row q4_K decode GEMV. RDNA-native multi-row pattern (R=4-8 output rows sharing activation register state) for the drafter forward, currently 30% of compress time at long context.
  • Phase 2 tile shape tuning for gfx1151. Current rocWMMA flashprefill tiles are tuned for gfx1100. Strix Halo has different LDS and VGPR characteristics.
  • 70B+ MoE targets. 128 GiB headroom is wasted on a 27B. Qwen3.5-122B-A10B and MiniMax-M2.7-REAP 139B-A10B both fit. DFlash math ports cleanly to MoE; big work is wiring the expert-routed forward into the spec verify loop.

Constraints

ROCm 7.2.2+, gfx1151 tuned (gfx1100 also supported with arch-aware defaults), greedy verify only, no Vulkan / Metal / multi-GPU on this path yet.

We're working hard on this, and we know there's still plenty to improve.

Feedback is more than welcome :)

u/sandropuppo — 1 day ago

Hey fellow Llamas, thank you for all the nice words and great feedback on the last post I made. We have something new we thought would be useful to share. As always your time is precious, so I'll keep it short.

We built speculative prefill for long-context prompts on quantized 27B targets, C++/CUDA only. A small drafter loaded in-process scores token importance over the full prompt; the heavy target only prefills the spans that matter.

Repo: github.com/Luce-Org/lucebox-hub (open source, MIT).

Head-to-head on Qwen3.6-27B Q4_K_M, RTX 3090, single-shot: 24.8 s TTFT vs ~257 s for vanilla llama.cpp = ~10.4× at 128K (and 13.5 s vs 134.95 s = 10.0× at 64K), with NIAH retrieval preserved end-to-end. No Python, no Triton, no PyTorch in the inference loop.

The problem

Q4_K_M Qwen3.6-27B on a 24 GB 3090 decodes fast (~74 tok/s with DFlash spec decode), but prefill scales O(S²). On a 131K-token prompt, vanilla llama.cpp takes 248.4 s cold (llama-bench pp131072 --no-warmup -r 1, 527.6 tok/s). That is 4.1 minutes staring at a blank screen before the first token. Decode is fast, but the wait kills the UX. Warmed steady-state is better (169.3 s at 128K) but still painful, and grows quadratically as you push context.

Standing on shoulders

This work stands on two recent papers (both excellent reads) plus two open-source components:

  • Speculative Prefill (Liu et al, arXiv 2502.02789) and Cross-Family Speculative Prefill (SambaNova, ICLR 2026). Insight: a small draft model's attention pattern over a long prompt faithfully predicts which tokens matter for the answer. Run the draft, score per-token importance, keep the top spans, drop the rest.
  • FlashPrefill (Fan et al, 2026). Block-sparse attention so the drafter itself does not pay O(S²) at 128K.
  • mit-han-lab/Block-Sparse-Attention (BSA) for the FA-2-derived sm_80+ sparse forward.
  • ggml / llama.cpp for the runtime. We link libggml*.a and never libllama.

Our contribution is the C++/CUDA composition of these two algorithms, in-process, on a 24 GB consumer card. As far as we are aware, the two papers had not been combined in an open implementation before.

What we built

  • In-process composition. Drafter forward (custom Qwen3-0.6B BF16 ggml graph), FlashPrefill scoring, sparse attention, target prefill, and DFlash spec decode all run in one C++/CUDA process sharing one ggml allocator. No subprocess, no IPC, no Python, Triton, or PyTorch in the inference loop.
  • CUDA port of FlashPrefill. The reference (qhfan/FlashPrefill) is Triton. We wrote 4 CUDA kernels from scratch (mean_K, score, select, sparse_fwd; the selection idea is sketched just after this list) and dispatched the sparse forward through mit-han-lab/Block-Sparse-Attention. BSA ships as a libtorch C++ extension; pulling 2 GB of libtorch into a 24 GB inference loop was a non-starter, so we wired it in via a 3-header ATen/c10 stub set under dflash/deps/bsa_stubs/.
  • 24 GB memory orchestration. Drafter (1.3 GB weights + KV + ~600 MB BSA scratch at 128K) and the DFlash daemon (15 GB target + 3 GB draft + 3 GB KV) do not coexist on a 3090. The daemon parks, unparks, and frees weights between stages over a stdin protocol; it costs ~3 s per request and lets the whole pipeline fit on a single consumer card.
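
To make the block-selection idea concrete, a small NumPy sketch of the mean_K / score / select path. This is an illustration under assumptions (block size, softmax pooling, the exact thresholding rule), not the CUDA kernels themselves:

python

# Illustrative NumPy sketch of draft-attention block selection; the real path is
# four CUDA kernels plus BSA's sparse forward, and the exact rule may differ.
import numpy as np

def select_blocks(q, k, block=128, alpha=0.85):
    """Per query row, keep the K-blocks whose pooled attention weight is within
    a factor alpha of that row's strongest block (higher alpha -> stricter ->
    fewer K-blocks per Q-row, mirroring DFLASH_FP_ALPHA)."""
    S, d = k.shape
    n_blocks = (S + block - 1) // block
    # mean_K: one pooled key vector per block of the prompt
    mean_k = np.stack([k[i * block:(i + 1) * block].mean(axis=0) for i in range(n_blocks)])
    # score: softmax over pooled blocks for each query row
    logits = q @ mean_k.T / np.sqrt(d)                          # (n_q, n_blocks)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    # select: boolean mask of the blocks each query row will actually visit
    return probs >= alpha * probs.max(axis=-1, keepdims=True)

q = np.random.randn(16, 64)       # 16 query rows, head dim 64
k = np.random.randn(4096, 64)     # a 4K-token prompt
mask = select_blocks(q, k)
print(mask.shape, mask.mean())    # fraction of K-blocks each query row keeps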

Setup

bash

# clone with submodules (pulls llama.cpp/ggml + Block-Sparse-Attention)
git clone --recurse-submodules https://github.com/Luce-Org/lucebox-hub
cd lucebox-hub/dflash

# build dflash + BSA kernel (sm_80+, ~10 min cold compile pulls cutlass)
cmake -B build -S . -DCMAKE_BUILD_TYPE=Release \
                    -DCMAKE_CUDA_ARCHITECTURES=86 \
                    -DDFLASH27B_ENABLE_BSA=ON
cmake --build build --target test_dflash test_flashprefill_kernels -j

# fetch weights (target + drafter + spec-decode draft)
huggingface-cli download unsloth/Qwen3.6-27B-GGUF Qwen3.6-27B-Q4_K_M.gguf --local-dir models/
huggingface-cli download Qwen/Qwen3-0.6B model.safetensors tokenizer.json --local-dir models/drafter/
huggingface-cli download z-lab/Qwen3.6-27B-DFlash --local-dir models/draft/

# bench
cd ../pflash && pip install -e .
python tests/niah_gen.py --n 1 --ctx 131072 --out /tmp/niah_128k.jsonl
python tests/bench_niah_cpp.py \
  --bin    ../dflash/build/test_dflash \
  --target ../dflash/models/Qwen3.6-27B-Q4_K_M.gguf \
  --draft  ../dflash/models/draft/model.safetensors \
  --drafter-gguf ../dflash/models/drafter/qwen3-0.6b.gguf \
  --cases /tmp/niah_128k.jsonl --keep-ratio 0.05

Numbers

Single-shot on RTX 3090, Qwen3.6-27B Q4_K_M target, q4_0 KV, DFLASH_FP_USE_BSA=1 DFLASH_FP_ALPHA=0.85 keep_ratio=0.05. NIAH single-needle as the end-to-end retrieval check. Baseline is vanilla llama.cpp with default f16 KV (apples-to-oranges on KV; q4_0 KV costs ~3% AL at short context, 8.56 to 8.33, benchmarked).

Context   PFlash TTFT   llama.cpp cold   Speedup (cold)   llama.cpp warmed
64K       13.5 s        134.95 s         10.0x            (smaller)
128K      24.8 s        248.4 s          10.0x            169.3 s

These are cold-cache numbers (first request after process boot). Warmed-vs-warmed is a smaller multiplier because llama.cpp settles into ~169 s at 128K once caches are hot. Both numbers are real and the right one depends on your workload; if you keep an engine resident, use warmed.

Decode after prefill is the standard DFlash spec-decode path with DDTree (~74 tok/s sustained on Qwen3.6-27B Q4_K_M).

Quality

NIAH single-needle (magic-key + 7-digit answer randomly placed in filler) retrieved at every context tested from 32K through 128K, keep_ratio=0.05, DFLASH_FP_ALPHA=0.85.

Honest flag: NIAH single-needle is a structurally easy probe for an attention-based selection method like ours, since the algorithm is well-suited to finding a single high-attention span. RULER and NIAH multi-needle are next on the list; a fair audit should wait for those numbers.

Why the stack works

Speculative prefill solves a quality problem: how do you compress without losing the answer-relevant content? FlashPrefill solves a speed problem inside the drafter step: how do you make the drafter fast enough at 128K that it doesn't become the bottleneck? They compose cleanly because the target side (DFlash spec decode) is unchanged; it just receives a much shorter prompt with full attention enabled.

At 128K, drafter scoring is now the dominant cost (~12 s of the 24.8 s TTFT). Target prefill on the compressed ~6.5K survivors is ~10 s; the remaining ~3 s is the park/unpark/free dance. The next obvious lever is a smaller or distilled drafter, which we have not done yet.

Tuning

bash

DFLASH_FP_USE_BSA=1     # dispatch sparse FA forward through BSA (sm_80+, required for 10x)
DFLASH_FP_ALPHA=0.85    # block-selection threshold; higher = stricter = fewer K-blocks per Q-row
DFLASH_FP_PROFILE=1     # log per-stage timings (mean_K / score / select / forward)

keep_ratio=0.05 is the default. 0.02 cuts target prefill from ~10 s to ~3 s but starts losing the needle. DFLASH_FP_ALPHA=0.99 cuts ~1 s at 128K with a small NIAH-margin loss. Calibration territory.
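
For a concrete sense of the keep_ratio budget: 0.05 of a 131072-token prompt is ~6.5K survivor tokens, the figure quoted above. A minimal sketch of the selection arithmetic (random stand-in scores; the real selection comes from the drafter's attention and keeps contiguous spans rather than isolated tokens):

python

# Hypothetical sketch of the keep_ratio budget; scores are random stand-ins.
import numpy as np

def keep_survivors(importance, keep_ratio=0.05):
    """Return the indices of the top keep_ratio tokens by importance score."""
    n_keep = max(1, int(importance.shape[0] * keep_ratio))   # 131072 * 0.05 ~= 6553
    return np.sort(np.argsort(importance)[-n_keep:])

importance = np.random.rand(131072)
print(len(keep_survivors(importance)))   # 6553, i.e. the ~6.5K survivors mentioned above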

Any feedback is more than welcome!

u/sandropuppo — 13 days ago

Hey fellow Local lovers, your time is precious, so I'll keep it short.

We built a GGUF port of DFlash speculative decoding. Standalone C++/CUDA stack on top of ggml, runs on a single 24 GB RTX 3090, hosts the new Qwen3.6-27B.

We call it Luce DFlash (https://github.com/Luce-Org/lucebox-hub; open source, MIT)

~1.98x mean over autoregressive on Qwen3.6 across HumanEval / GSM8K / Math500, with zero retraining (z-lab published a matched Qwen3.6-DFlash draft on 2026-04-26, still under training, so AL should keep climbing).

If you have CUDA 12+ and an NVIDIA GPU (RTX 3090 / 4090 / 5090, DGX Spark, other Blackwell, or Jetson AGX Thor with CUDA 13+), all you need is

bash

# After cloning the repo (link in the first comment):
cd lucebox-hub/dflash
cmake -B build -S . -DCMAKE_BUILD_TYPE=Release
cmake --build build --target test_dflash -j

# Fetch target (~16 GB)
huggingface-cli download unsloth/Qwen3.6-27B-GGUF Qwen3.6-27B-Q4_K_M.gguf --local-dir models/

# Matched 3.6 draft is gated: accept terms + set HF_TOKEN first
huggingface-cli download z-lab/Qwen3.6-27B-DFlash --local-dir models/draft/

# Run
DFLASH_TARGET=models/Qwen3.6-27B-Q4_K_M.gguf python3 scripts/run.py --prompt "def fibonacci(n):"

That's it. No Python runtime in the engine, no llama.cpp install, no vLLM, no SGLang. The binary links libggml*.a and never libllama.

Luce DFlash will

  • Load Qwen3.6-27B Q4_K_M target weights (~16 GB) plus the matched DFlash bf16 draft (~3.46 GB) and run DDTree tree-verify speculative decoding (block size 16, default budget 22, greedy verify).
  • Compress the KV cache to TQ3_0 (3.5 bpv, ~9.7x vs F16) and roll a 4096-slot target_feat ring so 256K context fits in 24 GB. Q4_0 is the legacy path and tops out near 128K.
  • Auto-bump the prefill ubatch from 16 to 192 for prompts past 2048 tokens (~913 tok/s prefill on 13K prompts).
  • Apply sliding-window flash attention at decode (default 2048-token window, 100% speculative acceptance retained) so 60K context still decodes at 89.7 tok/s instead of 25.8 tok/s.
  • Serve over an OpenAI-compatible HTTP endpoint or a local chat REPL.
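
For the HTTP endpoint, any OpenAI-compatible client should work. A minimal example (the base URL, port, and model id below are guesses for illustration; check the repo for the server's actual defaults):

python

# Minimal OpenAI-compatible client call against the local DFlash server.
# Base URL, port, and model id are placeholders, not documented defaults.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")
resp = client.chat.completions.create(
    model="qwen3.6-27b-dflash",   # placeholder model id
    messages=[{"role": "user", "content": "def fibonacci(n):"}],
    # temperature/top_p are accepted but ignored: verify is greedy-only
)
print(resp.choices[0].message.content)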

Running on RTX 3090, Qwen3.6-27B UD-Q4_K_XL (unsloth Dynamic 2.0) target, 10 prompts/dataset, n_gen=256:

Bench       AR tok/s   DFlash tok/s   AL     Speedup
HumanEval   34.90      78.16          5.94   2.24x
Math500     35.13      69.77          5.15   1.99x
GSM8K       34.89      59.65          4.43   1.71x
Mean        34.97      69.19          5.17   1.98x

As you can see, the speedup is real on consumer hardware, not a paper number. Target graph produces bit-identical output to autoregressive in AR mode; the draft graph matches the z-lab PyTorch reference at cos sim 0.999812. Q4_0 KV costs ~3% AL at short context (8.56 to 8.33) and wins at long context where F16 won't fit anyway.
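
For readers wondering how AL maps to the speedup column: each verify step accepts AL tokens on average but costs more than one AR step, and the table lets you back that cost out directly (plain arithmetic on the reported numbers, no new measurements):

python

# Implied cost of one DFlash verify step, derived from the table above.
rows = {  # bench: (AR tok/s, DFlash tok/s, AL)
    "HumanEval": (34.90, 78.16, 5.94),
    "Math500":   (35.13, 69.77, 5.15),
    "GSM8K":     (34.89, 59.65, 4.43),
}
for name, (ar, df, al) in rows.items():
    ar_step_ms   = 1000.0 / ar        # one autoregressive token
    spec_step_ms = 1000.0 * al / df   # one verify step yields AL tokens
    cost = spec_step_ms / ar_step_ms
    print(f"{name:9s} verify step ~= {cost:.2f} AR steps -> speedup {al / cost:.2f}x")
# HumanEval: ~2.65 AR steps per verify step -> 5.94 / 2.65 = 2.24x, matching the table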

Constraints: CUDA only, greedy verify only (temperature/top_p on the OpenAI server are accepted and ignored), no Metal / ROCm / multi-GPU. Repo started single-3090, recent community PRs added support for RTX 5090, DGX Spark / GB10, other Blackwell cards, and Jetson AGX Thor (sm_110 + CUDA 13).

Feedback more than welcome!
