u/aurelienams

▲ 40 r/Olares+2 crossposts

First sm_120 BeeLlama.cpp benchmark on consumer Blackwell mobile: 107 t/s at FULL 262K context on Qwen3.6 27B (+48% vs MTP, +22% vs vLLM Genesis)

Saw the BeeLlama.cpp post here last week claiming 135 t/s on Qwen3.6 27B Q5 + vision + 200K context on a single RTX 3090. Sounded too good. My best Qwen3.6 27B path on Olares One (RTX 5090M Laptop, 24GB GDDR7, 896 GB/s, sm_120 Blackwell consumer mobile, Core Ultra 9 275HX, 96GB DDR5) was 88 t/s on vLLM + Genesis 28-patch + MTP n=3, or 72.75 t/s at FULL 262K on llama.cpp + MTP.

Built BeeLlama from source for sm_120, tested it. The post wasn't cherry-picked.

TL;DR: 107.54 t/s AVG (10 clean runs, range 101.70-119.38) at FULL 262K context. Zero CUDA OOM. Zero degradation cycle. New strict best Qwen3.6 27B path on consumer Blackwell — fastest AND longest in one stack.

Stack

  • Custom image: aamsellem/beellama-cpp:0.1.1 (amd64 + CUDA 13 + sm_120, built from Anbeeld/beellama.cpp v0.1.1)
  • Target: unsloth/Qwen3.6-27B-GGUF (UD-Q3_K_XL, 14.5 GB — NOT the MTP-baked variant; BeeLlama uses DFlash spec decoding, not MTP)
  • Drafter: spiritbuun/Qwen3.6-27B-DFlash-GGUF, file dflash-draft-3.6-q8_0.gguf (1.85 GB)
  • KV cache: turbo3 (3-bit Walsh-Hadamard rotated, ~25% smaller than q4_0)
  • Spec: --spec-type dflash --spec-dflash-cross-ctx 1024
  • Batch: 2048, ubatch: 256, flash-attn on, mlock, no-mmap
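
Assembled into one invocation, that looks roughly like this — a sketch, not a paste from my chart: flag spellings follow the BeeLlama fork's --help as used in my other posts, model paths are wherever you downloaded the GGUFs, and full native 262K is -c 262144:

llama-server \
  --model Qwen3.6-27B-UD-Q3_K_XL.gguf \
  --model-draft dflash-draft-3.6-q8_0.gguf \
  --spec-type dflash --spec-dflash-cross-ctx 1024 \
  -ngl 99 -ngld 99 \
  -ctk turbo3 -ctv turbo3 \
  --batch-size 2048 --ubatch-size 256 \
  -fa on --mlock --no-mmap \
  -c 262144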

Methodology

Space Invaders HTML prompt, 2000 tokens, temp 0.6 / top_k 20 / min_p 0.0. 2 warmups, then the measured runs listed per row in the table below (5 or 10 per context size).
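
The harness is just curl against llama-server's /completion endpoint, reading the server-reported decode speed. A minimal sketch, assuming the fork keeps upstream's timings field and an 800-token generation length (both assumptions — the actual harness is linked in the Reproducible section):

#!/usr/bin/env bash
# 2 warmups (discarded) + 10 measured runs at one context size
PROMPT=$(cat space-invaders-prompt.txt)   # the ~2000-token prompt
for i in $(seq 1 12); do
  TPS=$(curl -s http://localhost:8080/completion \
    -H 'Content-Type: application/json' \
    -d "$(jq -n --arg p "$PROMPT" \
          '{prompt: $p, n_predict: 800, temperature: 0.6, top_k: 20, min_p: 0.0}')" \
    | jq '.timings.predicted_per_second')
  if [ "$i" -le 2 ]; then echo "warmup $i: $TPS t/s (discarded)"
  else echo "run $((i-2)): $TPS t/s"; fi
done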

Context sweep on RTX 5090M

| Context | Runs | AVG t/s | Range | KV cache (turbo3) |
|---|---|---|---|---|
| 96K | 10 | 106.67 | 97.84-115.36 | ~3 GB |
| 128K | 5 | 116.0 | 107.12-127.32 | ~4 GB |
| 200K | 5 | 108.5 | 100.51-122.82 | ~6 GB |
| 262K (full native) | 10 | 107.54 | 101.70-119.38 | ~8 GB |

Perf is essentially flat across context sizes. turbo3 KV scales gracefully — even at 262K full native the stack fits in 24 GB with headroom. No 5-fast/4-slow degradation cycle like the one I posted about with Gemma 4 DFlash on vLLM last week.

The 128K sweet spot is real and reproducible. Best guess is cudagraph capture sizes aligning with prefill chunks at exactly that range.

Comparison vs my other Qwen3.6 27B paths on the same hardware

| Path | Context | t/s | Stack |
|---|---|---|---|
| BeeLlama (this) | 262K FULL | 107.54 | llama.cpp fork + DFlash + turbo3 KV |
| vLLM Genesis Turbo | 88K | 88 | vLLM + 28 patches + MTP n=3 + TurboQuant K8V4 |
| buun-DFlash | 96K | 76 | llama.cpp + DFlash (no MTP claim, no CopySpec) |
| llama.cpp MTP | 262K FULL | 72.75 | am17an MTP branch + unsloth UD-Q3_K_XL + q4_0 KV |
  • +48% vs llama.cpp MTP at the same 262K context and target quant
  • +22% vs vLLM Genesis Turbo, which only ran at a third of the context (88K)
  • +40% vs buun-DFlash, which also ran at far less context (96K)

Fork chain (for context)

ggml-org/llama.cpp → TheTom/llama-cpp-turboquant (turbo2/3/4 KV) → spiritbuun/buun-llama-cpp (DFlash for Qwen 3.6) → Anbeeld/beellama.cpp (MTP claim, CopySpec, reasoning-loop protection)

None of these forks publish a Linux Docker image for sm_120. The build via docker buildx --platform linux/amd64 --build-arg CUDA_DOCKER_ARCH=120 from an M-series Mac took ~50 min through qemu emulation. Image is 2.67 GB, on Docker Hub as aamsellem/beellama-cpp:0.1.1.
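
For anyone rebuilding it, the invocation was essentially this (the Dockerfile path is an assumption — upstream llama.cpp keeps its CUDA image under .devops/, and the fork may differ):

docker buildx build --platform linux/amd64 \
  --build-arg CUDA_DOCKER_ARCH=120 \
  -f .devops/cuda.Dockerfile \
  -t aamsellem/beellama-cpp:0.1.1 \
  --push .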

Why it wins over MTP @ same 262K (analysis, not certainty)

Three combined factors:

  1. DFlash drafter vs MTP head: spiritbuun's q8_0 DFlash drafter for Qwen 3.6 was specifically tuned by z-lab on Qwen 3.6's output distribution. Higher accept than the MTP head baked into havenoammo's GGUF.
  2. turbo3 vs q4_0 KV: ~25% smaller → more compute buffer headroom → bigger batch.
  3. batch 2048 / ubatch 256 vs 512/512: more prefill packing per scheduler cycle.

I haven't isolated which of the three contributes the most yet — that's the next bench.
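
The ablation is mechanical enough to script. A sketch, holding everything else from the Stack section constant (model paths are placeholders; bench.sh is the curl loop from the methodology section):

# Isolate factor 2 (KV quant) and factor 3 (batch geometry); factor 1 needs a
# separate run on the MTP branch instead of the DFlash drafter.
for kv in q4_0 turbo3; do
  for geom in 512:512 2048:256; do
    b=${geom%:*}; ub=${geom#*:}
    echo "=== KV=$kv batch=$b ubatch=$ub ==="
    llama-server --model target.gguf --model-draft dflash-draft.gguf \
      --spec-type dflash -ctk "$kv" -ctv "$kv" -fa on \
      --batch-size "$b" --ubatch-size "$ub" --port 8080 &
    SRV=$!
    sleep 90                 # crude: wait for model load
    ./bench.sh               # record t/s for this combo
    kill "$SRV"; wait "$SRV" 2>/dev/null
  done
done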

Gotchas

  • If you have a havenoammo/Qwen3.6-27B-MTP-UD-GGUF cached, BeeLlama refuses to load it: done_getting_tensors: wrong number of tensors; expected 866, got 862. The MTP head bakes in 4 tensors BeeLlama's loader doesn't recognize. Use the non-MTP unsloth variant (quick check sketched after this list).
  • Multi-GPU broken in this fork (issue #7). Single-GPU only.
  • BeeLlama hasn't synced upstream master since April 23 — won't get new llama.cpp builds (b9130+) until Anbeeld rebases.
  • No Genesis 28-patch maintenance burden, but you do depend on Anbeeld maintaining the fork.
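
For the first gotcha, a quick way to check a cached file before pointing BeeLlama at it (sketch assumes the gguf-dump CLI that ships with llama.cpp's gguf-py package; adjust the grep if your version's text output differs):

pip install gguf                                  # provides gguf-dump
gguf-dump Qwen3.6-27B-UD-Q3_K_XL.gguf | grep -m1 -i 'tensor'
# The MTP-baked variant carries the 4 extra mtp tensors that trip the
# "expected 866, got 862" loader error above.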

Reproducible

Helm chart, exact image tag, all flags, bench harness: https://github.com/aamsellem/olares-one-market/tree/main/llamacppqwen36beellamaone (v1.0.1).

If you run a different sm_120 card (5070 Ti, 5080, 5090 desktop, 5090M), the aamsellem/beellama-cpp:0.1.1 image should work as-is. 5090 desktop with 32GB and 1.79 TB/s should land around 150-180 t/s if my mobile-to-desktop bandwidth scaling holds — let me know your numbers.


Hardware: Olares One (RTX 5090M Laptop, 24GB, sm_120 Blackwell mobile)
Image: aamsellem/beellama-cpp:0.1.1 (custom build, source: https://github.com/Anbeeld/beellama.cpp v0.1.1)
Helm chart: https://github.com/aamsellem/olares-one-market/tree/main/llamacppqwen36beellamaone
Full blog writeup: https://airelien.dev/en/posts/beellama-cpp-262k-blackwell-mobile/

u/aurelienams — 18 hours ago
▲ 7 r/Olares+1 crossposts

Gemma 4 MTP on RTX 5090 Laptop (sm_120 24GB): E2B 206 t/s, 26B-A4B 140 t/s @ 78% accept (beats AtomicChat M5Max ref), E4B 178 t/s via vLLM

Hey everyone — first public Gemma 4 MTP bench on consumer Blackwell mobile that I'm aware of (RTX 5090M Laptop GPU, sm_120, 24GB GDDR7 — the GPU in the new Olares One). Both stacks now have working Gemma 4 MTP support, so I tested all three model variants we have public drafters for.

TL;DR

| Stack | Model | t/s | Accept | Notes |
|---|---|---|---|---|
| llama.cpp + AtomicChat fork | Gemma 4 E2B | 206.6 | 60.9% | Single-stream cap for ~5B model |
| vLLM nightly + PR #41745 | Gemma 4 E4B | 178.6 | 77.3% | 100% upstream stack, 1 PR |
| llama.cpp + AtomicChat fork | Gemma 4 26B-A4B | 140.0 | 78.1% | Beats AtomicChat M5Max ref (138 t/s) |

All three are first runs (no warmup), 3000+ generated tokens each. MTP confirmed firing in logs. Steady state probably 5-10% higher.

Stack 1: vLLM nightly + Gemma 4 E4B (178 t/s, 77% accept)

PR #41745 by lucianommartins merged 2026-05-06 14:39 UTC, nightly Docker published 2026-05-07 06:13 UTC. Image: vllm/vllm-openai:nightly-1acd67a795ebccdf9b9db7697ae9082058301657.

exec vllm serve google/gemma-4-E4B-it \
  --served-model-name gemma-4-e4b-mtp \
  --max-model-len 32000 \
  --gpu-memory-utilization 0.85 \
  --dtype auto \
  --enable-prefix-caching \
  --speculative-config '{"method":"mtp","model":"google/gemma-4-E4B-it-assistant","num_speculative_tokens":3}'

Bench:

Run 1 (cold): 800 tok in 6.17s = 129.73 t/s
Run 2:        800 tok in 4.17s = 191.73 t/s
Run 3:        800 tok in 3.73s = 214.38 t/s
AVG = 178.6 t/s, 77.3% draft acceptance

Stack 2: llama.cpp + Atomic Chat fork + E2B (206 t/s)

Fork: AtomicBot-ai/atomic-llama-cpp-turboquant (branch feature/turboquant-kv-cache). Adds gemma4_assistant arch + TurboQuant KV cache (-ctk turbo3 -ctv turbo3) + --mtp-head runtime flag.

GGUFs: unsloth/gemma-4-E2B-it-GGUF (target Q8_0) + AtomicChat/gemma-4-E2B-it-assistant-GGUF (drafter Q4_K_M, 75 MB).
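
To fetch both (filenames as used in the command below; huggingface-cli comes from the huggingface_hub package):

huggingface-cli download unsloth/gemma-4-E2B-it-GGUF \
  gemma-4-E2B-it-Q8_0.gguf --local-dir models
huggingface-cli download AtomicChat/gemma-4-E2B-it-assistant-GGUF \
  gemma-4-E2B-it-assistant.Q4_K_M.gguf --local-dir models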

llama-server \
  --model gemma-4-E2B-it-Q8_0.gguf \
  --mtp-head gemma-4-E2B-it-assistant.Q4_K_M.gguf \
  --spec-type mtp \
  --draft-block-size 3 --draft-max 8 --draft-min 0 \
  -ngl 99 -ngld 99 \
  -ctk turbo3 -ctv turbo3 -ctkd turbo3 -ctvd turbo3 \
  -fa on -c 131072

Bench:

prompt eval: 22 tok in 0.224s = 98.27 t/s
eval:        3198 tok in 15.48s = 206.56 t/s
draft acceptance: 60.93%

Stack 3: llama.cpp + Atomic Chat fork + 26B-A4B (140 t/s, 78% accept)

Same fork, different model. Target unsloth/gemma-4-26B-A4B-it-GGUF/UD-Q4_K_XL.gguf (~17 GB) + drafter AtomicChat/gemma-4-26B-A4B-it-assistant-GGUF Q4_K_M (325 MB).

Bench:

prompt eval: 22 tok in 0.164s = 134.45 t/s
eval:        3238 tok in 23.12s = 140.03 t/s
draft acceptance: 78.15%  (1974 accepted / 2526 generated)

Beats AtomicChat's M5Max reference (138 t/s). Notable because 5090M Laptop has ~75% the bandwidth of an RTX 4090, but the MoE Gemma 4 (3.8B activated of 26B) extracts a lot from it.

Why 78% acceptance is high

For comparison, Qwen3.6 27B + MTP llama.cpp (PR #22673) on the same hardware tops out at ~64% acceptance. The Gemma 4 drafter delivers higher because:

  1. It's trained jointly with the target (not a standalone "small Gemma" repurposed)
  2. The centroid LM head (top_k=32, num_centroids=2048) compresses the 262K vocab to a 4K mask — faster AND more aligned predictions
  3. The 26B-A4B specifically benefits from MoE routing being deterministic at inference, so the drafter can match patterns reliably

VRAM math (24 GB consumer mobile)

| Model | Quant | KV (q4_0 / turbo3) | Total | Headroom |
|---|---|---|---|---|
| E2B | Q8_0 (4.7 GB) | ~1 GB @ 128K | ~6 GB | 18 GB |
| E4B (vLLM) | auto (6 GB) | ~1.5 GB @ 32K | ~8 GB | 16 GB |
| 26B-A4B | Q4_K_XL (17 GB) | ~3 GB @ 64K | ~20 GB | 4 GB |

The 26B-A4B is tight — need to bump HAMi cap to 24400m and use turbo3 KV (3-bit Hadamard rotation, more compact than q4_0) to fit comfortably.
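
For the HAMi bump, something like this (deployment and container names are hypothetical; HAMi takes the per-pod cap via the nvidia.com/gpumem resource limit — check your HAMi/Olares version for the exact key and units):

kubectl -n ai patch deployment gemma4-26ba4b --type strategic -p \
  '{"spec":{"template":{"spec":{"containers":[{"name":"llama-server",
    "resources":{"limits":{"nvidia.com/gpumem":"24400"}}}]}}}}'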

What's NOT covered

  • MLX — community is asking on Reddit but no support yet (only mlx-community has the bf16 weights converted)
  • Mainline llama.cpp — AtomicChat fork only for now. Upstream PR will probably follow (their fix for gemma4_assistant arch is small and clean)
  • Vision — Gemma 4 mmproj NOT compatible with MTP in current AtomicChat fork. Text-only for now.

Recipes / charts

For Olares One owners — both stacks are packaged in my market source as installable apps:

  • gemma4e2bone v1.0.2 (E2B + atomic fork)
  • gemma426ba4bone v1.0.9 (26B-A4B + atomic fork)
  • vllmgemma4e4bone (the vLLM E4B path — chart bump pending)

Source URL: https://orales-one-market.aamsellem.workers.dev

Credits

  • Google DeepMind for Gemma 4 + the official MTP drafters (E2B/E4B/26B-A4B/31B)
  • lucianommartins for vLLM PR #41745 (clean architecture, centroids masking with CUDA graph acceleration)
  • AtomicChat team for the llama.cpp fork + MTP-quantized GGUFs (HF collection)
  • vLLM core team for the rapid nightly publishing post-merge

Open questions to the community

  • If you run on other Blackwell consumer cards (5070, 5080, 5090 desktop) — please post your t/s, we don't have those datapoints publicly yet
  • Anyone reproduced the 26B-A4B 78% acceptance on Ampere (3090, 4090) — does it scale similarly?
  • Is there any plan to upstream the AtomicChat fork's gemma4_assistant support to mainline llama.cpp? The patch is small.

Full writeup with timeline + crash logs + comparison vs Qwen3.6 stacks: link to my blog post

u/aurelienams — 6 days ago
▲ 85 r/Olares+1 crossposts

Qwen3.6-27B DFlash on a 24GB RTX 5090 Laptop (sm_120) — 80 t/s avg via spiritbuun's buun-llama-cpp + Q8_0 GGUF drafter

Following the recent flurry of DFlash work (z-lab paper, Lucebox port, spiritbuun fork), I tried to reproduce on consumer Blackwell mobile — a small home box with an RTX 5090 Laptop GPU (24GB GDDR7, 896 GB/s, sm_120).

TL;DR: 73.94 / 80.31 / 85.06 t/s on three Space Invaders generations (max_tokens=800). AVG ~80 t/s. From a catastrophic 0.97 t/s to 80 t/s in one week, thanks to spiritbuun's fix for my issue #35.

The journey (with timestamps)

  • 2026-04-28 — I publish a blog post titled "Why DFlash on Qwen3.6-27B doesn't fit on 24GB single GPU". Argument: z-lab drafter is 6 GiB BF16, doesn't fit after the target.
  • 2026-04-30 — spiritbuun/Qwen3.6-27B-DFlash-GGUF lands on HF. Q8_0 drafter at 1.75 GB. The VRAM math suddenly works.
  • 2026-04-30 — I build spiritbuun/buun-llama-cpp for sm_120 (CUDA 13.1 + -DGGML_CUDA_NO_VMM=ON + -DCMAKE_CUDA_ARCHITECTURES=120 + libcuda.so.1 stub link). First bench: 3.4 → 1.5 → 0.97 t/s, degrading run over run. Filed issue #35.
  • 2026-05-01 — spiritbuun replies: "I think this may be fixed now - can you repull and give it another try?"
  • 2026-05-04 — Rebuild with HEAD aecbbd5d (8 commits past my v0.1.0, notably cab1fb597, "dflash: add p_min confidence threshold + adaptive draft length"). Re-bench: 80 t/s avg.

Bench numbers

Run 1: 800 tok in 10.82s = 73.94 t/s
Run 2: 800 tok in  9.96s = 80.31 t/s
Run 3: 800 tok in  9.41s = 85.06 t/s

Comparison on the same hardware

| Backend | Stack | t/s avg |
|---|---|---|
| llama.cpp standard | UD-Q4_K_XL, no spec | 33-36 |
| vLLM Turbo | v0.20.0 + Sandermage Genesis + TurboQuant K8V4 + MTP n=3 | 88 |
| buun-llama-cpp DFlash | HEAD aecbbd5d + Q8_0 GGUF drafter | 80 |
| vLLM vanilla (different setup) | 0.19.1 + AutoRound INT4 + MTP n=3 | 99 peak |

For context: Lucebox already published DFlash on RTX 3090 24GB at 78 t/s HumanEval / 70 t/s Math500 (sm_86 Ampere) using their custom engine + BF16 z-lab drafter. Today's Lucebox PR #86 reports 218 t/s on RTX 5090 desktop 32GB. So our 80 t/s on RTX 5090M 24GB sits right between Lucebox 3090 and Lucebox 5090 desktop, on a different stack (buun fork instead of Lucebox custom).

What's actually new

  • First public DFlash result via buun-llama-cpp on sm_120 mobile (Lucebox path uses their own engine; Lucebox 5090 desktop on PR #86 used a custom build, not buun)
  • First reproduction confirming the cab1fb597 perf fix on real 24GB consumer hardware (was untested before)
  • Stack uses Q8_0 quantized drafter (not BF16) — frees enough VRAM that the math just works, no compromises elsewhere

The recipe

Image: built from spiritbuun/buun-llama-cpp master HEAD with:

cmake -B build \
  -DGGML_CUDA=ON -DGGML_CUDA_NO_VMM=ON \
  -DCMAKE_CUDA_ARCHITECTURES=120 \
  -DCMAKE_EXE_LINKER_FLAGS="-Wl,-rpath-link,/usr/local/cuda/lib64/stubs" \
  -DCMAKE_SHARED_LINKER_FLAGS="-Wl,-rpath-link,/usr/local/cuda/lib64/stubs" \
  -DCMAKE_BUILD_TYPE=Release && \
cmake --build build --target llama-server -j$(nproc)

llama-server args:

--model unsloth/Qwen3.6-27B-Q4_K_M.gguf
--model-draft spiritbuun/dflash-draft-3.6-q8_0.gguf
--spec-type dflash
--n-gpu-layers 99 --n-gpu-layers-draft 99
--ctx-size 32000 --ctx-size-draft 256
--batch-size 256 --ubatch-size 64
--parallel 1 --flash-attn on --jinja
--chat-template-kwargs '{"enable_thinking": false}'

Important: disable thinking (enable_thinking: false). spiritbuun's README notes the drafter wasn't trained on the think-wrapped distribution — leaving thinking on collapses acceptance and costs roughly 1.8× in throughput.
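
Quick sanity check that the flag took — if a <think> block shows up in the reply, acceptance is about to crater:

curl -s http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"Say hi"}],"max_tokens":32}' \
  | jq -r '.choices[0].message.content'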

Things I haven't tried that should push 100+ t/s

  • DDTree budget tuning (Lucebox uses 22 for 218 t/s on desktop 5090, default in buun likely sub-optimal)
  • --no-fused-gdn ON vs OFF — recent buun commit 905483277 added this debug flag
  • p_min adaptive draft length sweep
  • Pushing context to 64-80K (32K is conservative)

Bonus: PFlash also lands today

While I was writing this up, u/sandropuppo posted PFlash — speculative prefill, complementary to DFlash decode. 10× faster TTFT at 128K on RTX 3090. The pflash/ dir was merged into Lucebox-hub main today. Combining DFlash decode (this post) + PFlash prefill on consumer 24GB Blackwell would close the long-context UX gap completely. Next bench session.

Worth noting: llama.cpp MTP also entered beta today

Same day, u/ilintar posted that llama.cpp MTP is in beta thanks to am17an's PR #22673, tested on Qwen3.6 27B + Qwen3.6 35B-A3B with 75% acceptance at 3 draft tokens and a 2× speedup over baseline. It depends on PR #22400 (partial seq_rm for GDN), which we needed for hybrid spec decoding. So llama.cpp now has BOTH MTP (PR #22673) AND DFlash (this post, via the buun fork) paths — feature parity with vLLM is closing fast.

Credits

  • spiritbuun for the fork + the Q8_0 drafter + the 24h fix turnaround
  • z-lab/dflash for the block-diffusion method
  • Lucebox for proving the 24GB consumer DFlash path on RTX 3090 first
  • unsloth for the Qwen3.6-27B Q4_K_M GGUF target

Full write-up with timestamps and all the iteration mistakes: https://airelien.dev/en/posts/dflash-27b-24gb-debloque/ (EN, FR also at /fr/posts/).

Anyone with a 5090M / 4080M / 3090 24GB who wants to reproduce, I'd love to see your numbers.

u/aurelienams — 10 days ago
▲ 7 r/Olares+1 crossposts

Spent the past few days trying to package Lucebox DFlash (134 t/s on RTX 3090) into a Docker image for my Olares One (RTX 5090 Mobile, sm_120 consumer Blackwell). Nobody had done this on Blackwell consumer or under Kubernetes/HAMi vGPU before, so a few things broke that aren't documented anywhere.

Three undocumented build fixes:

  1. libcuda.so.1 not found at link time → CUDA devel image only ships libcuda.so (no .1 suffix). Need a stub symlink and -Wl,-rpath-link,/usr/local/cuda/lib64/stubs in CMAKE_EXE_LINKER_FLAGS. LIBRARY_PATH alone doesn't work because indirect dependencies are resolved differently. (Sketch after this list.)
  2. Same fix needed in both cmake invocations (Lucebox dflash target + the embedded llama.cpp fork's llama-server target). Forgot one → another 1h of compile down the drain.
  3. -DGGML_CUDA_NO_VMM=ON to dodge the runtime crash described below — only a partial mitigation.
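
Sketch of fixes 1 and 3 together, as they ended up in the working build (stub path per the nvidia/cuda devel images; repeat the flags for both cmake invocations):

# inside the nvidia/cuda:*-devel build stage
ln -s /usr/local/cuda/lib64/stubs/libcuda.so /usr/local/cuda/lib64/stubs/libcuda.so.1
cmake -B build \
  -DGGML_CUDA=ON -DGGML_CUDA_NO_VMM=ON \
  -DCMAKE_CUDA_ARCHITECTURES=120 \
  -DCMAKE_EXE_LINKER_FLAGS="-Wl,-rpath-link,/usr/local/cuda/lib64/stubs" \
  -DCMAKE_SHARED_LINKER_FLAGS="-Wl,-rpath-link,/usr/local/cuda/lib64/stubs"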

Image is on Docker Hub as aamsellem/lucebox-qwen36-blackwell:1.0.0 (and :1.1.0 with NO_VMM) if anyone wants to try.

The real find — a systemic bug in HAMi-core

After getting the image to build, the pod kept crashing at startup with [HAMI-core ERROR ...]: Illegal device id: <random> — a different random integer on every restart, which screams uninitialized stack variable. Read the source of Project-HAMi/HAMi-core and found that 6 hooks in src/cuda/memory.c and src/allocator/allocator.c all call cuCtxGetDevice(&dev) without checking the return code. When the calling thread doesn't have a CUDA context attached (which happens during ggml-cuda init paths, and more often on sm_120 because of slightly different init order), dev stays uninitialized on the stack and gets forwarded down to set_current_device_memory_limit, which logs the random ID and writes out of bounds into the shared region.

This affects anything using HAMi vGPU, not just Lucebox — recent llama.cpp llama-server with default GGML_CUDA=ON (VMM enabled since llama.cpp#11446), vLLM with async scheduling, and ComfyUI multi-thread workflows can all hit it.

Fix is upstream: HAMi-core #187 (issue) + PR #188, +36/-21 lines, two commits. Initialise dev from prop->location.id for cuMemCreate, bail out gracefully (skip the per-device tracking) at the 5 sites in allocator.c, and tighten set_current_device_memory_limit to return early instead of writing OOB.

Not merged yet. If you're running HAMi (Olares users notably ship beclab/hami:v2.6.14 from March 2026, which has the same pattern), upvotes/reviews on the PR would help speed things along.

Saga write-up: I'm publishing the whole thing as a blog series at https://airelien.dev/en/tags/saga/ — all 7 episodes are up, walking through the 6 builds, the 3 build fixes, the source dive, and the upstream PR. Plain devlog, no marketing.

Happy to share build logs, the patched libvgpu.so, or a reproducer pod manifest if anyone hit the same issue and wants to compare notes.

u/aurelienams — 16 days ago
▲ 31 r/Olares

Following u/Kindly-Cantaloupe978's 80 t/s @ 218K context post and Wasif Basharat's 85 t/s Medium write-up, I tried to reproduce on my Olares One — a small home-AI box with an RTX 5090 Laptop GPU (24GB, ~896 GB/s, sm_120 Blackwell), not the 32GB desktop card.

After several iterations: ~85-100 t/s sustained, peaks at 99.7 t/s, 75K max context, MTP n=3 with 92-95% acceptance once warm. That's roughly 3x faster than llama.cpp on the same hardware (33-36 t/s with the best NVFP4 GGUF) and matches/beats the 32GB desktop references.

TL;DR numbers

| Setup | Hardware | t/s |
|---|---|---|
| llama.cpp UD-Q4_K_XL or NVFP4 GGUF | RTX 5090M 24GB | 33-36 |
| vLLM v0.17 NVFP4 (no MTP) | RTX 5090M 24GB | 39 |
| vLLM v0.19.1 NVFP4 + MTP n=1 | RTX 5090M 24GB | OOM (model OK, MTP head 2.37 GiB doesn't fit) |
| vLLM 0.19.1 + Lorbus AutoRound + MTP n=1 | RTX 5090M 24GB | 65 |
| vLLM 0.19.1 + Lorbus AutoRound + MTP n=3 | RTX 5090M 24GB | 85-100 |
| Reference: same recipe on 5090 desktop 32GB | RTX 5090 32GB | 78-80 |
| Reference: Wasif's stack on 3090 24GB | RTX 3090 24GB | 85 |

Five gotchas specific to 24GB Blackwell mobile

1. NVFP4 + MTP = OOM on 24GB

I tried sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP first. NVFP4 is 2x FP8 throughput on Blackwell tensor cores, and the model name says it includes MTP. Loaded fine, but:

torch.OutOfMemoryError: Tried to allocate 2.37 GiB. GPU has 2.25 GiB free.

Same issue Wasif documents. vLLM's Qwen3_5MTP loader allocates a fresh 2.37 GiB BF16 buffer for mtp.fc because NVFP4 quantizes everything in the file. On 32GB it fits, on 24GB it doesn't.

Fix: switch to Lorbus/Qwen3.6-27B-int4-AutoRound, which dequantizes only mtp.fc to BF16 in the file (~280 MiB). vLLM finds it on disk, no fresh buffer.

Trade-off: AutoRound INT4 uses Marlin kernels (Ampere-tuned) instead of native NVFP4 tensor cores. But MTP n=3 brings way more speed than NVFP4 acceleration would have on a bandwidth-bound consumer card.

2. --kv-cache-dtype fp8_e5m2 rejected with NVFP4 checkpoints

ValueError: fp8_e5m2 kv-cache is not supported with fp8 checkpoints.

AutoRound INT4 isn't FP8 family, so fp8_e5m2 works there. Bonus: it gives more KV pool than fp8_e4m3 (Olares One ends up with 23,760 cached tokens with fp8_e5m2 + gpu-mem-util 0.97).

3. PR vllm#36325 (Blackwell TMA fix) is mandatory on sm_12x

Without it, Triton autotuner OOMs at warmup. is_tma_supported returns True for any compute capability ≥9 but Blackwell consumer doesn't really do TMA — descriptor buffer allocations blow up VRAM. PR caps at < 12. 4-line patch I cherry-picked into a custom image.

4. patch_tolist_cudagraph.py is now public

The previously-private patch from Wasif's article is now in noonghunna/qwen36-27b-single-3090/patches/. 165 lines, fixes a .tolist() CPU sync that breaks CUDA graph capture during warmup's continuation-chunk simulation when spec-decode + chunked-prefill combine. Required even with fp8 KV (not just TurboQuant).

5. MTP n=3 actually fits on 24GB with Lorbus

I expected n=3 to OOM (Wasif's article warns about it on 24GB with sakamakismile). With Lorbus's dequantized mtp.fc and --gpu-memory-utilization 0.97, n=3 fits fine. Acceptance length peaks at 3.86 out of a max of 4 (3 drafts + 1 bonus token; 98%/96%/92% per-position), and generation throughput peaks at 99.7 t/s.

The recipe

Custom Docker image (FROM vllm/vllm-openai:v0.19.1-cu130):

  • Apply vllm-project/vllm#36325.diff at build time
  • Mount patch_tolist_cudagraph.py and run it before vllm serve via entrypoint wrapper
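
A sketch of that image (the in-image site-packages path is resolved at build time; adjust the patch strip level to however the PR's diff is rooted):

cat > Dockerfile <<'EOF'
FROM vllm/vllm-openai:v0.19.1-cu130
# Bake in the Blackwell TMA cap from vllm-project/vllm#36325
RUN apt-get update && apt-get install -y --no-install-recommends patch \
 && rm -rf /var/lib/apt/lists/*
COPY 36325.diff /tmp/36325.diff
RUN cd "$(python3 -c 'import os, vllm; print(os.path.dirname(os.path.dirname(vllm.__file__)))')" \
 && patch -p1 < /tmp/36325.diff
EOF
docker build -t vllm-qwen36-blackwell:0.19.1 .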

vLLM args:

--model Lorbus/Qwen3.6-27B-int4-AutoRound
--quantization auto_round
--dtype float16
--attention-backend flashinfer
--kv-cache-dtype fp8_e5m2
--max-model-len 75000
--gpu-memory-utilization 0.97
--max-num-seqs 1
--max-num-batched-tokens 2048
--language-model-only
--enable-prefix-caching
--enable-chunked-prefill
--enable-auto-tool-choice
--tool-call-parser qwen3_coder
--reasoning-parser qwen3
--speculative-config '{"method":"mtp","num_speculative_tokens":3}'

Env:

VLLM_USE_FLASHINFER_SAMPLER=1
VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
VLLM_MARLIN_USE_ATOMIC_ADD=1
VLLM_FLOAT32_MATMUL_PRECISION=high
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:512
NCCL_CUMEM_ENABLE=0
NCCL_P2P_DISABLE=1
OMP_NUM_THREADS=1
CUDA_DEVICE_MAX_CONNECTIONS=8

Live metrics (steady state)

Avg generation throughput: 85-100 t/s (variance with content)
Peak: 99.7 t/s
Mean acceptance length: 3.20 → 3.86 (out of 4 max: 3 drafts + 1 bonus token)
Per-position acceptance: 98%/93%/88%
Avg draft acceptance rate: 92-95%
Model loading: 16.87 GiB
KV pool: 23,760 tokens (3.24 GiB)
KV cache usage during generation: 21-31%

Notes

  • Variance: speeds drop to 65-70 t/s on creative/transition text where MTP acceptance falls to ~70%, climb back to 95+ t/s on predictable patterns (boilerplate code, structured output). Same "MTP variance" Wasif documents.
  • Why we beat the 32GB references: probably the combination of Lorbus + flashinfer + chunked-prefill at n=3 lands well, and the laptop card's lower bandwidth is masked by the high MTP acceptance. Bandwidth math: 60% of desktop 5090 (896 vs 1500 GB/s) → ceiling ~50 t/s without spec, ×~2 acceptance length → ~100 t/s achievable, which is what we see.
  • Could NVFP4 still help? If anyone publishes a Qwen3.6-27B NVFP4 quant with mtp.fc dequantized in the file (Lorbus-style trick applied to NVFP4 instead of AutoRound), 24GB Blackwell mobile would likely push past 100 t/s. The 2x tensor core speed would compound with MTP n=3.

Credits

Happy to share the custom Dockerfile or the Helm chart if it helps anyone running on consumer Blackwell mobile. Curious if other 5090M / 4080M / 3090 24GB owners can reproduce these numbers.
