r/Vllm

▲ 4 r/Vllm+1 crossposts

DGX Spark + vLLM: 2x NVFP4 models for an internal AI platform. Does this architecture make sense?

I'm building an internal AI platform for my company (around 10-20 active users).

Current stack:

- DGX Spark (128GB Unified Memory)
- LiteLLM as the gateway/router
- vLLM (2 separate instances)
- Open WebUI
- VS Code (Copilot Chat + MCP)

Planned models:

Chat / Review
- nvidia/Gemma-4-26B-A4B-NVFP4

Coding / Agent
- nvidia/Qwen3.6-35B-A3B-NVFP4

Current idea:

- Both vLLM instances support long context (128K-262K).
- LiteLLM routes requests based on workload and prompt size.
- Most coding/chat requests are limited to 32K context.
- Long-context requests are allowed only when necessary.

Typical routing:

- Chat -> Gemma
- Code generation -> Qwen
- Code review -> Gemma
- Large repo/document analysis -> Long-context model
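
To make the routing idea concrete, here is a minimal sketch of the decision logic in plain Python. It is not LiteLLM's API: the backend aliases (`gemma`, `qwen`, `long-context`), the workload names, and the 32,000-token cutoff are all assumptions picked to mirror the list above.

```python
# Hypothetical routing sketch; aliases and workload names are illustrative only.
FAST_CONTEXT_LIMIT = 32_000  # assumed cutoff, taken from the "32K" limit in the post

ROUTES = {
    "chat": "gemma",
    "code_generation": "qwen",
    "code_review": "gemma",
}


def pick_backend(workload: str, prompt_tokens: int) -> str:
    """Return the backend alias for a request.

    Anything explicitly tagged as long analysis, or any prompt above the
    fast-path limit, goes to the long-context backend.
    """
    if workload == "long_analysis" or prompt_tokens > FAST_CONTEXT_LIMIT:
        return "long-context"
    if workload not in ROUTES:
        raise ValueError(f"unknown workload: {workload!r}")
    return ROUTES[workload]


if __name__ == "__main__":
    print(pick_backend("chat", 2_000))
    print(pick_backend("code_generation", 50_000))
```

Keeping the size check ahead of the workload lookup means an oversized chat or coding prompt never lands on a 32K instance by accident.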

The goal isn't benchmarking. I care more about:
- Low latency
- Good concurrency
- Stable production behavior
- Efficient GPU memory usage

Questions:

  1. Would you run both models with 128K-262K max_model_len, or create separate "fast (32K)" and "long-context" vLLM instances?
  2. Any recommended vLLM tuning for DGX Spark (gpu-memory-utilization, max-num-seqs, batched tokens, chunked prefill, speculative decoding, etc.)?
  3. Has anyone benchmarked these NVFP4 models under concurrent real-world workloads (agent + MCP + coding), not just single-user token/sec?

I'd love to hear any production experience or lessons learned.

reddit.com
u/illNin0 — 1 day ago
▲ 1 r/Vllm+1 crossposts

I've created the Repairable AI Interchange Format for structured data that saves 10% of tokens using a vLLM plugin

I've recently been curious about how well JSON fits the LLM nature at all. Is it optimized for non-deterministic processes? I thought there should be a better approach and better format that can handle the LLM quirks and just be more efficient for LLMs instead of being made for humans.

And I've created RAIF. It's not only a standard, but a multi-level system made up of a standard, LoRAs, and a vLLM plugin.

On the benchmark that has a lot of different types of JSON, the avg token saving is 10%, but this number fluctuates based on the JSON type and tokenizer type. One of the best performing types is when JSON has a lot of repetitive data, and in this case, savings went up to 70% of tokens.

The coolest part in my opinion is that this RAIF thing is compatible with all the existing clients and harnesses, because the vLLM plugin converts RAIF that's an LLM's output to JSON deterministically before it reaches the client, so you're getting a fully compatible API as it was before.

The only problem I have is streaming with response_format: while streaming, the output is plain RAIF, and it only becomes JSON if you turn streaming off.

Also, for it to work, I've fine-tuned some models using LoRAs and created 3 of them for now:

Even the tiniest model performs nicely with RAIF.

There's an option to run RAIF without a plugin, so clients will get pure RAIF instead of JSON. For this, I've created light Python and TypeScript packages without any dependencies.

Right now, I really want to get feedback and see how well it fits existing needs. It's more of an experiment for self-education, so I don't know if there's a real use case; that's why I'm asking for the community's help.

https://reddit.com/link/1ugymq4/video/b7glacjcus9h1/player

u/truehaZker — 1 day ago
▲ 12 r/Vllm

Qwen3.6-27B-FP8 with vllm:nightly, opencode unusable?

Hey all,

I'm at my wits end here, hoping someone might have some answers.

When using opencode (or forks like kilocode), after making some tool calls, inference on the backend stops, and opencode just waits until timeout.

I'm running on 4 RTX 8000s (SM75). I've tried all the chat templates, the coder and xml tool call parser, disabling reasoning, swapping between DFlash and MTP, but nothing seems to solve this issue.

Could this be a triton bug or something silly like that? I've had access to other, newer hardware at work that doesn't seem to display the same issue on flash based kernels.

u/waka324 — 2 days ago
▲ 0 r/Vllm

Built a small tool to help choose GPUs for vLLM deployments. Looking for feedback.

I have been spending a lot of time figuring out which GPUs can actually run different models without manually checking VRAM, quantization, and pricing across providers.

So I built a small tool that recommends compatible GPUs, compares options across providers, and ranks the best fit based on an algorithm we built on the backend.

It's still very early, and I'd really appreciate feedback from people actually using vLLM.

What would make something like this genuinely useful in your workflow?

https://stratuspilot.io

u/Major_Border149 — 3 days ago
▲ 25 r/Vllm+1 crossposts

Got GLM-5.2 + MTP speculative decode running on 4× DGX Spark (GB10) — and the build piece the public recipe is missing

TL;DR: the recipe's image-build mods aren't actually public – I reconstructed them from the public kernels (with Claude) – and you have to build vLLM at the author's exact pinned ref or the real AWQ weights crash on load. Running now at ~9.4 tok/s on my own 4× GB10.

Saw a link on X to CosmicRaisins' GLM-5.2 stack for 4× GB10: vLLM TP=4, MTP speculative decode, ported sparse-MLA Triton kernels (the Hopper-only _flashmla_C path doesn't exist on sm_121), and a data-free 15% expert prune so the AWQ-INT4 weights fit. Great work. I'd actually tried vanilla vLLM for GLM-5.2 on these boxes months ago and it fell over around 512-token context, so I'd been serving it on llama.cpp RPC (~5 tok/s) instead – a working sparse-MLA + MTP path was exactly what I'd been after. Porting it to my own 4-node Spark cluster, I hit two walls worth sharing:

  1. The image isn't reproducible from the public repo. The README points at two vLLM mods in a spark-vllm-docker fork, but they aren't actually published (only the kernels are). So I reconstructed them from the public kernels – a single build-recon-image.sh that bakes the kernels in, patches deep_gemm.py (route the 3 DSA fns to the sm12x_* fallbacks on the sm_120/121 family, before the _missing() gate) and sparse_attn_indexer.py (drop the has_deep_gemm gate on sm12x), auto-applies the flashmla→Triton monkeypatch, and pip install b12x==0.23.0. The wiring validates with a quick import check on the GPU.

  2. The base vLLM ref really matters. Building on a newer vLLM than the author's pinned commit made the real AWQ weights crash at process_weights_after_loading (_k_scale.fill_ → async CUDA error: invalid argument). Dummy weights loaded fine, so it was specific to real-weight processing. Rebuilding vLLM at the author's exact ref fixed it instantly. If you port this: pin the ref.

Other port notes: you can skip the 378 GB weight download – the 15% prune is deterministic from the cyankiwi AWQ base via the repo's awq_surgery.py (~20 min, pure safetensors surgery). On nodes with less free memory, gpu-memory-utilization 0.93 trips the boot guard – drop to 0.90 + lower max-model-len. No shared FS? NFS-export the weights from the head. And set the RoCE HCA/GID-index for your fabric.

Result: serving fine, coherent output, ~9.4 tok/s decode on a single RoCE rail – roughly 2× the llama.cpp fallback it replaced (MTP acceptance ~2.8/4). The author gets ~20 with dual-rail – the inter-node allreduce bandwidth is the decode bottleneck, so the 2nd rail is the ~2× lever (still debugging NCCL dual-rail GID resolution on mine).

Full notes + my fork + the reconstruction script: https://github.com/anvarazizov/glm-5.2-gb10

Huge credit to CosmicRaisins for the kernels/prune/MTP work — this is just the integration glue to make it portable. Would love for the maintainer to vendor the build script so nobody else has to reverse-engineer it.

u/anvarazizov — 4 days ago
▲ 55 r/Vllm+4 crossposts

Free open-source LLM inference handbook: 100+ clones in week 1

Hi everyone, I'm writing a practitioner's handbook on LLM inference in public, on GitHub.

When I started working on LLM serving infrastructure, I couldn't find a single resource that covered the full picture: the memory bandwidth math, the prefill/decode asymmetry, KV cache management, continuous batching, speculative decoding, quantization tradeoffs, all in one place, with real numbers.

Plenty of great blog posts cover individual topics well. But nothing tied them together into a coherent mental model for someone building inference systems end to end. So I started writing it. Chapter by chapter, in the open, with the math shown.

Foundations chapter 00 is ready, hope it helps.

The plan:

- A new chapter every week with practical notebooks
- All source on GitHub, open to issues and corrections
- A companion Substack newsletter for each chapter. The link is in the GitHub README.

If you're an engineer working on LLM infrastructure, or thinking about it, this might be a good resource for you.

github.com/harshuljain13/llm-inference-at-scale

u/YouFirst295 — 8 days ago
▲ 25 r/Vllm+2 crossposts

2× Radeon AI PRO R9700 (RDNA4/gfx1201) on vLLM 0.22.1 — how we fixed the long-context decode cliff (and what we learned chasing FP8)

Posting our setup for the (apparently growing) club of people running multiple R9700s on vLLM. Big shout-out to u/AustinM731 — their AITER Unified Attention post was the single most useful thing we found, and I want to (a) confirm it works, (b) share where our findings lined up vs differed, and (c) save the next person the week we spent going down dead ends.

The rig

  • GPUs: 2× AMD Radeon AI PRO R9700 (gfx1201 / RDNA4, 32 GB each), TP=2
  • Board/CPU: ASRock X870E, Ryzen, 60 GB RAM
  • OS: Fedora 44 Server, kernel 7.0.11 (the ~100 W idle-draw bug is fixed in 7.0 — already not an issue for us)
  • Model: Qwen3.6-35B-A3B-FP8 (the 35B hybrid Gated-DeltaNet + attention MoE, ~3B active), native 262K context
  • Serving: MTP speculative decoding (n=3), AITER Unified Attention, bf16 KV cache, TunableOp, --enable-chunked-prefill

Exact versions (so people know what this is on)

GPU arch     : gfx1201 (RDNA4) ×2, TP=2
OS / kernel  : Fedora Linux 44 (Server), kernel 7.0.11-200.fc44
vLLM         : 0.22.1
ROCm / HIP   : 7.2.x (torch.version.hip = 7.2.53211)
PyTorch      : 2.10.0 (+git8514f05)
Triton       : 3.6.0
AITER        : present (gfx1201 gate relaxed; see below)
base image   : vllm/vllm-openai-rocm:v0.22.1  (we run a committed image with 2 one-line patches)
runtime      : podman + systemd (--user), --ipc=host, NCCL_PROTO=Simple, ROCR_VISIBLE_DEVICES=0,1

Note on versioning: vLLM moves fast and the gfx1201 gates change between releases. On 0.22.1 the AITER unified-attention backend is already built in (just gated to CDNA). On the 0.19/0.20 images others used, you had to rebuild. So your patch surface depends heavily on your vLLM version — worth stating yours when you compare numbers.

The thing that actually mattered: the long-context decode cliff

For ages we only ever benchmarked at ~8K context and were happy (~100+ tok/s). Then we benchmarked deep, and decode fell off a cliff:

context   ROCm prefill-decode attn (before)
~8K       ~100 tok/s
~21K      56 tok/s
~79K      14 tok/s

That ~7× collapse is not normal memory-bandwidth decay — it was the unoptimized ROCm attention path on gfx1201 scaling badly. The fix is exactly what u/AustinM731 found: AITER Unified Attention (ROCM_AITER_UNIFIED_ATTN).

On vLLM 0.22.1 the backend is already compiled in — it's just gated to CDNA (MI300/MI350). Relax one gate and select it:

  • In vllm/_aiter_ops.py, is_aiter_found_and_supported() returns on_mi3xx(). Make it also allow gfx1x: return on_mi3xx() or bool(getattr(_rocmmod, "_ON_GFX1X", False))
  • Run with --attention-backend ROCM_AITER_UNIFIED_ATTN, VLLM_ROCM_USE_AITER=1, and turn the others off (VLLM_ROCM_USE_AITER_MHA=0, _PAGED_ATTN=0, _MOE=0, _LINEAR=0) — those have no gfx1201 kernel and will crash MoE init otherwise. Plus FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE.
  • It auto-sets KV block size to 64 (power-of-2), which sidesteps the AITER TILE_SIZE assert on the Qwen3.6 hybrid layout.
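
For anyone who wants the settings above in one place, here is a rough sketch that collects the environment and the launch command. The environment variable names are the post's abbreviations expanded with the `VLLM_ROCM_USE_AITER` prefix, which is my reading of the shorthand. The model string is a placeholder, and the MTP speculative-decoding options are left out because the post does not spell them out.

```python
import shlex

# Placeholder model identifier; substitute the path or repo id you actually serve.
MODEL = "Qwen3.6-35B-A3B-FP8"


def aiter_unified_env() -> dict:
    """Environment described in the post: AITER on, the other AITER paths off."""
    return {
        "VLLM_ROCM_USE_AITER": "1",
        "VLLM_ROCM_USE_AITER_MHA": "0",
        "VLLM_ROCM_USE_AITER_PAGED_ATTN": "0",
        "VLLM_ROCM_USE_AITER_MOE": "0",
        "VLLM_ROCM_USE_AITER_LINEAR": "0",
        "FLASH_ATTENTION_TRITON_AMD_ENABLE": "TRUE",
    }


def serve_command(model: str = MODEL, tensor_parallel: int = 2) -> list:
    """Build the argv for the server; only flags named in the post are used."""
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel),
        "--attention-backend", "ROCM_AITER_UNIFIED_ATTN",
        "--enable-chunked-prefill",
    ]


def as_shell_line() -> str:
    """Render env + command as a single copy-pasteable shell line."""
    env_part = " ".join(f"{k}={v}" for k, v in aiter_unified_env().items())
    return f"{env_part} {shlex.join(serve_command())}"


if __name__ == "__main__":
    print(as_shell_line())
```

This only helps once the gate in `_aiter_ops.py` has been relaxed as described; without that patch the backend selection is still refused on gfx1201.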

Result (Qwen3.6-35B-A3B-FP8, TP2, MTP3, bf16 KV) — strictly faster at every depth, gap widens with context:

context   before (tok/s)   AITER unified (tok/s)
~8.7K     ~100             136
~21K      56               83
~79K      14               41 (≈3×)
~118K     collapsed        30

Quality unchanged (still bf16 KV). For a context-filling coding agent this was night and day.
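
As a sanity check on the ratios quoted above, this short snippet recomputes them from the reported tok/s figures. It treats the "~8K" and "~8.7K" rows as the same measurement point, which is an assumption on my part, and it skips the ~118K row because there is no "before" number to divide by.

```python
# Decode throughput in tok/s as reported in the post, keyed by approximate context.
before = {"~8K": 100, "~21K": 56, "~79K": 14}
after = {"~8K": 136, "~21K": 83, "~79K": 41}


def ratio(numerator: float, denominator: float) -> float:
    """Ratio rounded to two decimals for readability."""
    return round(numerator / denominator, 2)


# How far decode fell between ~8K and ~79K on the original attention path.
collapse = ratio(before["~8K"], before["~79K"])

# Speedup from switching attention backends at each context depth.
speedups = {ctx: ratio(after[ctx], before[ctx]) for ctx in before}

if __name__ == "__main__":
    print("collapse before the fix:", collapse)
    for ctx, s in speedups.items():
        print(ctx, "speedup:", s)
```

The numbers line up with the post: about 7× collapse before the fix, and a speedup that grows from roughly 1.4× at short context to about 2.9× at ~79K.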

How our findings compared to u/AustinM731's post

Confirmed / same:

  • AITER Unified Attention is THE long-context fix on gfx1201. Relaxing the CDNA gate to include RDNA4 is the move.
  • MTP=3 is the sweet spot (~84% draft acceptance for us, free single-stream speed).
  • That fast attention path is bf16/fp16 KV only — you can't pair it with FP8 KV.
  • The 100 W idle issue is fixed in kernel 7.0.

Different / what we'd add:

  • Newer vLLM = less patching. They were on 0.19.1/0.20.2 and rebuilt images; on 0.22.1 the unified-attn backend already ships — it's a one-line Python gate relax + the --attention-backend flag. No full rebuild.
  • TP=2 on hybrid models needs the GDN-KKT fix. vLLM ≥0.21 mis-compiles the Gated-DeltaNet chunk_scaled_dot_kkt Triton kernel on gfx1201 (a Hopper WGMMA layout change, #42076) → TP≥2 hangs at startup with a misleading shm_broadcast timeout. One-line revert of that operand layout on non-CUDA fixes it. If you run Qwen3.6/Qwen3-Next hybrids on TP2, you probably need this.
  • We went deep on FP8 KV and concluded it's a dead end on gfx1201 — skip it. The 262K-context dream via FP8 KV isn't worth it: the stock vLLM fp8 decode kernel does a per-element fp32 dequant that's ~3× slower; we wrote a kernel patch (fold the scalar scale → cast to bf16) that got it 34→41.5 tok/s, and even probed native fp8 WMMA (compiles on RDNA4!) and int32-packed loads — none beat bf16, and AITER unified requires bf16 KV anyway. Qwen3.6's KV footprint is tiny, so just run bf16.
  • The HIP "custom paged attention" kernel is unreachable for this model. It's hard-gated off for hybrid GDN models (stride-padded KV layout → has_native_kv_cache_layout is false), so even bf16 falls back to Triton. Don't chase it for Qwen3.6.
  • Context headroom: with bf16 KV our pool is ~768K tokens, so at the model's native 262K you still get ~2.9× concurrency. No need for FP8 KV to reach max context.
  • 2 GPUs vs their 4: our single-stream decode holds ~30 tok/s at 118K (they hold higher on 4×). Long-context decode scales with how much compute/bandwidth you can throw at it.
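
The context-headroom bullet is simple division, sketched below. The exact token counts are assumptions: I read "~768K" as 768,000 tokens and "262K" as 262,144 (2^18), which is the combination that reproduces the ~2.9× figure.

```python
# Assumed exact values behind the rounded figures in the post.
KV_POOL_TOKENS = 768_000   # "~768K tokens" of bf16 KV cache
MAX_MODEL_LEN = 262_144    # "native 262K" context, assumed to be 2**18


def full_context_concurrency(pool_tokens: int, max_model_len: int) -> float:
    """How many max-length sequences fit in the KV pool at once."""
    return pool_tokens / max_model_len


headroom = round(full_context_concurrency(KV_POOL_TOKENS, MAX_MODEL_LEN), 2)

if __name__ == "__main__":
    print("full-context sequences that fit:", headroom)
```

In practice most requests are far shorter than the maximum, so the real concurrency the pool supports is higher than this worst-case number.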

TL;DR config for gfx1201 + Qwen3.6 on vLLM 0.22.1

  • Patch 1: revert #42076 operand layout on non-CUDA (GDN-KKT) → TP2 works
  • Patch 2: allow ROCM_AITER_UNIFIED_ATTN on gfx1x in _aiter_ops.py
  • Flags: --attention-backend ROCM_AITER_UNIFIED_ATTN, AITER on but MHA/paged/MoE/linear off, MTP n=3, bf16 KV, TunableOp, chunked prefill
  • Don't bother with FP8 KV.

Happy to share the exact patches/compose if anyone wants them. Thanks again to u/AustinM731 — the unified-attention tip was the unlock.

u/whodoneit1 — 9 days ago
▲ 55 r/Vllm+5 crossposts

Profile v2: A physics-grounded, cost-aware optimizer for vLLM

A simple CLI tool that helps you fine-tune your vLLM server.

Profile deeply scans your inference engine (vLLM to begin with) and your GPU, calculates your hardware limits mathematically, and uses metrics from vLLM to show you the waste, its cause, and finally tips to fix it.

It does not stop there: it waits for you to apply the tips and then keeps iterating until your AI server is tuned to get the most out of its limits, or there are no more issues.

A closed loop optimizer for vLLM.

Github: https://github.com/jungledesh/profile
Live Demo + Docs: https://jungledesh.github.io/profile/index.html

I'd love to have any feedback, and answer any q's / concerns.

u/Inevitable-Diet-1870 — 10 days ago
▲ 5 r/Vllm

Trying to get vLLM off the ground...

Let me apologize up front - I am not a system admin or LLM architect. I'm a hobbyist who has a small home AI server that I'm expanding into something a little more than a personalized chatbot. Up until now, I've been using ollama, and for my purposes, it has been working just fine. However, I'm starting to explore more agentic uses for my AI server, and ollama just isn't making the grade.

For context, I'm running a GMKtec EVO-X2 (AMD Ryzen AI Max+ 395 with Radeon 8060S (gfx1151) - 128GB total memory, 32GB System, 96GB VRAM). I am running ROCm 7.2.4 drivers. I've set up a Docker container for vLLM, and it appears to start, but it has an issue with the model I've pulled locally. I'm trying to run a Qwen 3.6 27B Uncensored model I've pulled from Hugging Face (GGUF), and the logs say that "qwen35" is not a supported GGUF model architecture yet. Looking at the model's page on HF, I do not see available files for config, tokenizer, safetensors, etc.

Should I have the container pull the model from HF on startup? If so, is there a way for me to control where on my system the model gets stored?

Again, I'm very new to this space and learning as I go, so please go easy on me if I'm not seeing the obvious answer here. Thank you.

u/GuitarEC — 9 days ago
▲ 31 r/Vllm+5 crossposts

How LLM inference actually works at scale — a breakdown for anyone learning ML systems

One thing that confused me early on: I understood how LLMs are trained, but had no idea how they actually serve millions of requests efficiently.

Here's a quick breakdown of the key concepts:

Why inference is harder than it looks

A user sends a prompt → the model returns tokens. Simple on the surface. But underneath, the system is managing GPU memory, scheduling thousands of concurrent requests, and generating tokens one at a time in a loop.

KV Cache — every time the model generates a token, it needs to remember the context of everything before it. This is stored in a KV (key-value) cache. For long conversations, this cache can consume more GPU memory than the model weights themselves.
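
A quick back-of-the-envelope version of that claim, as a sketch. The formula is the standard one (keys plus values, per layer, per KV head). The model shape below is an illustrative 7B-class dense model without grouped-query attention, not any specific deployment; models with GQA or hybrid layers will have a much smaller per-token footprint.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_elem: int) -> int:
    """Bytes of KV cache one token occupies: K and V for every layer and head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


# Illustrative shape (assumed): 32 layers, 32 KV heads, head_dim 128, fp16.
LAYERS, KV_HEADS, HEAD_DIM, FP16_BYTES = 32, 32, 128, 2
PARAMS = 7_000_000_000

per_token = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, FP16_BYTES)
weight_bytes = PARAMS * FP16_BYTES

# Total cached tokens (summed over all live requests) at which the cache
# is as large as the weights themselves.
breakeven_tokens = weight_bytes // per_token

if __name__ == "__main__":
    print("KV bytes per token:", per_token)               # 524288 (0.5 MiB)
    print("tokens to match weight size:", breakeven_tokens)
```

With this shape the cache overtakes the weights at under 27K cached tokens in total, which a handful of concurrent long conversations reaches easily.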

Continuous Batching — naively, you'd process one request at a time. Modern systems batch many requests together and schedule at the token level — finished requests leave the batch, new ones enter. This keeps the GPU busy and dramatically improves throughput.
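
Here is a toy simulation of the difference, counting decode steps only. It is a simplified model under stated assumptions: every running request produces exactly one token per step, prefill cost is ignored, and the request lengths are made up for illustration.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Finished requests leave after every step and waiting ones take their slot."""
    waiting = deque(lengths)
    running = []  # remaining tokens to generate for each active request
    steps = 0
    while waiting or running:
        # Fill any free slots before the next decode step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Every active request emits one token; those with one left are done.
        running = [r - 1 for r in running if r > 1]
    return steps


# Two long requests mixed with short ones (tokens to generate per request).
lengths = [8, 2, 2, 2, 8, 2, 2, 2]
static_steps = static_batching_steps(lengths, batch_size=4)
continuous_steps = continuous_batching_steps(lengths, batch_size=4)

if __name__ == "__main__":
    print("static:", static_steps, "continuous:", continuous_steps)
```

With static batching, the short requests in each batch finish early and their slots sit idle until the long one completes. Token-level scheduling reuses those slots, so the same work finishes in fewer steps.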

Tensor Parallelism — when a model is too large for one GPU, you split it across multiple GPUs. Each GPU holds a shard of the weight matrices and they communicate during the forward pass.
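
The sharding idea can be shown without any GPU at all. This sketch uses plain Python lists to split a weight matrix by columns across two pretend devices, compute each partial result independently, and concatenate them, which plays the role of the all-gather. It covers only the column-parallel case and assumes the column count divides evenly by the number of shards.

```python
def matmul(x, w):
    """Plain matrix multiply on nested lists: (n x k) @ (k x m)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)]
            for row in x]


def split_columns(w, parts):
    """Split a weight matrix into `parts` equal column shards."""
    step = len(w[0]) // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]


def column_parallel_matmul(x, w, parts):
    """Each shard computes its slice of the output; slices are concatenated."""
    partials = [matmul(x, shard) for shard in split_columns(w, parts)]
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1],
     [0, 1, 1, 3]]

full = matmul(x, w)
sharded = column_parallel_matmul(x, w, parts=2)

if __name__ == "__main__":
    print(full)
    print(sharded == full)
```

Each shard only ever holds half of `w`, which is the memory saving; the concatenation step is the communication cost that real systems pay over NVLink, PCIe, or the network.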

The most important insight: there isn't one way to "scale" inference. High traffic needs replicas. Large models need tensor parallelism. Low GPU utilization needs better scheduling. Long contexts need KV cache management. Picking the wrong solution for the wrong bottleneck wastes money and doesn't fix the problem.

I've been writing a deep-dive series on all of this — just published Part 6 on parallelism strategies with hands-on experiments and code if anyone wants to go deeper:

https://pawankjha.substack.com/p/architecting-llm-inference-part-6

Happy to answer questions on any of this in the comments!

u/ArchitectingAI — 9 days ago
▲ 40 r/Vllm+2 crossposts

4 RTX 6000 Pro

Hey everyone,
I’m getting a 4-GPU node with RTX 6000 Pro Blackwell Edition cards, and I’m trying to estimate how many simultaneous users I could realistically support when serving MiniMax M2.7 NVFP4.
I’m planning to use FP8 KV cache, and likely LMCache backed by DDR5 RAM to help with longer-context workloads.
I’m trying to decide on the best parallelism strategy for this setup:
  • TP=2 + PP=2
  • TP=4
  • Any other recommended configuration?
The main things I’m trying to understand are expected concurrency, TTFT, tokens/sec, and whether pipeline parallelism makes more sense than full tensor parallelism on this kind of node.
I’d really appreciate any insights from people who have deployed something similar, especially with vLLM, LMCache, NVFP4 models, or RTX 6000 Blackwell GPUs.
Thanks!

u/Some-Manufacturer-21 — 11 days ago
▲ 18 r/Vllm

I got tired of not understanding how vLLM works under the hood, so I built my own mini inference engine from scratch.

Hey everyone,

I’ve been using stuff like vLLM and TensorRT-LLM for a while, but I realized I didn't actually understand how they were squeezing so much performance out of GPUs. To figure it out, I decided to just build one myself.

I put together an open-source project called TokenForge. It’s not another API wrapper or chatbot UI—it’s a barebones inference engine built from the ground up in PyTorch and CUDA just to test out how all these optimization tricks actually work.

Here’s what I ended up implementing:

  • Continuous Batching: Instead of waiting for a whole batch to finish, it injects new requests on the fly so the GPU is never just sitting there idle.
  • Paged KV Cache: I essentially copied the homework of PagedAttention to stop VRAM fragmentation. It pre-allocates memory blocks so you don't randomly run out of memory.
  • Custom Kernels: Standard PyTorch was bottlenecking things, so I wrote some raw C++/CUDA and Triton kernels (including a custom Flash-Attention setup).
  • Speculative Decoding: Paired the main model with a tiny draft model to guess tokens ahead of time and speed up generation.
  • A Live Dashboard: Hooked everything up to a FastAPI backend with a UI that tracks VRAM, power draw, and tokens/sec in real-time.
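
For readers who have not seen the paged KV cache idea before, here is a minimal sketch of the bookkeeping. It is not TokenForge's or vLLM's code: the class and method names are made up for illustration, and it tracks block ids only, not the tensors themselves.

```python
class PagedKVAllocator:
    """Toy block allocator: fixed-size blocks handed out on demand."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # pre-allocated pool
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> tokens stored so far

    def append_token(self, seq_id: str) -> None:
        """Reserve room for one more token, taking a new block only when needed."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:  # no blocks yet, or last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


if __name__ == "__main__":
    alloc = PagedKVAllocator(num_blocks=4, block_size=16)
    for _ in range(17):
        alloc.append_token("a")
    print(len(alloc.block_tables["a"]), "blocks used for 17 tokens")
    alloc.free("a")
    print(len(alloc.free_blocks), "blocks free after release")
```

Because a sequence only ever wastes space inside its last block, and freed blocks go straight back to the pool, memory does not fragment the way it does when each request reserves one contiguous max-length buffer.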

Honestly, this was mostly a massive learning project for me to understand memory bandwidth and GPU scheduling. If you're curious about how LLMs actually run at a low level, or just want to roast my C++ code, I'd love some feedback!

https://github.com/prathamsingh404/TokenForge-GPU-Accelerated-LLM-Inference-Research-Platform

u/Top-Ear-1161 — 9 days ago
▲ 3 r/Vllm

what broke first when your setup got real traffic?

I’m curious about actual vLLM serving pain, not benchmark numbers.

When you moved from “it runs” to “people/jobs are actually hitting the endpoint,” what was the first thing that broke?

Was it:

  • OOM
  • TTFT
  • throughput
  • prefix caching not helping as much as expected
  • long-context requests killing everyone else
  • bad batching settings
  • cold starts
  • OpenAI-compatible endpoint weirdness
  • multi-GPU / tensor parallel issues
  • logs not making the bottleneck obvious

Would be useful to see some configs, e.g. max batched tokens, max model len, rough concurrency, etc.

u/michaelmanleyhypley — 11 days ago
▲ 3 r/Vllm+2 crossposts

vllm vs llama.cpp vs ollama vs sglang

What's your take?

do you manage to get single developer/person workflows spawning subagents to gain from the parallel-optimized engines?

from:

https://github.com/murataslan1/local-ai-coding-guide/blob/main/guides/runner-comparison.md

Are you a single developer on desktop?
├─ Yes → Do you want simplicity? → Ollama
│        Want fine control? → llama.cpp
└─ No → Running a team server?
   ├─ High throughput needed → vLLM
   └─ Structured JSON outputs → SGLang

u/rs38 — 12 days ago
▲ 16 r/Vllm

vLLM vs llama.cpp vs ollama

I thought I would share some benchmarking results I made with the gpt-oss:20b and gemma-4-26b-qat models. I'm using a very budget setup (2× RTX 5060 Ti 16GB).

Full article: Benchmarking AI Models | personal wiki

gpt-oss:20b

Edit: decided to repeat gpt-oss:20b test (on new hardware). Also, added sglang for comparison.

https://preview.redd.it/y0af28s4o28h1.png?width=1487&format=png&auto=webp&s=33ca2cf1456f924e05c88f363ab9f20283b12cd8

gemma-4-26b-qat

https://preview.redd.it/lr5r2t2ywg7h1.png?width=1252&format=png&auto=webp&s=c219444abdee9512484a805bac4e6a2784864a01

u/Outrageous-Nobody-87 — 13 days ago
▲ 2 r/Vllm+1 crossposts

Which client do you usually use to test different VLMs?

I found it surprisingly hard to find good benchmarks for evaluating AI agent transcription and meeting-summary workflows, so I built this (comment)

I’m curious whether others here have found better benchmark suites, evaluation methods, or open-source tools for comparing agent performance in this space.

u/PeriniM_98 — 14 days ago