Tried ROCm 7.1 vs Vulkan/RADV on Radeon 890M for LLM inference (8B and 35B-MoE). Vulkan won both. Why?
Posting because I expected the opposite result, and I want to know whether I
misconfigured ROCm or whether this is just the current state of things on
890M-class iGPUs.
Hardware: Beelink SER9 Pro, Radeon 890M iGPU (16 RDNA 3.5 CUs), 32GB
LPDDR5x-7500. Ubuntu 24.04, kernel 6.11.
Two backends tested:
ROCm 7.1 — installed via the official AMD repo, targeting gfx1150 but
running gfx1100 binaries as a fallback, since gfx1150 isn't fully supported
yet. Built llama.cpp with -DGGML_HIPBLAS=ON.
Vulkan/RADV — mesa 24.x, llama.cpp (and LMStudio for the bigger model)
built with -DGGML_VULKAN=ON.
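For anyone reproducing this, the two builds looked roughly like the below. Treat it as a sketch: build directory names and compiler paths are illustrative, and the exact flag set may differ across llama.cpp versions.

```shell
# ROCm/HIP build (assumes ROCm 7.1 under /opt/rocm; paths are illustrative)
cmake -B build-hip -DGGML_HIPBLAS=ON \
      -DCMAKE_C_COMPILER=/opt/rocm/llvm/bin/clang \
      -DCMAKE_CXX_COMPILER=/opt/rocm/llvm/bin/clang++
cmake --build build-hip -j

# Vulkan build (needs Vulkan headers at build time, mesa/RADV at runtime)
cmake -B build-vk -DGGML_VULKAN=ON
cmake --build build-vk -j
```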
Two workloads:
WORKLOAD A — Gemma 4 E4B Q8_0 (8B dense, full offload, 4K ctx):
- ROCm: ~12.5 tok/s
- Vulkan/RADV: ~16.0 tok/s
WORKLOAD B — Qwen 3.5 35B A3B Q4_K_M (35B MoE, 15–20 of ~48 layers offloaded,
4–8K ctx):
- ROCm: ~14 tok/s (I had to fight harder to get partial offload working;
  LMStudio's ROCm path on gfx1150 was noticeably less stable than its
  Vulkan path)
- Vulkan/RADV via LMStudio: 20–22 tok/s steady
In both cases, same machine, same model file, same prompt. Power and
thermals were similar between backends — this is throughput, not
heat-throttling.
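Concretely, the Workload B runs were invocations along these lines, assuming builds named build-hip and build-vk as above. The -ngl value and model filename are illustrative, not exact:

```shell
# ROCm binary, offloading ~18 of ~48 layers to the iGPU (illustrative)
./build-hip/bin/llama-cli -m qwen3.5-35b-a3b-q4_k_m.gguf \
    -ngl 18 -c 8192 -p "..."

# Vulkan binary, same model file and layer split
./build-vk/bin/llama-cli -m qwen3.5-35b-a3b-q4_k_m.gguf \
    -ngl 18 -c 8192 -p "..."
```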
My read on why:
- gfx1150 (RDNA 3.5) doesn't have first-class kernel support in ROCm 7.1
yet. Falling back to gfx1100 binaries leaves perf on the table.
- The Vulkan backend in upstream llama.cpp got Wave32 flash-attention
+ graphics-queue scheduling patches in early 2026 that haven't landed
in the ROCm path yet.
- For the 890M's iGPU class specifically, the integrated nature means
memory bandwidth dominates, and Vulkan's path through RADV seems better
optimized for shared LPDDR5x access patterns.
- For partial offload specifically, Vulkan handles the GPU-CPU layer
  boundary more cleanly in LMStudio than ROCm did.
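A quick sanity check on the bandwidth point. Both the bus width (128-bit) and the bytes read per token (~4.5 GB of active weights at Q8_0 for an E4B-style model) are my assumptions, not measurements, but the resulting ceiling lands in the mid-20s tok/s, which is consistent with both backends being bandwidth-limited rather than compute-limited:

```shell
# Roofline sketch: assumed 128-bit bus (16 bytes/transfer) and ~4.5 GB
# of active weights read per generated token (both figures are assumptions).
awk 'BEGIN {
    bw = 7500e6 * 16            # LPDDR5x-7500 peak: ~120 GB/s
    printf "ceiling ~%.0f tok/s\n", bw / 4.5e9
}'
```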
Open questions for the sub:
- Anyone running gfx1150-targeted ROCm builds (not gfx1100 fallback)?
Does perf shift?
- Is the picture different at the Strix Halo 8060S iGPU class? More CUs,
more bandwidth, possibly closer ROCm parity.
- Is there a ROCm build flag I'm missing for this iGPU class?
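On the first question, these are the knobs I know of for controlling the target. Hedged: the AMDGPU_TARGETS variable name varies between ROCm components and llama.cpp versions, so check your tree before copying:

```shell
# Build natively for gfx1150 instead of relying on the gfx1100 fallback
cmake -B build-hip -DGGML_HIPBLAS=ON -DAMDGPU_TARGETS=gfx1150
cmake --build build-hip -j

# Or force the runtime to present as gfx1100 (the fallback I benchmarked)
HSA_OVERRIDE_GFX_VERSION=11.0.0 ./build-hip/bin/llama-cli ...
```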
Not trying to dunk on ROCm — I want to use it for the unified-memory story
on iGPUs, but Vulkan is faster on this class today. Curious if that flips
with ROCm 8.x or with bigger silicon.