
We squeezed 4x MoE prefill speed out of an RX 6800 XT by rewriting the matmul kernel in llama.cpp
Hey everyone,
I've been working on a fork of llama.cpp focused on making AMD GPUs first-class citizens for LLM inference. After months of profiling and kernel-level work, we just pushed v0.3.0 with some results worth sharing.
The short version: on a 35B MoE model (IQ4_XS quantized), prefill went from ~480 t/s to ~1770 t/s (roughly 3.7x) on an RX 6800 XT. Dense models stayed flat at ~480 t/s, which is expected: the optimization targets the small-matrix multiply pattern that MoE expert routing creates.
Why we did this:
Upstream llama.cpp treats AMD GPUs as "just another backend": the kernels are written for NVIDIA and ported over. Profiling showed the dequantization path leaving a lot of RDNA2's memory bandwidth on the table, and the matmul path for MoE models completely memory-bound. So we went in at the HIP level.
What we shipped:
- A BFE-based dequantization kernel for IQ4_XS that runs 13x faster in isolation (the nibble-extraction idea is sketched below)
- An async pipeline that overlaps dequant launches with compute, cutting kernel launch overhead by 31% (the stream/event pattern is sketched below)
- An experimental LDS double-buffered matmul kernel that overlaps weight loading with DP4A compute (the double-buffering structure is sketched below). This is where the 4x gain comes from. It's behind a flag because the latency variance is still too high for production use. We know why (LDS bank conflicts on symmetric tile dimensions) and we already have the fix planned.
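On the first bullet: IQ4_XS stores weights as 4-bit indices into a small non-linear codebook, so dequant is mostly "pull nibbles out of packed words and look them up." Here's a minimal HIP sketch of that extraction pattern, not our actual kernel and not the full IQ4_XS block layout (per-block scales and offsets omitted, LUT values illustrative). The point is that a constant-width shift+mask lowers to a single v_bfe_u32 on RDNA2 instead of a separate shift and mask:

```cpp
// Sketch only: nibble extraction + codebook lookup, not the real IQ4_XS layout.
#include <hip/hip_runtime.h>
#include <cstdint>

// Illustrative 16-entry non-linear codebook (iq4nl-style values).
__constant__ int8_t lut[16] = {
    -127, -104, -83, -65, -49, -35, -22, -10,
       1,   13,  25,  38,  53,  69,  89, 113};

__global__ void dequant_nibbles(const uint32_t* __restrict__ packed,
                                float*          __restrict__ out,
                                float scale, int n_words) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_words) return;

    uint32_t w = packed[i];            // one 32-bit word = 8 packed 4-bit indices
    #pragma unroll
    for (int j = 0; j < 8; ++j) {
        // Constant-width field extract: the compiler emits v_bfe_u32 here.
        uint32_t idx = (w >> (4 * j)) & 0xF;
        out[i * 8 + j] = scale * (float)lut[idx];
    }
}
```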
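On the second bullet, the building block is just HIP streams plus events: dequant chunks land on one stream, the consuming matmul on another, and a per-chunk event is the only ordering constraint. This is not the fork's actual scheduler, just the pattern it builds on; `dequant_chunk`/`matmul_chunk` are placeholders and the launch dims are arbitrary:

```cpp
#include <hip/hip_runtime.h>
#include <vector>

// Placeholder kernels standing in for the real dequant / matmul work.
__global__ void dequant_chunk(int c) { (void)c; }
__global__ void matmul_chunk(int c)  { (void)c; }

void run_pipelined(int n_chunks) {
    hipStream_t s_dequant, s_compute;
    hipStreamCreateWithFlags(&s_dequant, hipStreamNonBlocking);
    hipStreamCreateWithFlags(&s_compute, hipStreamNonBlocking);

    std::vector<hipEvent_t> done(n_chunks);
    for (auto& e : done) hipEventCreateWithFlags(&e, hipEventDisableTiming);

    dim3 grid(256), block(256);
    for (int c = 0; c < n_chunks; ++c) {
        // Dequantize chunk c on its own stream.
        dequant_chunk<<<grid, block, 0, s_dequant>>>(c);
        hipEventRecord(done[c], s_dequant);

        // The matmul for chunk c waits only on its own dequant, so the
        // dequant of chunk c+1 overlaps with the matmul of chunk c.
        hipStreamWaitEvent(s_compute, done[c], 0);
        matmul_chunk<<<grid, block, 0, s_compute>>>(c);
    }
    hipStreamSynchronize(s_compute);   // cleanup of streams/events omitted
}
```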
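And on the third bullet, the double-buffering is the standard two-LDS-buffer rotation: prefetch the next weight chunk into one buffer while the current one feeds the dot products. The sketch below is structure-only, with an illustrative chunk size, one weight row per workgroup, a scalar stand-in for the hardware dot4/DP4A instruction, and scales/reductions omitted; the kernel in the repo is considerably more involved:

```cpp
#include <hip/hip_runtime.h>
#include <cstdint>

#define CHUNK 64   // int32 words of packed int8 weights staged per iteration (illustrative)

// Scalar stand-in for a 4x int8 dot product; the real kernel uses the
// hardware dot4 (DP4A) path instead of unpacking bytes like this.
__device__ int dot4_i8(int32_t a, int32_t b, int acc) {
    #pragma unroll
    for (int k = 0; k < 4; ++k)
        acc += (int)(int8_t)((a >> (8 * k)) & 0xFF) * (int)(int8_t)((b >> (8 * k)) & 0xFF);
    return acc;
}

// One workgroup computes one output row for blockDim.x tokens (launch with
// blockDim.x >= CHUNK). While the current LDS chunk feeds the dot products,
// the next chunk is already being fetched, hiding global-memory latency.
__global__ void row_gemm_db(const int32_t* __restrict__ w_row,  // packed int8, one weight row
                            const int32_t* __restrict__ x,      // packed int8 activations [token][k]
                            int*           __restrict__ y,      // output [token]
                            int n_chunks, int k_words) {
    __shared__ int32_t lds[2][CHUNK];
    const int tok = threadIdx.x;                 // one token per thread
    int acc = 0;

    if (tok < CHUNK) lds[0][tok] = w_row[tok];   // preload chunk 0
    __syncthreads();

    for (int c = 0; c < n_chunks; ++c) {
        const int cur = c & 1, nxt = cur ^ 1;
        // Prefetch the next weight chunk before consuming the current one.
        if (c + 1 < n_chunks && tok < CHUNK)
            lds[nxt][tok] = w_row[(c + 1) * CHUNK + tok];

        // Every token reuses the same weight chunk out of LDS.
        for (int j = 0; j < CHUNK; ++j)
            acc = dot4_i8(lds[cur][j], x[tok * k_words + c * CHUNK + j], acc);

        __syncthreads();   // next chunk must be fully resident before it becomes "cur"
    }
    y[tok] = acc;
}
```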
The experimental flag is there because we believe in shipping transparently. The gain is real, the variance is real too, and we'd rather let people benchmark it themselves than pretend it's stable.
If you're running AMD hardware and want to try it, the build scripts and benchmark harness are in the repo. No CMake changes needed.
GitHub: https://github.com/Stormrage34/llama.cpp-turboquant-hip
Happy to answer questions about the kernel work, the profiling process, or why MoE models benefit so much more than dense ones.