u/CryptoStef33

We squeezed 4x MoE prefill speed out of an RX 6800 XT by rewriting the matmul kernel in llama.cpp
▲ 45 r/ROCm


Hey everyone,

I've been working on a fork of llama.cpp focused on making AMD GPUs first-class citizens for LLM inference. After months of profiling and kernel-level work, we just pushed v0.3.0 with some results worth sharing.

The short version: on a 35B MoE model (IQ4_XS quantized), prefill went from ~480 t/s to 1770 t/s on an RX 6800 XT. Dense models stayed flat at 480 t/s, which is expected since the optimization targets the small-matrix multiply pattern that MoE routing creates.

Why we did this:

Upstream llama.cpp treats AMD GPUs as "just another backend": the kernels are written for NVIDIA and ported over. Profiling showed the dequantization path leaving a large share of RDNA2's memory bandwidth unused, and the matmul path for MoE models completely memory-bound. So we went in at the HIP level.

What we shipped:

- A BFE-based (bit-field extract) dequantization kernel for IQ4_XS that runs 13x faster in isolation (the unpack pattern is sketched below, after this list)

- An async pipeline that overlaps dequant launches with compute, cutting kernel launch overhead by 31% (the two-stream shape is also sketched below)

- An experimental LDS double-buffered matmul kernel that overlaps weight loading with DP4A compute. This is where the 4x gain comes from. It's behind a flag because the latency variance is still too high for production use. We know why (LDS bank conflicts on symmetric tile dimensions) and we already have the fix planned. The buffering structure is sketched a bit further down.
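To make the BFE bullet concrete: BFE is the bit-field-extract instruction (v_bfe_u32 on RDNA2), and the core of the dequant change is unpacking 4-bit indices with a shift+mask pattern the compiler can collapse into a single extract per nibble, then mapping them through the non-linear codebook. Here's a minimal sketch of just that pattern, with a simplified block layout and made-up names rather than the real IQ4_XS struct (which packs 256 weights per super-block with sub-block scales):

```cpp
// Minimal sketch of the shift+mask (BFE-style) unpack, NOT the fork's kernel:
// block layout and names are simplified stand-ins for the real IQ4_XS struct.
#include <hip/hip_runtime.h>
#include <cstdint>

// Hypothetical block: one scale + 32 packed 4-bit codebook indices.
struct block_q4_sketch {
    float    scale;
    uint32_t qs[4];          // 4 x 32 bits = 32 nibbles
};

// Non-linear 4-bit codebook in the spirit of llama.cpp's kvalues_iq4nl.
__constant__ int8_t kvalues[16] = {
    -127, -104, -83, -65, -49, -35, -22, -10,
       1,   13,  25,  38,  53,  69,  89, 113
};

__global__ void dequant_q4_sketch(const block_q4_sketch *blocks,
                                  float *out, int n_blocks) {
    int b = blockIdx.x * blockDim.x + threadIdx.x;
    if (b >= n_blocks) return;

    const block_q4_sketch blk = blocks[b];
    float *dst = out + b * 32;

    #pragma unroll
    for (int w = 0; w < 4; ++w) {
        uint32_t packed = blk.qs[w];
        #pragma unroll
        for (int i = 0; i < 8; ++i) {
            // shift+mask nibble extract; the compiler can lower this to a
            // single v_bfe_u32 per index on RDNA2
            uint32_t idx = (packed >> (4 * i)) & 0xF;
            dst[w * 8 + i] = blk.scale * (float)kvalues[idx];
        }
    }
}
```

The real format also packs its sub-block scales, which this sketch skips entirely.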
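And for the async pipeline bullet: nothing exotic, just two HIP streams plus events so the dequant launches for chunk i+1 are already queued while the compute stream is still chewing on chunk i. Kernel and buffer names below are placeholders (this is not the fork's scheduling code), but the ping-pong-with-events shape is the relevant part:

```cpp
// Two-stream overlap sketch with placeholder kernels (not the fork's actual
// pipeline code): while the compute stream consumes chunk i, the dequant for
// chunk i+1 is already queued on its own stream, so launches are not
// serialized against the matmul.
#include <hip/hip_runtime.h>
#include <cstdint>

// Toy stand-ins for the real dequant / matmul kernels.
__global__ void dequant_chunk(const uint8_t *src, float *dst, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = (float)((src[i >> 1] >> ((i & 1) * 4)) & 0xF);
}
__global__ void matmul_chunk(const float *w, const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = w[i] * x[i];   // placeholder for the real compute
}

// deq[0]/deq[1] are two preallocated device buffers of `chunk` floats;
// chunk is assumed to be a multiple of 256, weights 4-bit packed.
void run_pipeline(const uint8_t *qweights, float *deq[2], const float *acts,
                  float *out, int n_chunks, int chunk) {
    hipStream_t s_dequant, s_compute;
    hipEvent_t ready[2], consumed[2];
    hipStreamCreate(&s_dequant);
    hipStreamCreate(&s_compute);
    for (int b = 0; b < 2; ++b) {
        hipEventCreate(&ready[b]);
        hipEventCreate(&consumed[b]);
    }

    for (int i = 0; i < n_chunks; ++i) {
        int buf = i & 1;          // ping-pong between the two dequant buffers
        if (i >= 2)               // don't overwrite a buffer the compute stream
            hipStreamWaitEvent(s_dequant, consumed[buf], 0);   // may still read

        hipLaunchKernelGGL(dequant_chunk, dim3(chunk / 256), dim3(256), 0,
                           s_dequant, qweights + (size_t)i * chunk / 2,
                           deq[buf], chunk);
        hipEventRecord(ready[buf], s_dequant);

        // the compute stream only waits for *this* chunk's dequant; the next
        // chunk's dequant overlaps with this matmul
        hipStreamWaitEvent(s_compute, ready[buf], 0);
        hipLaunchKernelGGL(matmul_chunk, dim3(chunk / 256), dim3(256), 0,
                           s_compute, deq[buf], acts,
                           out + (size_t)i * chunk, chunk);
        hipEventRecord(consumed[buf], s_compute);
    }
    hipStreamSynchronize(s_compute);

    for (int b = 0; b < 2; ++b) {
        hipEventDestroy(ready[b]);
        hipEventDestroy(consumed[b]);
    }
    hipStreamDestroy(s_dequant);
    hipStreamDestroy(s_compute);
}
```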

The experimental flag is there because we believe in shipping transparently. The gain is real, the variance is real too, and we'd rather let people benchmark it themselves than pretend it's stable.
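For anyone who wants to see what "LDS double buffering" actually means before flipping the flag: the kernel keeps two tile buffers in LDS and prefetches the next tile into the idle one while the dot-product loop drains the other. Below is a structure-only sketch with made-up tile sizes; it stages the activation chunk rather than the weight tiles the experimental kernel double-buffers, and it has none of the bank-conflict handling, so treat it as an illustration of the ping-pong, not the kernel itself:

```cpp
// Structure-only sketch of LDS double buffering, NOT the experimental kernel:
// two LDS buffers ping-pong, the idle one is filled while the other feeds the
// dot-product loop. Here the shared *activation* chunk is what gets staged;
// the fork double-buffers weight tiles, but the overlap shape is the same.
// Assumes blockDim.x == TILE_N == TILE_K and K a multiple of TILE_K.
#include <hip/hip_runtime.h>
#include <cstdint>

#define TILE_K 64   // elements of x staged per step (made-up size)
#define TILE_N 64   // output columns per workgroup (made-up size)

// 4-wide int8 dot product; a tuned kernel would issue the DP4A-class
// instruction (v_dot4_i32_i8 on RDNA2) here instead of scalar math.
__device__ int dot4(const int8_t *a, const int8_t *b) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

// y[col] = sum_k W[col*K + k] * x[k], one thread per output column.
__global__ void matvec_dbuf_sketch(const int8_t *W, const int8_t *x,
                                   int32_t *y, int K) {
    int col = blockIdx.x * TILE_N + threadIdx.x;

    __shared__ int8_t x_lds[2][TILE_K];      // the two ping-pong buffers

    x_lds[0][threadIdx.x] = x[threadIdx.x];  // cooperative preload of chunk 0
    __syncthreads();

    int32_t acc = 0;
    int buf = 0;
    for (int k0 = 0; k0 < K; k0 += TILE_K) {
        int next = buf ^ 1;
        // prefetch the next chunk into the idle buffer while we compute
        if (k0 + TILE_K < K)
            x_lds[next][threadIdx.x] = x[k0 + TILE_K + threadIdx.x];

        // drain the current buffer with 4-wide dot products
        for (int k = 0; k < TILE_K; k += 4)
            acc += dot4(&W[(size_t)col * K + k0 + k], &x_lds[buf][k]);

        __syncthreads();   // prefetch must be visible before switching buffers
        buf = next;
    }
    y[col] = acc;
}
```

For what it's worth, padding the leading LDS dimension is the textbook workaround for bank conflicts on power-of-two tile shapes.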

If you're running AMD hardware and want to try it, the build scripts and benchmark harness are in the repo. No CMake changes needed.

GitHub: https://github.com/Stormrage34/llama.cpp-turboquant-hip

Happy to answer questions about the kernel work, the profiling process, or why MoE models benefit so much more than dense ones.

u/CryptoStef33 — 2 days ago
▲ 22 r/ROCm

Managed to get 40 t/s on Qwen 27B (MTP) with an RX 6800 XT - Sharing my optimized fork

Hey everyone,

I’m pretty new to the ROCm scene, but I’ve been spending a lot of time lately trying to push the limits of my RX 6800 XT. I’ve been using Gemini to help me navigate the more technical C++ side of things and to troubleshoot some of the common memory issues we run into on Team Red.

After a lot of trial and error, I’ve put together a fork of llama.cpp that integrates TurboQuant and stabilizes Multi-Token Prediction (MTP) specifically for HIP/ROCm.

With this setup, I'm hitting about 40 t/s during generation on Qwen 2.5 27B (IQ4_XS) with a 32k context, and for a 16GB card I'm really happy with the stability. I had to fix some bugs in the graph logic that were causing double-free crashes when VRAM got near its limit at high context, and I've tuned the batch settings to play nicer with RDNA 2.
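If anyone is hunting similar crashes in their own builds: the double-free pattern near the VRAM limit is typically an error path releasing a buffer that the normal teardown then releases again. The snippet below is just a made-up minimal illustration of the guard pattern, not the actual code from my fork:

```cpp
// Illustrative only (made-up names, not the actual diff): the classic
// double-free shape, where an error path near the VRAM limit releases a
// buffer that the normal teardown then releases again.
#include <hip/hip_runtime.h>
#include <cstdio>

struct graph_buffers {
    void *scratch = nullptr;
};

static void release(graph_buffers &g) {
    if (g.scratch) {
        hipFree(g.scratch);
        g.scratch = nullptr;   // guard: a second release() becomes a no-op
    }
}

static bool alloc_scratch(graph_buffers &g, size_t bytes) {
    if (hipMalloc(&g.scratch, bytes) != hipSuccess) {
        g.scratch = nullptr;   // don't keep a dangling pointer around
        return false;
    }
    return true;
}

int main() {
    graph_buffers g;
    if (!alloc_scratch(g, (size_t)1 << 60)) {  // huge request forces failure
        release(g);                            // error-path cleanup
    }
    release(g);   // normal teardown: safe even after the error path ran
    return 0;
}
```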

If anyone else is running an AMD card and wants to try it out, I’ve uploaded the code and a basic build guide here: https://github.com/Stormrage34/llama.cpp-turboquant-hip

It's still a work in progress, but the performance boost over the standard implementation was significant enough that I thought it was worth sharing with the community. Let me know if you run into any issues or if you have suggestions for further AMD-specific optimizations.

u/CryptoStef33 — 4 days ago