
Got Qwen3-27B MTP running on AMD 7900 XTX at ~75 tok/s using llama.cpp
I noticed a few people are trying to run Qwen3-27B MTP on AMD GPUs and running into VRAM/OOM issues, so I wanted to share what worked for me.
I’m running it on a 7900 XTX and I’m getting around 75 tokens/s, which I’m very happy with.
The quant I used is this one:
https://huggingface.co/froggeric/Qwen3.6-27B-MTP-GGUF
in the Q4_K_XL flavour (edit: actually Q4_K_M), and I used the llama.cpp branch indicated in that repo.
My setup:
- Windows 10
- AMD Radeon 7900 XTX
- Latest AMD drivers
- Latest Vulkan SDK
- Visual Studio 2026
- Built llama.cpp from source (rough build sketch below)
- Launched the model immediately after compiling
Nothing fancy on the system side.
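For the build itself, this is roughly what it looks like (a sketch, not my exact commands: the branch name is a placeholder for whichever branch the model repo tells you to use, and GGML_VULKAN=ON is the standard flag for llama.cpp's Vulkan backend):

```
# clone llama.cpp and switch to the branch linked in the model repo (placeholder name)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git checkout <branch-from-model-repo>

# configure with the Vulkan backend (needs the Vulkan SDK installed)
cmake -B build -DGGML_VULKAN=ON

# build the release binaries (llama-cli, llama-server, ...)
cmake --build build --config Release
```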
The important part seems to be using the right GGUF quant and the correct llama.cpp branch linked by the model author. With this setup I was able to run the model without the immediate OOM problems that others were seeing.
For reference, someone in the Qwen subreddit mentioned that they could barely get a 27B Q3 running on headless Debian with 32k context and Q4_0 KV cache, and that it would often OOM on the first message. On my Windows + Vulkan setup, this quant worked much better.
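To actually launch it, something along these lines is the idea (the GGUF filename here is a placeholder for whatever file you download from the repo above, -ngl 99 just means "offload all layers to the GPU", and the binary path may differ depending on how you built):

```
# serve the model on localhost with a 32k context, fully offloaded to the GPU
./build/bin/Release/llama-server.exe -m Qwen3.6-27B-MTP-Q4_K_M.gguf -ngl 99 -c 32768 --host 127.0.0.1 --port 8080
```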
I also used ChatGPT to help me through the compile/setup steps; here’s the chat link:
https://chatgpt.com/share/69fd7345-b24-8396-8e54-d769d0e615d
Sorry, the chat is in Italian and I don't have time to write a proper post right now, but maybe this is enough to get some people through. I also didn't try max context; maybe I'll try this evening. I'm sure 56k is doable with a q8/q8 KV cache, and I think close to 100k should be achievable with some tinkering. Cheers.
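For anyone who wants to try the bigger context before I do: the q8/q8 part refers to llama.cpp's KV cache type flags (-ctk/-ctv), roughly like this (untested at these sizes on my card, and a quantized V cache usually needs flash attention, whose exact -fa syntax varies between builds):

```
# ~56k context with a q8_0 KV cache; bump -c further and see where it OOMs
./build/bin/Release/llama-server.exe -m Qwen3.6-27B-MTP-Q4_K_M.gguf -ngl 99 -c 57344 -ctk q8_0 -ctv q8_0 -fa
```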
EDIT: I know this is r/ROCm and I used Vulkan instead, lol, but I think this was the most appropriate place to post given the userbase of this sub.