
Got Qwen3-27B MTP running on AMD 7900 XTX at ~75 tok/s using llama.cpp
I noticed a few people are trying to run Qwen3-27B MTP on AMD GPUs and running into VRAM/OOM issues, so I wanted to share what worked for me.
I’m running it on a 7900 XTX and I’m getting around 75 tokens/s, which I’m very happy with.
The quant I used is this one:
https://huggingface.co/froggeric/Qwen3.6-27B-MTP-GGUF
in the Q4_K_XL flavour (edit: actually Q4_K_M), and I used the llama.cpp branch indicated in that repo.
My setup:
- Windows 10
- AMD Radeon 7900 XTX
- Latest AMD drivers
- Latest Vulkan SDK
- Visual Studio 2026
- Built llama.cpp from source (rough build sketch below)
- Launched the model immediately after compiling
Nothing fancy on the system side.
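For the build itself, this is roughly what it looks like (a sketch, not my exact commands: the branch name is a placeholder for whichever branch the model repo tells you to use, and GGML_VULKAN=ON is the standard flag for llama.cpp's Vulkan backend):

```
# clone llama.cpp and switch to the branch linked in the model repo (placeholder name)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git checkout <branch-from-model-repo>

# configure with the Vulkan backend (needs the Vulkan SDK installed)
cmake -B build -DGGML_VULKAN=ON

# build the release binaries (llama-cli, llama-server, ...)
cmake --build build --config Release
```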
The important part seems to be using the right GGUF quant and the correct llama.cpp branch linked by the model author. With this setup I was able to run the model without the immediate OOM problems that others were seeing.
For reference, someone in the Qwen subreddit mentioned that they could barely get a 27B Q3 running on headless Debian with 32k context and Q4_0 KV cache, and that it would often OOM on the first message. On my Windows + Vulkan setup, this quant worked much better.
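To actually launch it, something along these lines is the idea (the GGUF filename here is a placeholder for whatever file you download from the repo above, -ngl 99 just means "offload all layers to the GPU", and the binary path may differ depending on how you built):

```
# serve the model on localhost with a 32k context, fully offloaded to the GPU
./build/bin/Release/llama-server.exe -m Qwen3.6-27B-MTP-Q4_K_M.gguf -ngl 99 -c 32768 --host 127.0.0.1 --port 8080
```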
I also used ChatGPT to help me through the compile/setup steps; here’s the chat link:
https://chatgpt.com/share/69fd7345-b24-8396-8e54-d769d0e615d
Sorry, the chat is in Italian and I don't have time to write a proper post right now, but maybe this is enough to get some people through. I also didn't try max context; maybe I'll try this evening. I'm sure 56k is doable with a q8/q8 KV cache, and I think close to 100k should be achievable with some tinkering. Cheers.
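For anyone who wants to try the bigger context before I do: the q8/q8 part refers to llama.cpp's KV cache type flags (-ctk/-ctv), roughly like this (untested at these sizes on my card, and a quantized V cache usually needs flash attention, whose exact -fa syntax varies between builds):

```
# ~56k context with a q8_0 KV cache; bump -c further and see where it OOMs
./build/bin/Release/llama-server.exe -m Qwen3.6-27B-MTP-Q4_K_M.gguf -ngl 99 -c 57344 -ctk q8_0 -ctv q8_0 -fa
```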
EDIT: I know this is r/ROCm and I used Vulkan instead, lol, but I think this was the most appropriate place to post given the userbase of this sub.