u/Ginmarr

Setup: llama.cpp llama-bench, -fa 1 -ngl 99 -ctk q8_0 -ctv q8_0 -p 512,2048 -n 128,256 -r 3, 300 W power cap on both cards. Models are unsloth GGUFs (UD-IQ4_XS / UD-Q4_K_XL); gpt-oss-20b is the ggml-org native MXFP4. R9700 = RDNA4/gfx1201, 7900 XTX = RDNA3/gfx1100. R9700 runs measured one day earlier, identical config.
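For anyone who wants to reproduce this, here is a rough sketch of how the flags above would assemble into a single llama-bench call. The model path is a made-up placeholder (the post doesn't list file names), and the `-m` flag is my addition for pointing at a model. How the 300 W cap was applied and which backend the binary was built with are not covered here.

```python
import shlex

# Hypothetical placeholder; the post does not give model file names.
MODEL = "models/example.gguf"

def bench_cmd(model: str) -> list[str]:
    """Build the llama-bench argument list from the flags quoted in the setup."""
    return [
        "llama-bench",
        "-m", model,          # model file (assumed, not quoted in the post)
        "-fa", "1",           # flash attention on
        "-ngl", "99",         # offload all layers to the GPU
        "-ctk", "q8_0",       # KV cache type for K
        "-ctv", "q8_0",       # KV cache type for V
        "-p", "512,2048",     # prompt-processing (prefill) sizes
        "-n", "128,256",      # token-generation lengths
        "-r", "3",            # repetitions per test
    ]

# Print a copy-pasteable shell command.
print(shlex.join(bench_cmd(MODEL)))
```

Run it once per model and per backend build to get the same pp512/pp2048 and tg128/tg256 rows.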

Takeaways:

- 7900 XTX beats the R9700 by 24–29% on token-gen across the whole slate; the gap comes down to memory bandwidth (384-bit vs 256-bit bus).

- Vulkan > ROCm for token-gen on both architectures, and the gap is huge on MoE models (XTX: +33–64%).

- Prefill flips it: ROCm pp2048 is ~8–17% faster on dense models (e.g. Qwen-27B IQ4: ROCm 1022 vs Vulkan 870 t/s).
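In case anyone wants to check the percentages, here is a small sketch of the arithmetic, using only the numbers quoted above. The helper name `pct_faster` is just something I made up for illustration.

```python
def pct_faster(a: float, b: float) -> float:
    """Percent by which throughput a exceeds throughput b."""
    return (a / b - 1.0) * 100.0

# Qwen-27B IQ4 pp2048 figures from the post, in tokens/s.
rocm_pp2048 = 1022.0
vulkan_pp2048 = 870.0
print(f"ROCm vs Vulkan pp2048: +{pct_faster(rocm_pp2048, vulkan_pp2048):.1f}%")  # +17.5%

# Bus width alone, 384-bit vs 256-bit.
print(f"Bus width 384 vs 256 bit: +{pct_faster(384, 256):.0f}%")  # +50%
```

The Qwen-27B example lands at the top of the ~8–17% range. Bus width by itself gives +50%, which is more than the 24–29% token-gen gap, so width is only part of the picture: actual bandwidth also depends on memory clock, which isn't stated here.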

greetings Ginmarr