


AMD RX 7900 XTX + ROCm + Gemma 4 26B — here's what actually worked for me
Recent AMD/ROCm updates finally made local AI inference stable and I couldn't be happier.
Back in early 2025, I was running Mistral 7B through a custom CUDA-to-HIP converter I built myself, just to get it working on AMD. Now it runs natively, without any of that. What a difference.
The system choice was intentional — RX 7900 XTX + Ryzen 9, partly for the price, but mainly because AMD's FP throughput and memory characteristics worked better for my specific workload. Some parts of my experimental pipeline were unstable on NVIDIA for reasons I still need to investigate.
Context length is still the limiting factor on a single local machine. My plan is to keep the core logic local and connect to a server for heavier lifting. The biggest win is keeping my AI in a safe place — protected from model updates and external changes.
One thing I'd like to see: better quantization support in vLLM. I understand it's server-oriented by design, but native quantization support for consumer GPUs would go a long way.
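For what it's worth, vLLM already exposes a --quantization flag for pre-quantized AWQ/GPTQ-style checkpoints; the gap is how much of that actually has kernels on a consumer ROCm card like gfx1100, which I haven't fully mapped out. A rough sketch of what I mean (the model name is a placeholder, not something I've tested):

# Hypothetical example: vLLM can serve pre-quantized AWQ/GPTQ checkpoints,
# but ROCm kernel coverage for these on consumer GPUs is the part I'd like to see improve
vllm serve your-org/your-model-AWQ \
  --quantization awq \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90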
Setup
- GPU: AMD Radeon RX 7900 XTX (24GB / gfx1100)
- CPU: AMD Ryzen 9 9950X3D
- OS: Ubuntu 24.04.2 LTS
- ROCm: 7.2.3
- Stack: llama.cpp (GGML_HIP=ON) + vLLM (ROCm)
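Before building anything, it's worth confirming ROCm actually sees the card as gfx1100. This is a sanity check I'd suggest rather than part of my original notes:

# The dGPU should show up as a gfx1100 agent; note whether the iGPU
# appears as a second agent, since that matters for HIP_VISIBLE_DEVICES later
rocminfo | grep -E "gfx"
rocm-smi --showproductname --showmeminfo vram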
Benchmark Results
- Gemma 4 26B A4B — llama.cpp (HIP) Q4_K_M — PP: ~3355 t/s / TG: ~102 t/s
- Qwen2.5-7B — vLLM (ROCm) FP16 — PP: ~3410 t/s / TG: ~56 t/s
- Gemma 2 9B — llama.cpp (HIP) Q4_K_M — PP: ~2773 t/s / TG: ~79 t/s
PP = Prompt Processing (prefill), TG = Token Generation (decode)
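For anyone wanting to reproduce numbers like these, llama.cpp ships a llama-bench tool; -p measures prefill and -n measures decode. I'm not claiming these are the exact flags behind the table above, just the standard way to get PP/TG figures:

# Model path and token counts below are placeholders
HIP_VISIBLE_DEVICES=0 ./build/bin/llama-bench \
  -m /workspace/your-model.gguf \
  -ngl 99 \
  -p 512 \
  -n 128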
The critical flag for llama.cpp
Building without -DGGML_HIP=ON compiles fine but silently falls back to CPU. No warning.
cmake -B build \
-DGGML_HIP=ON \
-DAMDGPU_TARGETS="gfx1100" \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_C_COMPILER=/opt/rocm/bin/hipcc \
-DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
-DCMAKE_PREFIX_PATH=/opt/rocm-7.2.3
cmake --build build --config Release -j$(nproc)
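Since the CPU fallback is silent, one quick way to confirm the HIP build actually took is to load a model with -ngl 99 and watch VRAM from another terminal. If rocm-smi shows usage jump by roughly the model size, the GPU backend is doing the work. Just a suggested check, not part of the original build steps:

# Terminal 1: load a model fully onto the GPU
HIP_VISIBLE_DEVICES=0 ./build/bin/llama-cli -m /workspace/your-model.gguf -ngl 99 -p "hello" -n 16

# Terminal 2: watch VRAM usage
watch -n 1 rocm-smi --showmeminfo vram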
Docker setup
docker run -it \
--device=/dev/kfd \
--device=/dev/dri/card0 \
--device=/dev/dri/renderD128 \
--group-add video \
-v /your/model/path:/workspace \
rocm/pytorch:latest bash
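One note: the lazy startup script further down expects a container named gemma2-vllm, and the command above doesn't set a name. If you want that script to work as written, something like this when you first create the container should do it (the name itself is arbitrary, it just has to match the script):

docker run -it \
  --name gemma2-vllm \
  --device=/dev/kfd \
  --device=/dev/dri/card0 \
  --device=/dev/dri/renderD128 \
  --group-add video \
  -v /your/model/path:/workspace \
  rocm/pytorch:latest bash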
Running
HIP_VISIBLE_DEVICES=0 ./build/bin/llama-server \
-m /workspace/your-model.gguf \
-ngl 99 \
--host 0.0.0.0 \
--port 8000
- HIP_VISIBLE_DEVICES=0: stops ROCm from picking up the CPU iGPU as a second device
- -ngl 99: loads all layers to GPU. Without this, it runs on CPU regardless of build
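Once it's up, llama-server speaks an OpenAI-compatible API, so a quick smoke test looks like this. Run it from inside the container, or from the host if you also publish the port (e.g. add -p 8000:8000 to the docker run):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello in five words."}], "max_tokens": 32}'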
Lazy startup script
Got tired of typing the same commands every time:
#!/bin/bash
docker start gemma2-vllm
docker exec -it gemma2-vllm bash -c "
cd /workspace/llama.cpp && \
HIP_VISIBLE_DEVICES=0 ./build/bin/llama-server \
-m /workspace/your-model.gguf \
-ngl 99 \
--host 0.0.0.0 \
--port 8000
"
Save as start_model.sh, chmod +x, done.
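If something else needs to start only after the model has finished loading, llama-server also exposes a /health endpoint you can poll from another shell. A small helper I'd consider (same port assumption as above, and it needs the port reachable from wherever you run it):

#!/bin/bash
# Block until llama-server reports ready (/health returns 200 once the model is loaded)
until curl -sf http://localhost:8000/health > /dev/null; do
  sleep 2
done
echo "llama-server is up"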
Model
Quantized Gemma 4 26B A4B on this setup — original 48GB → 16GB Q4_K_M.
https://huggingface.co/rakisis-core/Gemma-4-26B-A4B-Q4K_M-GGUF
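If you'd rather build a quant like this yourself from the original weights, the usual llama.cpp route is the HF-to-GGUF converter followed by llama-quantize. Paths below are placeholders, not the exact commands behind my upload:

# Convert the original HF checkpoint to an FP16 GGUF, then quantize to Q4_K_M
python convert_hf_to_gguf.py /path/to/original-model --outfile model-f16.gguf --outtype f16
./build/bin/llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M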
---
**Full setup, scripts & guides:**
https://github.com/xinkanglabs/rocm-local-ai-stack
---
— XinXin-Kang / Xinkang Labs 🌐 xinkanglabs.com.au