


AMD RX 7900 XTX + ROCm + Gemma 4 26B — here's what actually worked for me
Recent AMD/ROCm updates finally made local AI inference stable and I couldn't be happier.
Back in early 2025, I was running Mistral 7B through a custom CUDA-to-HIP converter I built myself, just to get it working on AMD. Now it runs natively, without any of that. What a difference.
The system choice was intentional — RX 7900 XTX + Ryzen 9, partly for the price, but mainly because AMD's FP throughput and memory characteristics worked better for my specific workload. Some parts of my experimental pipeline were unstable on NVIDIA for reasons I still need to investigate.
Context length is still the limiting factor on a single local machine. My plan is to keep the core logic local and connect to a server for heavier lifting. The biggest win is keeping my AI in a safe place — protected from model updates and external changes.
One thing I'd like to see: better quantization support in vLLM. I understand it's server-oriented by design, but native quantization support for consumer GPUs would go a long way.
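For what it's worth, vLLM already exposes a --quantization flag for pre-quantized AWQ/GPTQ-style checkpoints; the gap is how much of that actually has kernels on a consumer ROCm card like gfx1100, which I haven't fully mapped out. A rough sketch of what I mean (the model name is a placeholder, not something I've tested):

# Hypothetical example: vLLM can serve pre-quantized AWQ/GPTQ checkpoints,
# but ROCm kernel coverage for these on consumer GPUs is the part I'd like to see improve
vllm serve your-org/your-model-AWQ \
  --quantization awq \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90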
Setup
- GPU: AMD Radeon RX 7900 XTX (24GB / gfx1100)
- CPU: AMD Ryzen 9 9950X3D
- OS: Ubuntu 24.04.2 LTS
- ROCm: 7.2.3
- Stack: llama.cpp (GGML_HIP=ON) + vLLM (ROCm)
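Before building anything, it's worth confirming ROCm actually sees the card as gfx1100. This is a sanity check I'd suggest rather than part of my original notes:

# The dGPU should show up as a gfx1100 agent; note whether the iGPU
# appears as a second agent, since that matters for HIP_VISIBLE_DEVICES later
rocminfo | grep -E "gfx"
rocm-smi --showproductname --showmeminfo vram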
Benchmark Results
- Gemma 4 26B A4B — llama.cpp (HIP) Q4_K_M — PP: ~3355 t/s / TG: ~102 t/s
- Qwen2.5-7B — vLLM (ROCm) FP16 — PP: ~3410 t/s / TG: ~56 t/s
- Gemma 2 9B — llama.cpp (HIP) Q4_K_M — PP: ~2773 t/s / TG: ~79 t/s
PP = Prompt Processing (prefill), TG = Token Generation (decode)
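For anyone wanting to reproduce numbers like these, llama.cpp ships a llama-bench tool; -p measures prefill and -n measures decode. I'm not claiming these are the exact flags behind the table above, just the standard way to get PP/TG figures:

# Model path and token counts below are placeholders
HIP_VISIBLE_DEVICES=0 ./build/bin/llama-bench \
  -m /workspace/your-model.gguf \
  -ngl 99 \
  -p 512 \
  -n 128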
The critical flag for llama.cpp
Building without -DGGML_HIP=ON compiles fine but silently falls back to CPU. No warning.
cmake -B build \
-DGGML_HIP=ON \
-DAMDGPU_TARGETS="gfx1100" \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_C_COMPILER=/opt/rocm/bin/hipcc \
-DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
-DCMAKE_PREFIX_PATH=/opt/rocm-7.2.3
cmake --build build --config Release -j$(nproc)
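Since the CPU fallback is silent, one quick way to confirm the HIP build actually took is to load a model with -ngl 99 and watch VRAM from another terminal. If rocm-smi shows usage jump by roughly the model size, the GPU backend is doing the work. Just a suggested check, not part of the original build steps:

# Terminal 1: load a model fully onto the GPU
HIP_VISIBLE_DEVICES=0 ./build/bin/llama-cli -m /workspace/your-model.gguf -ngl 99 -p "hello" -n 16

# Terminal 2: watch VRAM usage
watch -n 1 rocm-smi --showmeminfo vram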
Docker setup
docker run -it \
--device=/dev/kfd \
--device=/dev/dri/card0 \
--device=/dev/dri/renderD128 \
--group-add video \
-v /your/model/path:/workspace \
rocm/pytorch:latest bash
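One note: the lazy startup script further down expects a container named gemma2-vllm, and the command above doesn't set a name. If you want that script to work as written, something like this when you first create the container should do it (the name itself is arbitrary, it just has to match the script):

docker run -it \
  --name gemma2-vllm \
  --device=/dev/kfd \
  --device=/dev/dri/card0 \
  --device=/dev/dri/renderD128 \
  --group-add video \
  -v /your/model/path:/workspace \
  rocm/pytorch:latest bash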
Running
HIP_VISIBLE_DEVICES=0 ./build/bin/llama-server \
-m /workspace/your-model.gguf \
-ngl 99 \
--host 0.0.0.0 \
--port 8000
- HIP_VISIBLE_DEVICES=0: stops ROCm from picking up the CPU iGPU as a second device
- -ngl 99: loads all layers to GPU. Without this, it runs on CPU regardless of build
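Once it's up, llama-server speaks an OpenAI-compatible API, so a quick smoke test looks like this. Run it from inside the container, or from the host if you also publish the port (e.g. add -p 8000:8000 to the docker run):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello in five words."}], "max_tokens": 32}'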
Lazy startup script
Got tired of typing the same commands every time:
#!/bin/bash
docker start gemma2-vllm
docker exec -it gemma2-vllm bash -c "
cd /workspace/llama.cpp && \
HIP_VISIBLE_DEVICES=0 ./build/bin/llama-server \
-m /workspace/your-model.gguf \
-ngl 99 \
--host 0.0.0.0 \
--port 8000
"
Save as start_model.sh, chmod +x, done.
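If something else needs to start only after the model has finished loading, llama-server also exposes a /health endpoint you can poll from another shell. A small helper I'd consider (same port assumption as above, and it needs the port reachable from wherever you run it):

#!/bin/bash
# Block until llama-server reports ready (/health returns 200 once the model is loaded)
until curl -sf http://localhost:8000/health > /dev/null; do
  sleep 2
done
echo "llama-server is up"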
Model
Quantized Gemma 4 26B A4B on this setup — original 48GB → 16GB Q4_K_M.
https://huggingface.co/rakisis-core/Gemma-4-26B-A4B-Q4K_M-GGUF
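If you'd rather build a quant like this yourself from the original weights, the usual llama.cpp route is the HF-to-GGUF converter followed by llama-quantize. Paths below are placeholders, not the exact commands behind my upload:

# Convert the original HF checkpoint to an FP16 GGUF, then quantize to Q4_K_M
python convert_hf_to_gguf.py /path/to/original-model --outfile model-f16.gguf --outtype f16
./build/bin/llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M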
---
**Full setup, scripts & guides:**
https://github.com/xinkanglabs/rocm-local-ai-stack
---
— XinXin-Kang / Xinkang Labs 🌐 xinkanglabs.com.au