r/ROCm

▲ 76 r/ROCm+17 crossposts

Built an open-source one-prompt-to-cinematic-reel pipeline on a single GPU — FLUX.2 [klein] for character keyframes, Wan2.2-I2V for animation, vision critic with auto-retry, music + 9-language narration in the same pipeline

Shipped this for the AMD x lablab hackathon. The attached video is one of the actual reels the pipeline produced: one English sentence in, a finished mp4 with characters, story, music, and voice-over out. ~45 minutes end-to-end on a single AMD Instinct MI300X. Every model is Apache 2.0 or MIT.

Pipeline (8 stages, all sequential on the same GPU):

  1. Director Agent - Qwen3.5-35B-A3B (vLLM + AITER MoE) plans 6 shots from one sentence, returns structured JSON with character bibles, shot prompts, music brief, per-shot voice-over script, narration language
  2. Character masters - FLUX.2 [klein] paints one canonical portrait per character. No LoRA training step - reference editing pins identity across shots by construction
  3. Per-shot keyframes - FLUX.2 again with reference image. Sub-second per keyframe after warmup
  4. Animation - Wan2.2-I2V-A14B, 81 frames @ 16 fps native. FLF2V for cut:false continuation arcs (last frame of shot N anchors first frame of shot N+1)
  5. Vision critic - same Qwen3.5-35B reloaded with 10 structured failure labels (character drift, extras invade frame, camera ignored, walking backwards, object morphing, hand/finger artifact, wardrobe drift, neon glow leak, stylized AI look, random intimacy). Bad clips re-render with targeted retry strategies (different seed, FLF2V anchor, prompt simplification)
  6. Music - ACE-Step v1 generates a 30s instrumental from Director's brief
  7. Narration - Kokoro-82M, 9 languages. Director picks language to match setting (Tokyo→Japanese, Paris→French, Mumbai→Hindi)
  8. Mix - ffmpeg, with each shot's voice-over aligned via adelay (sketch below)
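To make the mix stage concrete, here's a minimal sketch of what stage 8 boils down to: delay each shot's voice-over to its start offset with adelay, then mix everything over the music bed. Filenames, offsets, and the number of shots are placeholders, not the repo's actual values.

import subprocess

# Sketch of the stage-8 mix. All filenames and offsets below are placeholders.
shots_ms = [0, 8000, 16000]  # hypothetical start offsets for three voice-over clips
inputs = ["-i", "reel_video.mp4", "-i", "music.wav"]
filters, labels = [], []
for i, start in enumerate(shots_ms):
    inputs += ["-i", f"vo_shot{i}.wav"]
    filters.append(f"[{i + 2}:a]adelay={start}:all=1[vo{i}]")  # shift VO to its shot's start
    labels.append(f"[vo{i}]")
filters.append(f"[1:a]{''.join(labels)}amix=inputs={len(labels) + 1}:normalize=0[aout]")
cmd = ["ffmpeg", "-y", *inputs,
       "-filter_complex", ";".join(filters),
       "-map", "0:v", "-map", "[aout]", "-c:v", "copy", "final_reel.mp4"]
subprocess.run(cmd, check=True)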

Wan 2.2 specifics (the bit this sub will care about):

  • 1280×720, not 640×640 default. Costs more but matches what producers want
  • 121 frames at 24 fps was my first attempt - gave temporal rippling. Switched to 81 @ 16 fps native (the distribution Wan was trained on) and it cleaned up
  • flow_shift = 5 for hero shots, 8 for b-roll (upstream wan_i2v_A14B.py defaults)
  • Negative prompt: the verbatim Chinese training negative from shared_config.py. umT5 was pretrained multilingually against those exact tokens; the English translation is observably weaker
  • Camera language: ONE camera verb per shot, sentence-case, placed first ("Tracking shot following from behind"). Multiple verbs in one prompt cancel each other out
  • Avoid the word "cinematic" - triggers Wan's stylization branch, gives the AI look. Use lens/film tags instead ("Arri Alexa, anamorphic, 35mm film grain")
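To make those settings concrete, here's roughly how they map onto a diffusers-style call. This is a sketch assuming the WanImageToVideoPipeline interface from the diffusers Wan examples, not the repo's actual loader; the model id, the scheduler wiring, and the negative-prompt placeholder are assumptions.

import torch
from diffusers import WanImageToVideoPipeline, UniPCMultistepScheduler
from diffusers.utils import load_image, export_to_video

# Sketch only: model id and scheduler wiring follow the public diffusers Wan
# examples, not the repo's actual code.
pipe = WanImageToVideoPipeline.from_pretrained(
    "Wan-AI/Wan2.2-I2V-A14B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")
# flow_shift = 5 for hero shots, 8 for b-roll (the upstream defaults mentioned above)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config, flow_shift=5.0)

keyframe = load_image("shot_03_keyframe.png")  # FLUX.2 keyframe from stage 3 (placeholder path)
video = pipe(
    image=keyframe,
    # one camera verb, sentence-case, placed first; lens/film tags instead of "cinematic"
    prompt="Tracking shot following from behind, a courier cycling through rain. "
           "Arri Alexa, anamorphic, 35mm film grain.",
    negative_prompt="<verbatim Chinese negative prompt from shared_config.py>",
    height=720, width=1280,
    num_frames=81,          # 81 frames @ 16 fps native, not 121 @ 24
    guidance_scale=5.0,
).frames[0]
export_to_video(video, "shot_03.mp4", fps=16)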

Performance work:

  • ParaAttention FBCache (lossless 2× on Wan2.2)
  • torch.compile on transformer_2 (selective, the dual-expert MoE makes full compile flaky) - another 1.2×
  • AITER MoE acceleration on Qwen director (vLLM)
  • End-to-end: 25.9 min → 10.4 min per 720p clip on MI300X
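The selective compile is literally just wrapping the second expert rather than the whole pipeline. Continuing from the pipeline sketch above (still assuming a diffusers-style pipe exposing transformer / transformer_2 experts; FBCache is applied separately via ParaAttention):

import torch

# Sketch: compile only the expert that tolerates it, leave the rest eager.
pipe.transformer_2 = torch.compile(
    pipe.transformer_2,
    mode="max-autotune-no-cudagraphs",  # cudagraphs tend to fight the dual-expert switching
    dynamic=False,
)
# First call pays the compile cost; later shots reuse the compiled graph as long
# as resolution and frame count stay fixed (1280x720, 81 frames).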

Why a single MI300X: 192 GB HBM3 lets a 35B MoE, 4B diffusion, 14B I2V MoE, 3.5B music, and a TTS share the same card sequentially. Same stack on a 24 GB consumer GPU would need 4-5 boxes wired together.

Code (public, Apache 2.0): https://github.com/bladedevoff/studiomi300

Hugging Face Space (documentation; a like on the Space is appreciated 🙏): https://huggingface.co/spaces/lablab-ai-amd-developer-hackathon/studiomi300

The live demo on the HF Space is temporarily offline while the infra comes back up; it should be back within hours. In the meantime, the showcase reels in the repo are real pipeline outputs, with no human-re-edited shots.

Happy to dig into AITER MoE setup, FBCache tuning, FLF2V anchoring, or the vision critic's failure taxonomy in comments.

u/Inevitable-Log5414 — 4 hours ago
▲ 52 r/ROCm

I got tired of hunting AMD GPU + AI configs across blog posts and Discord threads, so I built a curated index — rocmate

Every time I set up a new AI tool on my RX 7900 XTX, I spent hours digging through GitHub issues, outdated blog posts, and Discord threads just to find the right HSA_OVERRIDE value or the correct PyTorch ROCm wheel URL. The information exists, but it's scattered and rarely chip-specific.

So I built rocmate — a version-controlled compatibility index + CLI that tells you what works on your specific AMD GPU:

pip install rocmate
rocmate doctor        # check your system
rocmate show ollama   # see tested config for your chip
rocmate install ollama # install with correct ENV vars

The index currently covers tools like Ollama, Stable Diffusion WebUI, vLLM, Axolotl, and ExLlamaV2 across 5 chip generations (gfx1100, gfx1101, gfx1102, gfx1030, gfx1034).

What I actually need from this community: configs for chips I don't own. If you have an RX 6700 (gfx1031), RX 5700 (gfx1010), or any RDNA1 card, and you've gotten any of these tools running — a 5-minute PR with your config would help everyone with the same hardware.
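If you want to sanity-check your setup before opening a PR, a quick PyTorch check is usually enough to confirm the card is actually visible. This is a sketch: the HSA_OVERRIDE_GFX_VERSION=10.3.0 value is the usual community workaround for gfx1031-class RDNA2 cards, and the exact fields rocmate wants in a PR are whatever the repo templates say, not this snippet.

# Quick sanity check before contributing a config.
# Run with the override exported first, e.g. for an RX 6700 (gfx1031):
#   export HSA_OVERRIDE_GFX_VERSION=10.3.0
import torch

print("PyTorch:", torch.__version__, "| HIP:", torch.version.hip)
print("GPU visible:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))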

GitHub: https://github.com/T0nd3/rocmate

PyPI: https://pypi.org/project/rocmate/

u/T0nd3 — 15 hours ago
▲ 41 r/ROCm

We squeezed 4x MoE prefill speed out of an RX 6800 XT by rewriting the matmul kernel in llama.cpp

Hey everyone,

I've been working on a fork of llama.cpp focused on making AMD GPUs first-class citizens for LLM inference. After months of profiling and kernel-level work, we just pushed v0.3.0 with some results worth sharing.

The short version: on a 35B MoE model (IQ4_XS quantized), prefill went from ~480 t/s to 1770 t/s on an RX 6800 XT. Dense models stayed flat at 480 t/s, which is expected since the optimization targets the small-matrix multiply pattern that MoE routing creates.
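To make the "small-matrix multiply pattern" concrete, here's a toy illustration in plain PyTorch (nothing to do with the fork's actual HIP kernels; all shapes are illustrative): after routing, each expert only sees the handful of tokens assigned to it, so one fat GEMM turns into many skinny ones that can't hide memory latency.

import torch

tokens, hidden, n_experts, top_k = 512, 1024, 16, 4
x = torch.randn(tokens, hidden)
experts = [torch.randn(hidden, hidden) for _ in range(n_experts)]
assignment = torch.randint(n_experts, (tokens, top_k))  # stand-in for the router

for e in range(n_experts):
    rows = (assignment == e).any(dim=-1)
    if rows.any():
        # Dense prefill would be a single (512 x 1024) @ (1024 x 1024) GEMM.
        # Each expert instead gets a ~128-row slice, so launch overhead and
        # memory traffic dominate — which is what the kernel work targets.
        _ = x[rows] @ experts[e]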

Why we did this:

The upstream llama.cpp treats AMD GPUs as "just another backend." The kernels are written for NVIDIA and ported over. We found that the dequantization path was leaving massive bandwidth on the table on RDNA2, and the matmul kernel for MoE models was completely memory-bound. So we went in at the HIP level.

What we shipped:

- A BFE-based dequantization kernel for IQ4_XS that runs 13x faster in isolation

- An async pipeline that overlaps dequant launches with compute, cutting kernel launch overhead by 31%

- An experimental LDS double-buffered matmul kernel that overlaps weight loading with DP4A compute. This is where the 4x gain comes from. It's behind a flag because the latency variance is still too high for production use. We know why (LDS bank conflicts on symmetric tile dimensions) and we already have the fix planned.

The experimental flag is there because we believe in shipping transparently. The gain is real, the variance is real too, and we'd rather let people benchmark it themselves than pretend it's stable.

If you're running AMD hardware and want to try it, the build scripts and benchmark harness are in the repo. No CMake changes needed.

GitHub: https://github.com/Stormrage34/llama.cpp-turboquant-hip

Happy to answer questions about the kernel work, the profiling process, or why MoE models benefit so much more than dense ones.

u/CryptoStef33 — 2 days ago
▲ 26 r/ROCm

AMD RX 7900 XTX + ROCm + Gemma 4 26B — here's what actually worked for me

Recent AMD/ROCm updates finally made local AI inference stable and I couldn't be happier.

Back in early 2025, I was running Mistral 7B CUDA with a custom HIP converter I built myself just to get it working on AMD. Now it runs natively without any of that. What a difference.

The system choice was intentional — RX 7900 XTX + Ryzen 9, partly for the price, but mainly because AMD's FP throughput and memory characteristics worked better for my specific workload. Some parts of my experimental pipeline were unstable on NVIDIA for reasons I still need to investigate.

Context length is still the limiting factor on a single local machine. My plan is to keep the core logic local and connect to a server for heavier lifting. The biggest win is keeping my AI in a safe place — protected from model updates and external changes.

One thing I'd like to see: better quantization support in vLLM. I understand it's server-oriented by design, but native quantization support for consumer GPUs would go a long way.

Setup

  • GPU: AMD Radeon RX 7900 XTX (24GB / gfx1100)
  • CPU: AMD Ryzen 9 9950X3D
  • OS: Ubuntu 24.04.2 LTS
  • ROCm: 7.2.3
  • Stack: llama.cpp (GGML_HIP=ON) + vLLM (ROCm)

Benchmark Results

  • Gemma 4 26B A4B — llama.cpp (HIP) Q4_K_M — PP: ~3355 t/s / TG: ~102 t/s
  • Qwen2.5-7B — vLLM (ROCm) FP16 — PP: ~3410 t/s / TG: ~56 t/s
  • Gemma 2 9B — llama.cpp (HIP) Q4_K_M — PP: ~2773 t/s / TG: ~79 t/s

PP = Prompt Processing (prefill), TG = Token Generation (decode)

The critical flag for llama.cpp

Building without -DGGML_HIP=ON compiles fine but silently falls back to CPU. No warning.

cmake -B build \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS="gfx1100" \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_C_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_PREFIX_PATH=/opt/rocm-7.2.3

cmake --build build --config Release -j$(nproc)

Docker setup

docker run -it \
  --device=/dev/kfd \
  --device=/dev/dri/card0 \
  --device=/dev/dri/renderD128 \
  --group-add video \
  -v /your/model/path:/workspace \
  rocm/pytorch:latest bash


Running


HIP_VISIBLE_DEVICES=0 ./build/bin/llama-server \
  -m /workspace/your-model.gguf \
  -ngl 99 \
  --host 0.0.0.0 \
  --port 8000

  • HIP_VISIBLE_DEVICES=0 — stops ROCm from picking up the CPU iGPU as a second device
  • -ngl 99 — loads all layers to the GPU. Without this, it runs on CPU regardless of build

Lazy startup script

Got tired of typing the same commands every time:

#!/bin/bash
docker start gemma2-vllm
docker exec -it gemma2-vllm bash -c "
cd /workspace/llama.cpp && \
HIP_VISIBLE_DEVICES=0 ./build/bin/llama-server \
  -m /workspace/your-model.gguf \
  -ngl 99 \
  --host 0.0.0.0 \
  --port 8000
"

Save as start_model.sh, chmod +x, done.

Model

Quantized Gemma 4 26B A4B on this setup — original 48GB → 16GB Q4_K_M.

https://huggingface.co/rakisis-core/Gemma-4-26B-A4B-Q4K_M-GGUF

---

**Full setup, scripts & guides:**

https://github.com/xinkanglabs/rocm-local-ai-stack

---

— XinXin-Kang / Xinkang Labs 🌐 xinkanglabs.com.au

u/Limp_Doubt6411 — 1 day ago
▲ 22 r/ROCm

Managed to get 40 t/s on Qwen 27B (MTP) with an RX 6800 XT - Sharing my optimized fork

Hey everyone,

I’m pretty new to the ROCm scene, but I’ve been spending a lot of time lately trying to push the limits of my RX 6800 XT. I’ve been using Gemini to help me navigate the more technical C++ side of things and to troubleshoot some of the common memory issues we run into on Team Red.

After a lot of trial and error, I’ve put together a fork of llama.cpp that integrates TurboQuant and stabilizes Multi-Token Prediction (MTP) specifically for HIP/ROCm.

With this setup, I'm hitting about 40 t/s during generation on Qwen 2.5 27B (IQ4_XS) with a 32k context. For a 16GB card, I'm really happy with the stability. I had to fix some syntax errors in the graph logic that were causing double-free crashes when the VRAM got near its limit at high context, and I've tuned the batch settings to play nicer with RDNA 2.

If anyone else is running an AMD card and wants to try it out, I've uploaded the code and a basic build guide here: https://github.com/Stormrage34/llama.cpp-turboquant-hip

It's still a work in progress, but the performance boost over the standard implementation was significant enough that I thought it was worth sharing with the community. Let me know if you run into any issues or if you have suggestions for further AMD-specific optimizations.

u/CryptoStef33 — 4 days ago
▲ 80 r/ROCm+1 crossposts

More Qwen3.6-27B MTP success, but on dual MI50s

TLDR: The hype is real! 1.5x speedup. Up to 2x speedup with tensor parallelism!

After reading the PR I immediately hunted for MTP-compatible Q4_1 quants (they offer a small speedup on these compute-limited older cards) but couldn't find any.

Luckily I came across this post, which highlighted how to graft MTP onto your own quants, so I attached it to an Unsloth quant I already had.

Setup

  • CachyOS (Arch Linux)
  • ROCm 7.2
  • Both cards running at PCIe 4.0 x 8

Built the llama.cpp fork https://github.com/skyne98/llama.cpp-gfx906 with https://github.com/ggml-org/llama.cpp/pull/22673 and ran the following command with the included PR benchmark script:

llama-server -m ~/models/Qwen3.6-27B-MTP-Q4_1.gguf \
--temp 1.0 --min-p 0.0 --top-k 20 --top-p 0.95 \
--jinja --presence-penalty 1.5 \
--chat-template-kwargs '{"preserve_thinking": true}' \
-ub 2048 -b 2048 \
-fa 1 -np 1 \
--no-mmap --no-warmup \
-dev ROCm0,ROCm1 --fit on -fitt 256

Script Benchmark

Stock:

code_python        pred= 192 draft=   0 acc=   0 rate=n/a tok/s=26.2
code_cpp           pred= 192 draft=   0 acc=   0 rate=n/a tok/s=26.2
explain_concept    pred= 192 draft=   0 acc=   0 rate=n/a tok/s=26.3
summarize          pred= 192 draft=   0 acc=   0 rate=n/a tok/s=26.4
qa_factual         pred= 192 draft=   0 acc=   0 rate=n/a tok/s=26.4
translation        pred= 192 draft=   0 acc=   0 rate=n/a tok/s=26.4
creative_short     pred= 192 draft=   0 acc=   0 rate=n/a tok/s=26.4
stepwise_math      pred= 192 draft=   0 acc=   0 rate=n/a tok/s=26.3
long_code_review   pred= 192 draft=   0 acc=   0 rate=n/a tok/s=26.0

With MTP on: --spec-type mtp --spec-draft-n-max 2

code_python        pred= 192 draft= 144 acc= 119 rate=0.826 tok/s=39.6
code_cpp           pred= 192 draft= 156 acc= 113 rate=0.724 tok/s=36.5
explain_concept    pred= 192 draft= 154 acc= 113 rate=0.734 tok/s=36.7
summarize          pred= 192 draft= 138 acc= 121 rate=0.877 tok/s=40.7
qa_factual         pred= 192 draft= 144 acc= 119 rate=0.826 tok/s=39.4
translation        pred= 192 draft= 152 acc= 115 rate=0.757 tok/s=37.5
creative_short     pred= 192 draft= 156 acc= 113 rate=0.724 tok/s=36.6
stepwise_math      pred= 192 draft= 146 acc= 118 rate=0.808 tok/s=39.0
long_code_review   pred= 192 draft= 150 acc= 115 rate=0.767 tok/s=37.8

Aggregate: {
 "n_requests": 9,
 "total_predicted": 1728,
 "total_draft": 1340,
 "total_draft_accepted": 1046,
 "aggregate_accept_rate": 0.7806,
 "wall_s_total": 51.42
}

With tensor parallelism on: -sm tensor

code_python        pred= 192 draft=   0 acc=   0 rate=n/a tok/s=35.0
code_cpp           pred= 192 draft=   0 acc=   0 rate=n/a tok/s=34.8
explain_concept    pred= 192 draft=   0 acc=   0 rate=n/a tok/s=34.6
summarize          pred= 192 draft=   0 acc=   0 rate=n/a tok/s=34.6
qa_factual         pred= 192 draft=   0 acc=   0 rate=n/a tok/s=34.7
translation        pred= 192 draft=   0 acc=   0 rate=n/a tok/s=34.7
creative_short     pred= 192 draft=   0 acc=   0 rate=n/a tok/s=34.7
stepwise_math      pred= 192 draft=   0 acc=   0 rate=n/a tok/s=34.6
long_code_review   pred= 192 draft=   0 acc=   0 rate=n/a tok/s=34.3

Combining MTP and tensor parallelism:

code_python        pred= 192 draft= 142 acc= 120 rate=0.845 tok/s=59.8
code_cpp           pred= 192 draft= 148 acc= 116 rate=0.784 tok/s=56.6
explain_concept    pred= 192 draft= 146 acc= 117 rate=0.801 tok/s=56.8
summarize          pred=  53 draft=  42 acc=  31 rate=0.738 tok/s=54.5
qa_factual         pred= 192 draft= 148 acc= 117 rate=0.790 tok/s=56.8
translation        pred= 192 draft= 146 acc= 117 rate=0.801 tok/s=57.3
creative_short     pred= 192 draft= 154 acc= 114 rate=0.740 tok/s=54.8
stepwise_math      pred= 192 draft= 140 acc= 121 rate=0.864 tok/s=59.6
long_code_review   pred= 192 draft= 148 acc= 117 rate=0.790 tok/s=56.2

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1589,
  "total_draft": 1214,
  "total_draft_accepted": 970,
  "aggregate_accept_rate": 0.799,
  "wall_s_total": 32.24
}

Real-world benchmark

The numbers above look absolutely insane; in the real world, however, the speedup dwindles very quickly, and there's also a regression in prefill speed that is currently being worked on. I ran an 18k-token coding prompt and it's clear the 60 t/s is only observable for very short prompts, but combining MTP and tensor parallelism does indeed net a hefty ~2x speedup.

Stock:

prompt eval time =   53173.24 ms / 19191 tokens (    2.77 ms per token,   360.91 tokens per second)
      eval time =  337695.94 ms /  7791 tokens (   43.34 ms per token,    23.07 tokens per second)
     total time =  390869.18 ms / 26982 tokens

With MTP on:

prompt eval time =   84388.11 ms / 19191 tokens (    4.40 ms per token,   227.41 tokens per second)
      eval time =  260732.83 ms /  8408 tokens (   31.01 ms per token,    32.25 tokens per second)
     total time =  345120.94 ms / 27599 tokens

With tensor parallelism:

prompt eval time =   41925.27 ms / 19191 tokens (    2.18 ms per token,   457.74 tokens per second)
       eval time =  253262.25 ms /  8104 tokens (   31.25 ms per token,    32.00 tokens per second)
      total time =  295187.53 ms / 27295 tokens

Combining MTP and tensor parallelism:

prompt eval time =   49696.04 ms / 19191 tokens (    2.59 ms per token,   386.17 tokens per second)
       eval time =  155821.64 ms /  7440 tokens (   20.94 ms per token,    47.75 tokens per second)
      total time =  205517.69 ms / 26631 tokens
u/legit_split_ — 5 days ago
▲ 26 r/ROCm+2 crossposts

fine-tuning 27B hybrid models on strix halo (ryzen ai max+ 395 / gfx1151, 128 gb unified memory) — full guide, patches, orchestrator

Sharing a guide I just published for fine-tuning 27B+ LLMs on AMD Strix Halo (Ryzen AI MAX+ 395, Radeon 8060S / gfx1151, 128 GB unified memory). MIT licensed.

Repo: https://github.com/h34v3nzc0dex/strix-halo-llm-finetune-guide

None of the individual pieces are novel — kernel patches, ROCm 7.13 nightly, FLA, bitsandbytes, LoRA, llama.cpp. The intersection (Strix Halo + gfx1151 + FLA + Qwen3.5 hybrid at 27B) isn't documented anywhere I could find, and getting it stable took a lot of dead ends I'd rather other people skip.

Stack tested: kernel 6.19.14, PyTorch 2.11.0+rocm7.13.0a20260506, ROCm 7.13 nightly, FLA 0.5.1 patched, bitsandbytes 0.50.0.dev0 built from source for gfx1151, llama.cpp b867+. Hardware: Corsair AI Workstation 300 (Sixunited AXB35-02 board, BIOS 3.07).

Things the guide actually covers that I had to figure out the hard way:

  • PyPI bitsandbytes ships zero ROCm binaries. From-source build with -DROCM_VERSION=83, plus a runtime symlink libbitsandbytes_rocm83.so → libbitsandbytes_rocm713.so so bnb's HIP detection on PyTorch 2.10/2.11 stops complaining.
  • FLA's Triton kernels crash on gfx1151 (RDNA 3.5) with num_warps > 4 (Triton#5609) and a tl.cumsum + tl.sum codegen interaction (Triton#3017). Idempotent re-patch script included.
  • In-process Trainer eval at 27B / 8192 seq length is structurally broken on unified-memory APUs — either kernel TTM page allocation failure from fragmentation, or memory watchdog SIGKILL when free RAM drops under ~8 GB. Eval is moved out-of-process via a bash orchestrator aligned to save_steps, waiting for full GPU release between train and eval, with a JSONL trend log.
  • Mainline kernel .deb run-parts double-dir bug on Ubuntu 24.04+ leaves packages half-configured. Repack script included.
  • /srv perms regressing to 0750 mid-training breaks importlib.metadata path traversal and crashes TRL's create_model_card. Cron watchdog restoring 755.

Verified result: in-progress production fine-tune of Qwen3.5-27B (hybrid, 16 full-attention + 48 GatedDeltaNet layers), bf16 LoRA r=128/α=256, eval rolling at 0.13 loss / 96.5% token accuracy, ~11 min/step, ~4-day total runtime.
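For reference, the LoRA hyperparameters in that last paragraph map onto a peft config along these lines. A sketch only: the dropout value and target-module list are placeholders, since the real modules differ between the full-attention and GatedDeltaNet blocks and the guide pins the actual ones.

from peft import LoraConfig

# Sketch of the r=128 / alpha=256 bf16 LoRA setup described above.
lora_config = LoraConfig(
    r=128,
    lora_alpha=256,
    lora_dropout=0.05,           # assumption; the guide may use a different value
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # placeholder list
)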

Feedback and issues welcome, especially from people on different AXB35-02 boards or non-Corsair Strix Halo systems — I'd like to know what's board-specific vs. generic.


u/Outrageous_Bug_669 — 1 day ago
▲ 3 r/ROCm

TurboQuant + MTP for ROCm

TL;DR: I got TBQ4 KV cache + MTP working on AMD ROCm for RX 7900 XTX / RDNA3 / gfx1100 in llama.cpp. Main win: 64k context fits on 24 GB VRAM and remains usable.

Branch: tbq4-rdna3-experiment

I dug into TurboQuant / TBQ4 + MTP on AMD because the existing AMD paths were incomplete or broken for my setup. This branch uses the ROCm VEC Flash Attention path with inline TBQ4 dequant.

Test setup:

- RX 7900 XTX, 24 GB

- RDNA3 / gfx1100

- ROCm / HIP

- Qwen3.6-27B Q4_K_M MTP GGUF

- tbq4_0 KV cache

- MTP with --spec-draft-n-max 3

Current numbers:

- tbq4_0, 64k ctx: 38–54 tok/s, ~20 GB VRAM

- Prefill: 537.7 tok/s at 16k; 360.8 tok/s in the 64k test

- q8_0 baseline: ~49.8 tok/s at 16k, ~31 tok/s at 32k, ~22–23 GB VRAM

Caveats:

- RX 7900 XTX is RDNA3 / gfx1100, not RDNA3.5.

- RDNA3.5 / RDNA4 are enabled but untested.

- RotorQuant / PlanarQuant / IsoQuant are present but not validated.

- These are reported points from separate runs, not a clean scaling curve.

Happy for a few test reports.

Useful bug reports > hype.

https://github.com/DrBearJew/llama.cpp/tree/tbq4-rdna3-experiment

u/DrBearJ3w — 5 hours ago
▲ 19 r/ROCm

What is the current state of PyTorch and AI coding functionality on AMD cards?

Hello everyone. I am planning to buy a GPU to do AI training on (I'm a master's student), and currently any NVIDIA card with 24 GB of VRAM is way too expensive, even used. I was wondering if it's worth the trouble to settle for an RX 7900 XTX, which has 24 GB of VRAM and 960 GB/s of memory bandwidth, or should I settle for a used 3090? If you could share your recent experience of doing AI training on AMD, I would really appreciate it. Thanks.

u/SmoothOne2913 — 4 days ago
▲ 35 r/ROCm+2 crossposts

Isaac Sim 5.1.0 on AMD Radeon RX 7800 XT


I have been developing a project called the Ghost Environment to prove that hardware vendor lock-in is a software choice rather than a physical limitation. Today I reached a significant milestone by successfully initializing NVIDIA Isaac Sim 5.1.0 on an AMD Radeon RX 7800 XT.

Technical Overview: The system operates as a Rust-based hypervisor that intercepts proprietary API calls at the system level. It utilizes JIT-compiled C++ stubs to spoof the NVIDIA Management Library and a specialized ZLUDA fork to translate CUDA math kernels into AMD-compatible instructions in real time.

Current State and Performance: The engine reached the app-ready state in 16 seconds with near-zero overhead. It is important to note that the viewport is currently fully black, as OptiX and hardware-accelerated ray tracing support have not been implemented yet. However, the core physics engine and UI are fully operational and the hardware gate is officially bypassed.

Release Status: This specific build featuring Isaac Sim and Omniverse support is currently in private beta and has not been released to the public repository yet. I am finalizing the internal logic to ensure the system is stable before the official launch.

If you would like to follow the development or be notified when the full release drops, please star or watch the repository on GitHub at https://github.com/Void-Compute/AMD-Ghost-Enviroment

I am 15 years old and I engineered this because I wanted to break the walls of a closed ecosystem. If I can do this, anyone can. You have the power to achieve great things.

u/ChrisGamer5013 — 4 days ago
▲ 37 r/ROCm

Got Qwen3-27B MTP running on AMD 7900 XTX at ~75 tok/s using llama.cpp

I noticed a few people are trying to run Qwen3-27B MTP on AMD GPUs and running into VRAM/OOM issues, so I wanted to share what worked for me.

I’m running it on a 7900 XTX and I’m getting around 75 tokens/s, which I’m very happy with.

The quant I used is this one:

https://huggingface.co/froggeric/Qwen3.6-27B-MTP-GGUF

in the Q4_K_XL (edit: Q4_K_M) flavour; I used the llama.cpp branch indicated in that repo.

My setup:

  • Windows 10
  • AMD Radeon 7900 XTX
  • Latest AMD drivers
  • Latest Vulkan SDK
  • VS Code 2026
  • Built llama.cpp from source
  • Launched the model immediately after compiling

Nothing fancy on the system side.

The important part seems to be using the right GGUF quant and the correct llama.cpp branch linked by the model author. With this setup I was able to run the model without the immediate OOM problems that others were seeing.

For reference, someone in the Qwen subreddit mentioned that they could barely get a 27B Q3 running on headless Debian with 32k context and Q4_0 KV cache, and that it would often OOM on the first message. On my Windows + Vulkan setup, this quant worked much better.

I also used ChatGPT to help me through the compile/setup steps; here’s the chat link:

https://chatgpt.com/share/69fd7345-b24-8396-8e54-d769d0e615d

Sorry, the chat is in Italian and I don't have the time to write a proper post right now, but maybe this is enough to get some people through. I also didn't try max context; maybe I will this evening. I'm sure 56k is doable with q8/q8 KV cache, but I think close to 100k should be achievable with some tinkering. Cheers.

EDIT: I know this is r/ROCm and I used Vulkan instead, lol, but I think this was the most appropriate place to post given the userbase of this sub.

u/nasone32 — 6 days ago
▲ 4 r/ROCm

Isaac Sim 5.1.0 Audited on AMD Silicon.


I finally got NVIDIA Isaac Sim 5.1.0 to boot on an RX 7800 XT and the logs are a total disaster. The industry treats this software like a fortress, but after looking at the telemetry it is clear the Green Moat, often referred to as the walled garden, is just aluminum foil.

My Ghost Hypervisor forced the stack to acknowledge the hardware truth. The log explicitly reports:

cuda 0 : AMD Radeon RX 7800 XT [ZLUDA] (16 GiB, sm_88, mempool not supported).

The app says Active Yes and hits App Ready in 16.390s. It thinks it has a 4090, but it is currently undergoing a logic breakdown because it cannot find a UVM driver on a device it already initialized.

The initialization log is a funeral procession for legacy code. I identified 34 distinct architectural deprecations in a single boot cycle, and this was only at startup, without even putting an object inside Isaac Sim:

  1. pxr.Semantics is deprecated
  2. warp.sim module is deprecated
  3. omni.isaac.nucleus has been deprecated
  4. omni.isaac.range_sensor has been deprecated
  5. omni.isaac.asset_browser has been deprecated
  6. omni.isaac.assets_check has been deprecated
  7. omni.isaac.cloner has been deprecated
  8. omni.isaac.core_nodes has been deprecated
  9. omni.isaac.cortex has been deprecated
  10. omni.isaac.franka has been deprecated
  11. omni.isaac.kit has been deprecated
  12. omni.isaac.quadruped has been deprecated
  13. omni.isaac.lula has been deprecated
  14. omni.isaac.sensor has been deprecated
  15. omni.isaac.surface_gripper has been deprecated
  16. omni.isaac.universal_robots has been deprecated
  17. omni.isaac.wheeled_robots has been deprecated
  18. omni.isaac.window.about has been deprecated
  19. omni.isaac.core has been deprecated
  20. omni.kit.property.isaac has been deprecated
  21. omni.replicator.isaac has been deprecated
  22. omni.isaac.lula_test_widget has been deprecated
  23. omni.isaac.menu has been deprecated
  24. omni.isaac.motion_generation has been deprecated
  25. omni.isaac.block_world has been deprecated
  26. omni.isaac.grasp_editor has been deprecated
  27. omni.isaac.occupancy_map has been deprecated
  28. omni.isaac.robot_assembler has been deprecated
  29. omni.isaac.scene_blox has been deprecated
  30. omni.isaac.synthetic_recorder has been deprecated
  31. omni.isaac.throttling has been deprecated
  32. omni.isaac.physics_inspector has been deprecated
  33. omni.isaac.range_sensor.ui has been deprecated
  34. omni.isaac.range_sensor.examples has been deprecated

This flagship software is a digital graveyard held together by legacy shims that do nothing but increase instruction latency.

The professionalism of the internal stack is non-existent. The logs reveal a service named pipapi that triggers this alert:

Warning [omni.kit.pipapi.pipapi] extension omni.kit.widget.cache_indicator has a python.pipapi entry but use_online_index true is not set. It does not do anything and can be removed.

(Note: This likely refers to the Python Package Installer Pip API, though the implementation and spacing suggest a lack of semantic rigor.)

This is not enterprise engineering. This is a system held together by hopes and prayers.

Current Objective: Phase 2.

Void Compute is currently executing the mapping of the stateless OptiX 7.x function table to the AMD HIP-RT backend. This involves the interception of the OptixFunctionTable and the JIT translation of Shader Binding Tables into RDNA 3-compatible acceleration structures. By bridging the gap between the stateless OptiX API and the HIP ray tracing dispatchers, I am eliminating the proprietary dependency at the instruction level.

u/ChrisGamer5013 — 1 day ago
▲ 24 r/ROCm

Tried ROCm 7.1 vs Vulkan/RADV on Radeon 890M for LLM inference (8B and 35B-MoE). Vulkan won both. Why?

Posting because I expected the opposite result, and I want to know if I misconfigured ROCm or if this is the actual state of things on Radeon 890M class iGPUs.

Hardware: Beelink SER9 Pro, Radeon 890M iGPU (16 RDNA 3.5 CUs), 32GB LPDDR5x-7500. Ubuntu 24.04, kernel 6.11.

Two backends tested:

  1. ROCm 7.1 — installed via the official AMD repo. gfx1150 target (gfx1100 binary fallback because gfx1150 isn't fully supported yet). Built llama.cpp with -DGGML_HIPBLAS=ON.
  2. Vulkan/RADV — mesa 24.x, llama.cpp (and LMStudio for the bigger model) built with -DGGML_VULKAN=ON.

Two workloads:

WORKLOAD A — Gemma 4 E4B Q8_0 (8B dense, full offload, 4K ctx):

- ROCm: ~12.5 tok/s
- Vulkan/RADV: ~16.0 tok/s

WORKLOAD B — Qwen 3.5 35B A3B Q4_K_M (35B MoE, 15–20 of ~48 layers offloaded, 4–8K ctx):

- ROCm: ~14 tok/s (had to fight harder to get this working with partial offload — LMStudio's ROCm path on gfx1150 was less stable than its Vulkan path)
- Vulkan/RADV via LMStudio: 20–22 tok/s steady

In both cases: same machine, same model file, same prompt. Power and thermals were similar between backends — this is throughput, not heat-throttling.

My read on why:

- gfx1150 (RDNA 3.5) doesn't have first-class kernel support in ROCm 7.1 yet. Falling back to gfx1100 binaries leaves perf on the table.
- The Vulkan backend in upstream llama.cpp got Wave32 flash-attention + graphics-queue scheduling patches in early 2026 that haven't landed in the ROCm path yet.
- For the 890M's iGPU class specifically, the integrated nature means memory bandwidth dominates, and Vulkan's path through RADV seems better optimized for shared LPDDR5x access patterns (rough bandwidth math in the sketch after this list).
- For partial offload specifically, Vulkan handles the GPU-CPU layer boundary cleaner in LMStudio than ROCm did.
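Rough sanity check on the bandwidth point: decode on an iGPU is close to bandwidth-bound, since each generated token has to stream the active weights from memory roughly once. A back-of-envelope sketch, where every number is an illustrative assumption rather than a measurement from my box:

# Back-of-envelope decode ceiling for a bandwidth-bound iGPU.
# All numbers are illustrative assumptions; substitute your own.

def decode_ceiling_tok_s(effective_bw_gb_s: float, bytes_per_token_gb: float) -> float:
    # Upper bound on tokens/s if each token streams the active weights once.
    return effective_bw_gb_s / bytes_per_token_gb

theoretical_bw = 120.0               # GB/s: LPDDR5x-7500 on a 128-bit bus (assumption)
effective_bw = 0.6 * theoretical_bw  # iGPUs rarely sustain the full figure (assumption)
active_weights_gb = 5.0              # GB streamed per token for the quantized model (assumption)

print(f"decode ceiling ~ {decode_ceiling_tok_s(effective_bw, active_weights_gb):.1f} tok/s")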

Open questions for the sub:

- Anyone running gfx1150-targeted ROCm builds (not gfx1100 fallback)? Does perf shift?
- Is the picture different at the Strix Halo 8060S iGPU class? More CUs, more bandwidth, possibly closer ROCm parity.
- ROCm build flag I'm missing for this iGPU class?

Not trying to dunk on ROCm — I want to use it for the unified-memory story on iGPUs, but Vulkan is faster on this class today. Curious if that flips with ROCm 8.x or with bigger silicon.

u/wolverinee04 — 7 days ago
▲ 1 r/ROCm+1 crossposts

About a month ago, I was dead set on getting a 9070 to replace my 3060 Ti. However, I’m planning to work on Gaussian Splatting for my college graduation project, so I’d like some advice before pulling the trigger on either card and regretting it later.

My current specs are:

  • Ryzen 7 9800X3D
  • 32 GB DDR5
  • Fedora Linux as my main OS

I’m mainly worried about CUDA support for Gaussian Splatting workflows, and I’m still not fully sure about ROCm yet.

Unfortunately, GPU pricing in my region is pretty bad right now, so a 9070 XT or a 5070 Ti are completely out of my budget. My realistic options are just the 9070 and the 5070.

Any advice from people using these cards for this kind of work on Linux?

u/razadoop — 7 days ago
▲ 2 r/ROCm

Help: r9700 fails under Ubuntu 24

I came across claims that there is an issue with newer VBIOS versions, and I'm hoping someone has found a path forward, or at least fellow sufferers.

I was working with Claude trying to diagnose the problem, hence the "Root Cause" analysis below, but at this point I'm just not sure how to move forward.

Hardware

  • GPU: AMD Radeon AI PRO R9700 (XFX)
  • Device ID: 0x7551
  • ASIC: gfx1201 / Navi48
  • VBIOS: P02 dated 07/25/2025
  • Kernel: 6.17.0-23
  • Driver: amdgpu-dkms 6.16.13 (ROCm 7.2.3)
  • Ubuntu 24.04

Symptoms

amdgpu fails during IP discovery initialization:

  • no /dev/accel/accel0
  • no KFD
  • rocminfo shows CPU only

Diagnostic patch output:

amdgpu: ip discovery actual sig: 0xce854377 expected: 0x28211407

Root Cause

The July 2025 VBIOS writes a PSP-encrypted IP discovery binary into VRAM.

Current drivers expect the old plaintext signature: 0x28211407

But the new VBIOS emits: 0xce854377

The payload appears encrypted/ciphertext beyond the first few bytes:

00000000: 77 43 85 ce ...
00000010: 8b e3 d7 42 52 fb ...

amdgpu_discovery_check_binary_valid() rejects it immediately, so the GPU never fully initializes.

Interesting Findings

  • IP_DISCOVERY_V4 is defined in amdgpu_discovery.c
  • But there is no actual V4 implementation
  • Strongly suggests AMD planned support for this encrypted format but did not ship the decryption path yet

Also:

  • the ACPI/sysmem fallback path returns -ENOENT
  • consumer Ryzen desktop boards generally do not expose the required ACPI TMR table
  • so desktop users have no fallback path

Additional Problem

The encrypted blob appears larger than:

DISCOVERY_TMR_SIZE = 10240

So even after decryption support lands, the buffer size may also need increasing.

Likely Fixes

  1. Implement PSP/TMR decryption support for the new 0xce854377 discovery format
  2. Or provide a non-encrypted VBIOS compatible with existing drivers
  3. Possibly increase DISCOVERY_TMR_SIZE

Impact

This may affect all Navi48 / gfx1201 / R9700 cards shipping with July 2025+ VBIOS revisions on Linux.

u/valalalalala — 6 days ago
▲ 6 r/ROCm

I'm using ROCm 7.2 and performance is so inconsistent. With Z Image Turbo, yesterday I was able to make 1008x1008 images in 20 sec, and anything over that resolution was ~80% slower. But today that limit is lower: 600x600 for a 12 sec gen, and anything above that might take 1 to 2 minutes. I don't understand why.

  • OS: win32
  • Python version: 3.12.11 (main, Aug 18 2025, 19:17:54) [MSC v.1944 64 bit (AMD64)]
  • Embedded Python: false
  • PyTorch version: 2.9.1+rocm7.2.1
  • Device: cuda:0 AMD Radeon RX 9070 XT (native type: cuda)
  • VRAM total: 15.92 GB
  • VRAM free: 15.77 GB
  • Torch VRAM total: 0 B
  • Torch VRAM free: 0 B

u/Glittering-Tough-353 — 6 days ago
▲ 5 r/ROCm+1 crossposts

Struggling on MI50 (gfx906), very slow with just ~10k ctx, am I doing something wrong?

Hi, I am new to local LLMs and I got 4x AMD Instinct MI50 32GB (128GB total), with a Supermicro H12SSL-i as the mobo. I tried to use Qwen3.6 with Claude Code; however, even without referencing files or installing skills or MCP servers, the harness is already at ~20k tokens from the start, and I often see the tps drop to 1 or even 0.1 in Omniroute's (API router) log panel.

Seeing other homelabbers easily get ~80 or even ~100 tps with just a single RTX 3090, without struggling through all the ROCm + PyTorch + Triton + vLLM version matching, patching, and rocBLAS library chaos, it feels very unfair. Am I doing something very stupid in my server setup, or is it just fate and punishment for cutting corners and buying AMD cards?

Anyway, back to the analysis. I followed the recipe from a repo that has worked for others:

https://arkprojects.space/wiki/AMD_GFX906/vllm/recipes/Qwen3.6-35B-A3B and converted it into this docker command:

docker run -d \
  --name vllm-gfx906-mixa3607 \
  --network host \
  --ipc host \
  --pid host \
  --privileged \
  --cap-add=SYS_ADMIN \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add video \
  --group-add $(getent group render | cut -d: -f3) \
  --volume /sys:/sys:ro \
  --volume $HOME/.triton:/root/.triton \
  -v /media/docker/mount/vllm/models:/models \
  --shm-size=16g \
  -e HSA_OVERRIDE_GFX_VERSION=9.0.6 \
  -e FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" \
  -e VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS="1" \
  mixa3607/vllm-gfx906:0.20.1-rocm-7.2.1-aiinfos \
  vllm serve /models/cyankiwi-Qwen3.6-35B-A3B-AWQ-4bit \
    --served-model-name qwen3.6 \
    --tensor-parallel-size 4 \
    --port 8100 \
    --async-scheduling \
    --trust-remote-code \
    --enable-auto-tool-choice \
    --reasoning-parser qwen3 \
    --tool-call-parser qwen3_coder \
    --max-model-len 200000 \
    --data-parallel-size 1 \
    --dtype float16 \
    --gpu-memory-utilization 0.95 \
    --limit-mm-per-prompt '{"image": 20, "video": 4}' \
    --max-num-seqs 16 \
    --enable-expert-parallel \
    --enable-prefix-caching

And I tried to benchmark with the following script, run directly in the container's bash so there's no API-router overhead:

FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" OMP_NUM_THREADS=4 VLLM_LOGGING_LEVEL=DEBUG vllm bench serve \
--dataset-name random \
--random-input-len 10000 \
--random-output-len 1000 \
--num-prompts 4 \
--request-rate 10000 \
--ignore-eos

And the results are as follows:

============ Serving Benchmark Result ============
Successful requests:                     4         
Failed requests:                         0         
Request rate configured (RPS):           10000.00  
Benchmark duration (s):                  72.19     
Total input tokens:                      40000     
Total generated tokens:                  4000      
Request throughput (req/s):              0.06      
Output token throughput (tok/s):         55.41     
Peak output token throughput (tok/s):    88.00     
Peak concurrent requests:                4.00      
Total token throughput (tok/s):          609.53    
---------------Time to First Token----------------
Mean TTFT (ms):                          17451.07  
Median TTFT (ms):                        18025.08  
P99 TTFT (ms):                           26242.86  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          54.49     
Median TPOT (ms):                        53.97     
P99 TPOT (ms):                           63.98     
---------------Inter-token Latency----------------
Mean ITL (ms):                           54.49     
Median ITL (ms):                         45.98     
P99 ITL (ms):                            50.17     
==================================================

with 20k ctx:

============ Serving Benchmark Result ============
Successful requests:                     4         
Failed requests:                         0         
Request rate configured (RPS):           20000.00  
Benchmark duration (s):                  96.08     
Total input tokens:                      80000     
Total generated tokens:                  4000      
Request throughput (req/s):              0.04      
Output token throughput (tok/s):         41.63     
Peak output token throughput (tok/s):    76.00     
Peak concurrent requests:                4.00      
Total token throughput (tok/s):          874.24    
---------------Time to First Token----------------
Mean TTFT (ms):                          26404.19  
Median TTFT (ms):                        26443.89  
P99 TTFT (ms):                           40167.30  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          69.37     
Median TPOT (ms):                        69.38     
P99 TPOT (ms):                           82.77     
---------------Inter-token Latency----------------
Mean ITL (ms):                           69.37     
Median ITL (ms):                         55.24     
P99 ITL (ms):                            342.95    
==================================================

Do these numbers look normal for a 4x MI50 setup? Anything I should test or tune? Thank you.

u/simi6a6 — 4 days ago
▲ 6 r/ROCm

ROCm support for 780m igpu

Hey guys,

I changed laptops at Christmas and now have a 780M iGPU. I want to use ROCm for PyTorch but it's very unstable: I get GPU resets with the error "MES failed to respond to msg=REMOVE_QUEUE". I've seen in some GitHub issues that it's been around for more than a year. Even with the latest TheRock build, same thing. I'm using Linux.

Is there any hope I'll get support in the near future, or am I doomed to run on CPU? I'm quite disappointed with AMD.

u/Rgoplay_ — 4 days ago