u/wolverinee04

▲ 24 r/ROCm

Tried ROCm 7.1 vs Vulkan/RADV on Radeon 890M for LLM inference (8B and 35B-MoE). Vulkan won both. Why?

Posting because I expected the opposite result and I want to know if I misconfigured ROCm or if this is the actual state of things on Radeon 890M-class iGPUs.

Hardware: Beelink SER9 Pro, Radeon 890M iGPU (16 RDNA 3.5 CUs), 32GB LPDDR5x-7500. Ubuntu 24.04, kernel 6.11.

Two backends tested:

  1. ROCm 7.1 — installed via the official AMD repo. gfx1150 target (gfx1100 binary fallback because gfx1150 isn't fully supported yet). Built llama.cpp with -DGGML_HIPBLAS=ON.

  2. Vulkan/RADV — mesa 24.x, llama.cpp (and LMStudio for the bigger model) built with -DGGML_VULKAN=ON.
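For concreteness, the two builds were plain llama.cpp CMake invocations, roughly like below. Sketch, not gospel: the gfx1100 target plus the HSA_OVERRIDE_GFX_VERSION env var is the usual workaround for a gfx target ROCm doesn't fully support yet, and the model path is a placeholder.

# ROCm/HIP build, gfx1100 fallback for the 890M's gfx1150
# (may need CC/CXX pointed at ROCm's clang, depending on your ROCm install)
cmake -B build-hip -DGGML_HIPBLAS=ON -DAMDGPU_TARGETS=gfx1100
cmake --build build-hip -j
HSA_OVERRIDE_GFX_VERSION=11.0.0 ./build-hip/bin/llama-bench -m model.gguf

# Vulkan/RADV build
cmake -B build-vk -DGGML_VULKAN=ON
cmake --build build-vk -j
./build-vk/bin/llama-bench -m model.gguf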

Two workloads:

WORKLOAD A — Gemma 4 E4B Q8_0 (8B dense, full offload, 4K ctx):

- ROCm: ~12.5 tok/s

- Vulkan/RADV: ~16.0 tok/s

WORKLOAD B — Qwen 3.5 35B A3B Q4_K_M (35B MoE, 15–20 of ~48 layers offloaded, 4–8K ctx):

- ROCm: ~14 tok/s (had to fight harder to get this working with partial offload — LMStudio's ROCm path on gfx1150 was less stable than its Vulkan path)

- Vulkan/RADV via LMStudio: 20–22 tok/s steady

In both cases, same machine, same model file, same prompt. Power and thermals were similar between backends — this is throughput, not heat-throttling.

My read on why:

- gfx1150 (RDNA 3.5) doesn't have first-class kernel support in ROCm 7.1 yet. Falling back to gfx1100 binaries leaves perf on the table.

- The Vulkan backend in upstream llama.cpp got Wave32 flash-attention + graphics-queue scheduling patches in early 2026 that haven't landed in the ROCm path yet.

- For the 890M's iGPU class specifically, the integrated nature means memory bandwidth dominates, and Vulkan's path through RADV seems better optimized for shared LPDDR5x access patterns.

- For partial offload specifically, Vulkan handles the GPU-CPU layer boundary more cleanly in LMStudio than ROCm did.

Open questions for the sub:

- Anyone running gfx1150-targeted ROCm builds (not gfx1100 fallback)? Does perf shift?

- Is the picture different at the Strix Halo 8060S iGPU class? More CUs, more bandwidth, possibly closer ROCm parity.

- Is there a ROCm build flag I'm missing for this iGPU class?

Not trying to dunk on ROCm — I want to use it for the unified-memory story on iGPUs, but Vulkan is faster on this class today. Curious if that flips with ROCm 8.x or with bigger silicon.

u/wolverinee04 — 6 days ago
▲ 26 r/AMDLaptops (+1 crosspost)

HX 370 in a fanless mini-PC chassis: sustained Qwen 3.5 35B A3B perf the laptop chassis can't deliver

The HX 370 in laptop form factors throttles within ~2 minutes on sustained LLM inference because the chassis can't dissipate the heat. Curious what the silicon does in a 32 dB fanless mini-PC instead — answer: a LOT.

Test bench:

- Beelink SER9 Pro

- HX 370 (12 cores, Zen 5)

- Radeon 890M (16 RDNA 3.5 CUs)

- 32GB LPDDR5x-7500 dual channel

- Stock cooling — no repaste, no fan curve mods

Workload: 60 minutes uninterrupted LLM inference at 4–8K context. Model: Qwen 3.5 35B A3B Q4_K_M (35B MoE, ~3B active params per token, ~21GB memory footprint). Backend: LMStudio (llama.cpp+Vulkan under the hood), 15–20 of ~48 layers offloaded to the 890M iGPU.
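Measurement method was nothing fancy: a shell loop like the sketch below, running alongside the session. Assumes lm-sensors; the awk grabs the first temp1_input it sees, so adjust the match if your board exposes several sensors.

# log package temp every 5s for the hour
# (power/PPT can be read the same way from the amdgpu hwmon entries)
while true; do
  echo "$(date +%s) $(sensors -u | awk '/temp1_input/ {print $2; exit}')" >> thermal.log
  sleep 5
done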

Sustained numbers across the full hour:

- tok/s: 20–22 ± 0.6. NO degradation curve.

- Package temp: 84–87°C steady. Zero thermal-throttle events in dmesg.

- Noise: stayed under 32 dB measured at 30cm.

- Power: 56–58W steady. PPT held at platform target.

- Idle return: 12W within ~10s of load ending.

For comparison, on a smaller dense model (Gemma 4 E4B Q8 with full offload via vanilla llama.cpp Vulkan): ~16 tok/s sustained. Same chassis, same hour.

The HX 370 + 890M combo is genuinely capable of MoE-class inference at sizes that the laptop chassis throttles to uselessness. From the perf-thread data points in this sub: typical Strix Point laptops on the same silicon hold ~10–13 tok/s on equivalent workloads because they hit thermal limits and clock down within 90–120 seconds.

Two takeaways for HX 370 buyers:

  1. The silicon has more sustained performance than any laptop will let you experience. If you're CPU/iGPU-bound on a Strix Point laptop, the chip isn't the limit — your chassis is.

  2. The 890M iGPU running LLM inference via Vulkan is genuinely useful at MoE 35B-class models with partial offload. ~20 tok/s at that class is not "tech demo" speed, it's actual-work speed.

Caveats:

- 32GB is soldered. Path to 64GB on this unit is non-existent (and same story on most current HX 370 laptops).

- Linux RADV story is rock solid for inference. ROCm 7.x technically supports the 890M but benches slower in my testing.

- For 35B at Q6/Q8 you'd need 64–128GB unified — that's Strix Halo territory.

Anyone running similar hour-long sustained tests on Strix Point laptops on the same MoE? I'd love to see the throttle curve on a Framework 16 / G14 HX 370 vs this fanless box for a direct chassis-vs-silicon comparison.

u/wolverinee04 — 6 days ago

Two weeks of running Hermes Agent as the daily driver on a local stack.

Sharing the trade-offs because anyone evaluating agent runtimes for local models is going to hit these.

Underlying model: Qwen 3.5 35B A3B Q4_K_M running on a fanless mini-PC (Ryzen AI 9 HX 370, Radeon 890M iGPU, 32GB RAM) via LMStudio's Vulkan backend. ~20–22 tok/s steady at 4–8K ctx. The model is fast enough; this post is about what the AGENT runtime adds and subtracts.

Three things Hermes Agent does WELL:

  1. Tool-call composition past 5 steps. The earlier runtime I was using reliably lost the plot around step 5–6. Hermes holds coherence past 10.

  2. Self-correction. When a tool call returns an error or unexpected schema, Hermes retries with a different approach more often than not — the simpler runtime would just give up.

  3. Consistency on structured output. CSV / JSON outputs are reproducibly clean across runs. The simpler runtime needed ~20% retries to get clean output.

Three things Hermes Agent makes WORSE:

  1. Latency per response. Each tool-call round-trip is ~30–40% slower than the simpler runtime. Cumulative effect over a 10-step workflow is substantial — what was 80s is now ~120s.

  2. Context budget. Hermes injects ~8K of system prompts + tool definitions into every call. On a model with 32K context, you're effectively working with ~24K of usable conversation context. Shows up as earlier truncation on long agent sessions.

  3. Setup complexity. The simpler runtime's config was 3 lines. Hermes is a real config file with several tuning knobs.

Three real workloads I'm running 24/7:

A) Daily AI-news brief (cron 7 AM): SearXNG + summary + markdown dump. ~70 seconds with Hermes; was ~50 with the simpler runtime. But the summaries are noticeably tighter — fewer "AI told me three points incoherently" outputs. (Crontab sketch after this list.)

B) Heartbeat scraper: 5 sites, daily diff, log append. ~20 seconds. No quality difference vs the simpler runtime here — workload is too small to expose Hermes's planning advantages.

C) Ad-hoc structured scrapes: "Get last 10 releases, dump to CSV." ~90s. Quality clearly better — fewer field-naming inconsistencies, fewer missed breaking-change flags.
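For what it's worth, the cron side of A and B is trivial. A hypothetical crontab (paths and script names are placeholders, not my actual setup):

# 7 AM daily brief, then the heartbeat scraper
0 7 * * * /home/me/agents/daily_brief.sh >> /home/me/logs/brief.log 2>&1
15 7 * * * /home/me/agents/heartbeat_diff.sh >> /home/me/logs/heartbeat.log 2>&1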

The verdict for me: the latency cost is worth it for the planning + retry quality on multi-step workloads. NOT worth it for short, deterministic workloads where the simpler runtime is faster and equally accurate.

Heuristic I'm using: if the workload is >5 tool calls deep OR involves self-correction, Hermes wins. Otherwise, fall back to a lighter runtime.

What agent runtimes are you all using on local models? Curious especially if anyone's run Hermes Agent against the new agent frameworks (the OSS community has been shipping fast lately) on the same hardware + model.

u/wolverinee04 — 6 days ago

7 days running Qwen 3.5 35B A3B on a fanless mini-PC iGPU as a 24/7 personal AI agent: what works, what doesn't

Sharing a week of real use because the "can a 35B-MoE actually be a daily-driver on consumer hardware" question keeps coming up.

Stack:

- Hardware: Beelink SER9 Pro (Ryzen AI 9 HX 370, Radeon 890M iGPU, 32GB LPDDR5x-7500). Fanless, 32 dB, ~12W idle.

- Model: Qwen 3.5 35B A3B Q4_K_M (35B-param MoE, ~3B active per token). ~21GB total memory footprint with KV cache.

- Inference: LMStudio with Vulkan backend. 15–20 of ~48 layers offloaded to the iGPU (~31–42% of layers). Rest on CPU. Steady 20–22 tok/s at 4–8K ctx.

- Agent: Hermes Agent driving the model through LMStudio's OpenAI-compatible endpoint.

- Search: self-hosted SearXNG via Docker for private web search.
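For anyone wiring up the same stack: SearXNG is one docker run (official image), and LMStudio's server speaks the plain OpenAI chat-completions API (port 1234 is its default). Both lines below are sketches; the host port and model id are whatever you've configured and loaded:

# private search backend for the agent
docker run -d --name searxng -p 8888:8080 searxng/searxng

# sanity-check the endpoint the agent talks to
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3.5-35b-a3b","messages":[{"role":"user","content":"ping"}]}'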

Three workloads I tested at length:

  1. Daily news brief (cron, 7 AM):

    - Hermes queries SearXNG for top AI stories last 24h, model summarizes each into ~2 sentences, output saves as dated markdown.

    - Time per run: ~50–70s (slower than the Gemma 4 E4B version because of Hermes Agent overhead, but quality is better).

    - Reliability over 7 days: 7/7 ran cleanly.

  2. Heartbeat scraper:

    - Daily, hits 5 sites, logs diffs.

    - Time per run: ~15–20s. Tokens: ~250.

    - Reliability: 7/7. No false positives, two genuine catches.

  3. Ad-hoc structured scraping:

    - "Pull the last 10 GitHub releases of OpenClaw, give me version + date +

key changes + breaking changes flag, dump to CSV."

- Time: ~90s. Tokens: ~2000.

- Output: clean CSV, no manual cleanup. The breaking-changes flag was

subjective and the model called it correctly 8/10 times.

Where Qwen 3.5 35B A3B Q4_K_M visibly struggles:

- Hard math past 5–6 step proofs. Q4 hurts here.

- Long-context summarization (>20K input). The model's effective ctx for agent work is constrained by Hermes injecting ~8K of system prompts + tool defs into the budget.

- Code generation past ~150 LOC. Loses coherence on bigger refactors.

Tok/s curve I measured:

- 0–4K ctx: 20–22 tok/s

- 4–8K ctx: 19–21 tok/s

- 16K ctx: ~17 tok/s

- 24K ctx: ~14 tok/s (and TTFT becomes painful — the partial offload means prompt processing is CPU-bound)

Power numbers (running 24/7):

- Idle: ~12W

- Inference burst: ~58W

- 7-day average: ~18W

- ~$3.50/mo on US-typical electricity rates (~18W average works out to ~13 kWh/month)

Compared to the Gemma 4 E4B Q8 daily-driver setup I was running before:

- Qwen 35B A3B is noticeably more capable on agent tool-call loops and multi-step planning.

- Tok/s is similar (Gemma 16, Qwen 20–22 — Qwen is faster on this hardware because MoE active params are tiny).

- Memory pressure is much higher — 21GB vs 8GB. If I want to run anything alongside the agent, Qwen pushes it.

Anyone running Qwen 3.5 35B A3B as a daily-driver agent? Curious especially if anyone's on Strix Halo (8060S, 128GB unified) — does full offload at that class beat partial offload at the 890M class, and is it worth the chassis + cost step-up?

u/wolverinee04 — 6 days ago
▲ 6 r/ollama

Posting because anyone with similar hardware (AMD APU + iGPU inference) is likely hitting the same wall I did. Caveat upfront: this test is specifically Gemma 4 E4B. I switched to llama.cpp+Vulkan and didn't retest Ollama on the bigger models I'm now running, so this finding is scoped to that one model.

Setup:

- Ryzen AI 9 HX 370, Radeon 890M iGPU, 32GB LPDDR5x-7500 (Beelink SER9 Pro).

- Gemma 4 E4B Q8_0 (8B dense), full offload.

- Ubuntu 24.04, kernel 6.11, mesa 24.x with RADV.

Numbers (steady state, 4096 ctx, FA on, batch 1):

- Ollama 0.6.x default backend: ~6.4 tok/s

- Same model GGUF, llama-server (built ggml-org/llama.cpp with -DGGML_VULKAN=ON): ~16 tok/s

Why the gap (best read from the GH discussions):

The vendored llama.cpp inside Ollama is months behind upstream. The Wave32 flash-attention and graphics-queue patches that landed upstream in early 2026 haven't been pulled into Ollama's vendored copy yet. There's an open tracking issue in the Ollama repo (search "AMD Vulkan" in issues — the gap is documented).

What I'm still using Ollama for:

- Quick "ollama pull X && ollama run X" sanity checks on new model releases.

The DX is unbeatable for that.

- Mac mini for personal work (M4 — Metal does the heavy lifting and Ollama is great there because the Vulkan/ROCm story doesn't apply).

What I switched to for the agent loop:

- llama-server when I want hand-tuned full-offload setups.

- LMStudio when I want easier layer-offload tuning for bigger MoE models that don't full-offload (testing Qwen 3.5 35B A3B with partial offload right now — 20–22 tok/s with 15–20 of ~48 layers on the iGPU).

Setup for the llama-server path is genuinely 3 commands:

git clone https://github.com/ggml-org/llama.cpp

cd llama.cpp && cmake -B build -DGGML_VULKAN=ON && cmake --build build -j

./build/bin/llama-server -m gemma-4-e4b-it.Q8_0.gguf --host 0.0.0.0 -ngl 99

Wrap in systemd if you want auto-start. Done.
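A minimal unit sketch with placeholder paths (drop it in /etc/systemd/system/llama-server.service, then systemctl enable --now llama-server):

[Unit]
Description=llama-server (Vulkan)
After=network-online.target

[Service]
ExecStart=/opt/llama.cpp/build/bin/llama-server -m /opt/models/gemma-4-e4b-it.Q8_0.gguf --host 0.0.0.0 -ngl 99
Restart=on-failure

[Install]
WantedBy=multi-user.target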

If anyone here has tested Strix Halo (40-CU 8060S, 128GB unified LPDDR5x) on Ollama vs upstream llama.cpp, I'd love to see if the gap closes at the bigger APU class. Also curious if anyone's running Ollama on Qwen 35B-class MoEs with partial offload — does Ollama support that as cleanly as LMStudio does these days?

u/wolverinee04 — 6 days ago

TL;DR:

Q4_K_M Qwen 3.5 35B A3B running at 20–22 tok/s steady at 4–8K context on a Beelink SER9 Pro (Ryzen AI 9 HX 370, Radeon 890M iGPU, 32GB LPDDR5x). Only 15–20 of ~48 layers offloaded to the iGPU — the rest on CPU. ~21GB total RAM use.

Setup: LMStudio with the Vulkan (RADV) backend. LMStudio is just llama.cpp under the hood, but the layer-offload slider is way easier to tune than relaunching llama-server with a different -ngl every time you want to test a new offload ratio.

Why this is interesting: most people assume you need full GPU offload for this class of model. You don't, at least on iGPU + LPDDR5x systems where the "GPU memory" is just system RAM anyway. Partial offload at ~30–40% of layers hits the sweet spot — enough compute on the iGPU to amortize the matmuls, not so much that you're fighting bandwidth.
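If you'd rather do this with plain llama-server instead of the LMStudio slider, -ngl is the same knob (filename illustrative; ~18 layers is the middle of the range above):

# ~18 of ~48 layers on the iGPU, rest on CPU; -c sets context size
./build/bin/llama-server -m qwen3.5-35b-a3b.Q4_K_M.gguf -ngl 18 -c 8192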

The MoE architecture helps a lot. Active params per token are ~3B (out of 35B), so per-token compute is small even though the model footprint is big. The 890M handles the active expert just fine.

For comparison, on the same hardware:

- Gemma 4 E4B Q8 (8B dense, full offload via vanilla llama.cpp Vulkan): ~16 tok/s

- Qwen 3.5 35B A3B Q4_K_M (35B MoE, partial offload via LMStudio's Vulkan): 20–22 tok/s

Yes, the bigger MoE model is FASTER than the smaller dense one on this hardware. That surprised me.

Separate finding from earlier testing — Ollama on Gemma 4 E4B (full offload): ~6.4 tok/s. Same model, same machine, same quant. The vendored llama.cpp inside Ollama is behind upstream's Wave32 FA + graphics-queue patches that landed in 2026. I didn't retest Ollama on Qwen 35B because LMStudio's Vulkan path was already working, but I'd expect a similar gap on AMD APUs.

Caveats:

- Q4_K_M loses some quality vs Q6/Q8. For agent tool-call workflows it still hits its function-calling targets reliably; for harder reasoning tasks, you feel the quant.

- Time-to-first-token at long context (16K+) gets slower because prompt processing on partial offload is bottlenecked by the CPU layers. Generation speed holds; TTFT degrades.

- I'm using Hermes Agent as the runtime now (swapped from OpenClaw). It's more capable but slower per response — framework overhead — and its system prompts + tool definitions eat ~8K of the model's context budget. So if your Qwen setup advertises 32K context, expect ~24K usable for actual conversation under Hermes. Trade-off worth knowing.

The Qwen 35B A3B + Hermes Agent migration is going into a follow-up.

Has anyone tested Qwen 3.5 35B A3B on Strix Halo (8060S iGPU, 128GB unified LPDDR5x)? Curious if full offload is even useful at that class or if partial still wins.

u/wolverinee04 — 6 days ago