r/StrixHalo

▲ 960 r/StrixHalo+1 crossposts

Don't know if the date was announced yet, but this was just said a few moments ago at AMD AI Dev Day. No word on price, but I think it's made by Lenovo based on the plug earlier in the presentation.

Edit: They had a unit on a table and I just confirmed with an engineer that it is just a 395 with 128 GB and no changes.

u/1ncehost — 13 days ago
▲ 481 r/StrixHalo+1 crossposts

vLLM ROCm has been added to Lemonade as an experimental backend

vLLM can run .safetensors LLMs before they are converted to GGUF and represents a new engine to explore. I personally had never tried it out until u/krishna2910-amd, u/mikkoph, and u/sa1sr1 made it as easy as running llama.cpp in Lemonade:

lemonade backends install vllm:rocm
lemonade run Qwen3.5-0.8B-vLLM
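
If you want to hit it from your own tooling once the server is up with this backend, it speaks the OpenAI API; a rough example (assuming the default port and the /api/v1 route; adjust if you changed them):

curl http://localhost:8000/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen3.5-0.8B-vLLM", "messages": [{"role": "user", "content": "Hello from the vLLM ROCm backend"}]}'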

This is an experimental backend for us in the sense that the essentials are implemented, but there are known rough edges. We want the community's feedback to see where and how far we should take this. If you find it interesting, please let us know your thoughts!

Quick start guide: https://lemonade-server.ai/news/vllm-rocm.html GitHub: https://github.com/lemonade-sdk/lemonade Discord: https://discord.gg/5xXzkMu8Zk

u/jfowers_amd — 5 days ago
▲ 106 r/StrixHalo+1 crossposts

Luce DFlash + PFlash on AMD Strix Halo: Qwen3.6-27B at 2.23x decode and 3.05x prefill vs llama.cpp HIP

Hey fellow Llamas, keeping it short.

We just shipped DFlash and PFlash support for the AMD Ryzen AI MAX+ 395 iGPU (gfx1151, Strix Halo, 128 GiB unified memory). Same Luce DFlash stack from the RTX 3090 post a couple weeks back, now running on the consumer AMD APU class.

Repo: https://github.com/Luce-Org/lucebox-hub (MIT)

TL;DR

End-to-end on Qwen3.6-27B Q4_K_M with the Luce Q8_0 DFlash drafter: 26.85 tok/s decode and 20.2 s prefill at 16K context.

That is 2.23x faster decode and 3.05x faster prefill than llama.cpp HIP on the same silicon. At a 16K prompt + 1K generation workload, total wall clock drops from 147 s to 58 s, 2.5x faster end to end.

The same 128 GiB box hosts checkpoints up to ~100 GiB, a class of models a 24 GiB consumer GPU cannot touch (Qwen3.5-122B-A10B, MiniMax-M2.7-REAP 139B-A10B, full BF16 27B).

The numbers

Hardware: Ryzen AI MAX+ 395, Radeon 8060S iGPU (gfx1151), 128 GiB LPDDR5X-8000, ROCm 7.2.2
Target: Qwen3.6-27B Q4_K_M (15.65 GiB)
Drafter: Lucebox/Qwen3.6-27B-DFlash-GGUF Q8_0 with DFLASH27B_DRAFT_SWA=2048
Bench: 10-prompt HumanEval-style, --n-gen 128 --ddtree-budget 22 --fast-rollback

Decode (Qwen3.6-27B Q4_K_M, tok/s):

Engine                | tok/s | vs AR
llama.cpp HIP AR      | 12.02 | 1.00x
llama.cpp Vulkan AR   | 12.45 | 1.04x
Luce DFlash (this PR) | 26.85 | 2.23x

Prefill (Qwen3.6-27B, 16K tokens):

Engine           | TTFT    | vs AR
llama.cpp HIP AR | 61.69 s | 1.00x
Luce PFlash      | 20.2 s  | 3.05x

Speedup grows with context: PFlash compress is O(S), AR prefill is O(S^2). NIAH retrieval still passes at 16K.

Tuning note: --ddtree-budget=22 is the gfx1151 optimum. Higher budgets accept more tokens per step but each step gets more expensive on LPDDR5X. Bandwidth caps the benefit before tile utilization pays off. Contrast with gfx1100 (7900 XTX, GDDR6 936 GB/s) where budget=8 wins, tile waste matters more than launch amortization. Default ship is arch-aware.

Reproduce


# 1. Build PR #119 for gfx1151
git clone https://github.com/Luce-Org/lucebox-hub.git
cd lucebox-hub
git fetch origin pull/119/head:pr119 && git checkout pr119
git submodule update --init --recursive
cd dflash
cmake -B build -S . \
  -DCMAKE_BUILD_TYPE=Release \
  -DDFLASH27B_GPU_BACKEND=hip \
  -DDFLASH27B_HIP_ARCHITECTURES=gfx1151 \
  -DDFLASH27B_HIP_SM80_EQUIV=ON
cmake --build build --target test_dflash -j

# 2. Models: Qwen3.6-27B target + Lucebox Q8_0 DFlash drafter
mkdir -p models/draft
hf download unsloth/Qwen3.6-27B-GGUF Qwen3.6-27B-Q4_K_M.gguf --local-dir models/
hf download Lucebox/Qwen3.6-27B-DFlash-GGUF dflash-draft-3.6-q8_0.gguf --local-dir models/draft/

# 3. Bench (DFlash decode + PFlash long-context prefill)
LD_LIBRARY_PATH=/opt/rocm/lib:$LD_LIBRARY_PATH \
DFLASH_BIN=$PWD/build/test_dflash \
DFLASH_TARGET=$PWD/models/Qwen3.6-27B-Q4_K_M.gguf \
DFLASH_DRAFT=$PWD/models/draft/dflash-draft-3.6-q8_0.gguf \
DFLASH27B_DRAFT_SWA=2048 \
DFLASH27B_PREFILL_UBATCH=512 \
  python3 scripts/bench_he.py --n-gen 128 --ddtree-budget 22

DFLASH27B_PREFILL_UBATCH=512 applies the PR #159 fix on top of PR #119. Once #159 merges, this is the daemon default.

What is still missing

  • BSA scoring kernel on HIP. The drafter compress-score path uses BSA (block-sparse attention) on CUDA. PR #119 disables it on HIP and falls back to ggml's flash_attn_ext, which the daemon's own warning flags as ~3.4x slower. A rocWMMA-native sparse-FA kernel closes the gap. After it lands, PFlash TTFT at 16K drops from 27.6 s to roughly 8 s. At 128K, projected 7-10x over llama.cpp AR.
  • Multi-row q4_K decode GEMV. RDNA-native multi-row pattern (R=4-8 output rows sharing activation register state) for the drafter forward, currently 30% of compress time at long context.
  • Phase 2 tile shape tuning for gfx1151. Current rocWMMA flashprefill tiles are tuned for gfx1100. Strix Halo has different LDS and VGPR characteristics.
  • 70B+ MoE targets. 128 GiB headroom is wasted on a 27B. Qwen3.5-122B-A10B and MiniMax-M2.7-REAP 139B-A10B both fit. DFlash math ports cleanly to MoE; the big work is wiring the expert-routed forward into the spec verify loop.

Constraints

ROCm 7.2.2+, gfx1151 tuned (gfx1100 also supported with arch-aware defaults), greedy verify only, no Vulkan / Metal / multi-GPU on this path yet.

We're working hard on this but we know we need to improve on many things.

Feedback is more than welcome :)

u/sandropuppo — 1 day ago

Hermes and Lemonade on Framework Desktop (trip report by a newbie)

Here's a quick report from an LLM newbie about my experience running Hermes locally.

Me: retired software architect (database internals, Linux sysadmin, containers and Kubernetes) - 40+ years in the biz, retired a year ago. Little knowledge about LLMs and such until a month or so ago.

Hardware: Framework Desktop 128 GB. (Strix Halo)

I am running Hermes on Fedora pointed at my local containerized Lemonade server. It's currently using Qwen3.5-27B-GGUF as the only model.

It took a few false starts to get it set up and running, but it's actually starting to do some useful work. 

Problems getting it to work:

• The containerized version of Hermes was causing me trouble (which I cannot describe, might have been user error), so I am now running Hermes non-containerized.

• Lemonade and the model are configured for "thinking". This caused Hermes to generate totally obscure "connection error" messages when it tried to connect to Lemonade. That made me think there was a DNS issue or a firewall problem, but in fact it was just that Hermes didn't like all the "thinking" data that Lemonade was throwing at it. (I would NEVER have figured that out; free Claude actually did.) To fix this I had to add an entry to my Lemonade docker compose file to set LEMONADE_LLAMACPP_ARGS=--chat-template-kwargs '{"enable_thinking":false}' (see the compose sketch after this list).

• The model didn't have a large enough default context size. I had to add the following to Lemonade's /root/.cache/lemonade/recipe_options.json file:

 "Qwen3.5-27B-GGUF": {

   "ctx_size": 128000

 }

• Once these were sorted Hermes started up and kinda worked. I ran it with a couple of models, including Llama-3.2-3B-Instruct.GGUF and Qwen3-14B-GGUF.  The llama model kinda worked but had some issues. The 14B Qwen model frequently went into crazy loops when I asked it to do some basic tasks - it was incapable of setting up a cron job in Hermes, for example, and spun out trying to do so. 

• I am reasonably familiar with systemd, I thought, but had never been exposed to user scoped systemd services and timers and such. Hermes uses them in a way which feels quite cool, but which was a deep mystery to me when I started 48 hours ago. (That's a me problem, not a Hermes problem.)

• Setting up Discord messaging was complicated, but on the Discord side. Trying to figure out how to create the bot and get the necessary ids was a wee bit tricky but I got it done. (Again probably a me problem.)
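
For reference, the compose change from the "thinking" bullet above ended up looking roughly like this (the service name is whatever yours is called, and the exact quoting may need adjusting; treat this as a sketch, not my literal file):

services:
  lemonade:
    environment:
      - LEMONADE_LLAMACPP_ARGS=--chat-template-kwargs '{"enable_thinking":false}'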

What It's doing successfully so far (in less than a day)

Once I got it up and running with 27B I had it do a couple of things:

• I gave it a list of hosts to ping every 30 minutes and instructed it to notify me via Discord if anything goes down or comes back up. It wrote a pretty nice script to do that. It then forgot to put it into a cron job, which I discovered the next day - but when I pointed that out it fixed it easily.

• I gave it another list of hosts to ssh to every 30 minutes and monitor in the same way. It figured that out easily enough, but the script messaged me with the details every 30 minutes whether anything was wrong or not. I pointed that out and it fixed it.

• I asked it if it could monitor ZFS filesystems, and it said that it could. It wrote a script which reported a warning for something that wasn't actually wrong. I reported that and it fixed it after a few false starts.

Performance:

Hermes replies to simple messages in 30 seconds. It replies to more complicated things in a minute or so. It took it maybe 4 or 5 minutes to fix one of the scripts that it wrote, for example.  

Lemonade is properly configured to use the Strix Halo GPU (which was another horror story for another day). The LLM server uses about 50% of one CPU while Hermes is querying it (GPU is near 100%).

For example, I just asked it "what is the weather forecast for tomorrow for Roseville, California", which I had never asked it before. It nattered around for nearly 2 minutes before giving me the forecast from wttr.in (which is wildly different from what the National Weather Service says, but that's probably not a Hermes problem ... other than why it chose to look at wttr.in in the first place).

Summary

• You can absolutely do some stuff self-hosted with Hermes + Lemonade on Framework Desktop

• It's not fast

• It's not perfect

• In a couple of hours it put together a monitoring solution that I had been meaning to build by hand for a month (yes, I should have just been using Uptime Kuma).

• It'll be a while before I trust it with anything important

• But it seems somewhat useful and fun

Next Steps: 

• Point it at my Home Assistant?

• Voice?

• TBD

u/Sjsamdrake — 1 day ago
▲ 111 r/StrixHalo+1 crossposts

Running Minimax 2.7 at 100k context on strix halo

Just wanted to share because it took me a lot of tweaking to get here:

llama-server -hf unsloth/MiniMax-M2.7-GGUF:UD-IQ3_XXS --temp 1.0 --top-k 40 --top-p 0.95 --host 0.0.0.0 --port 8080 -c 100000 -fa on -ngl 999 --no-context-shift -fit off --no-mmap -np 2 --kv-unified --cache-ram 0 -b 1024 -ub 1024 --cache-reuse 256

Reasoning behind the various options

--no-context-shift I want to know when I run out of context instead of silently corrupting stuff

--no-mmap Recommended by Donato

-np 2 Retain context for up to two concurrent sessions

--kv-unified Make the two sessions share the same cache to save VRAM

--cache-ram 0 Do not swap cache to RAM; it stays in VRAM instead. This solved a lot of OOMs for me.

-b 1024 -ub 1024 Improve prefill performance.

--cache-reuse 256 Attempt to reuse cache "smartly". This sometimes helps avoid having to reprocess cache but also sometimes hurts, so use at your own discretion.

Additional setup

Headless Fedora Linux according to Donato's setup guides (but sans-toolbox). I also recommend increasing your swap size and setting OOMScoreAdjust=500 in your systemd service file, otherwise, you risk the oom killer killing important things if you do run out of ram.
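
For the OOMScoreAdjust part, a systemd drop-in is enough. Rough sketch, assuming your server runs as a unit called llama-server.service (adjust the name to whatever yours is):

sudo systemctl edit llama-server.service
# add to the override:
#   [Service]
#   OOMScoreAdjust=500
sudo systemctl restart llama-server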

Intelligence

I've found minimax to be great at coding but not necessarily as "well rounded" as Qwen3.6 27b. It's not as strong at coding architecture discussions or code review. Qwen may also be stronger at non-coding stuff.

Where minimax shines is in coding "intuition", it "just gets you". When Qwen would take things too literally or fail to get the gist of things, Minimax better understands "intent". It may also have more "knowledge" than Qwen 27b due to having more parameters.

Performance

https://preview.redd.it/695zwpa6660h1.png?width=1000&format=png&auto=webp&s=c4a584f1aa9e2e8c406f44194097f66ce86cce13

https://preview.redd.it/2ojq0ts7660h1.png?width=1000&format=png&auto=webp&s=029f583fb4344be00c3681cf3a24722cf59123c7

EDIT Look_0ver_There suggested I add a little disclaimer that this only works for "concurrency = 1" scenarios. Because we're using --kv-unified, if you have concurrent requests, the second request has a chance of poisoning the cache of the first session.

u/Zc5Gwu — 4 days ago
▲ 6 r/StrixHalo+1 crossposts

Questions about moving over to Linux from Windows for a Linux Newbie (I work in IT but always used Windows and only ever tinkered with Linux on Raspberry pi years ago)

Hi

Lots of previous discussions have suggested that instead of Windows 11 I try Linux to get better Local LLM speeds on my Corsair AI Workstation 300 with AMD Ryzen AI Max+ 395 and 128GB RAM

I have some questions if you don't mind so I can make sure I do all of this correctly, as some of my initial tests didn't go so well (see bottom of post):

1) Choice of Distro?

Ubuntu or Fedora

2) Shared VRAM settings in grub and BIOS

A lot of sites talk about setting ttm.pages_limit and amdgpu.gttsize

Options seem to be:

a) editing grub and adding the following (rough sketch of where this goes below the options):

amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432

or

b) installing AMD tools and using amd-ttm to set the shared VRAM

sudo apt install pipx

pipx install amd-debug-tools

amd-ttm

amd-ttm --set 100
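
For reference, my understanding of option (a) in practice is roughly the following (assuming Ubuntu's standard GRUB layout; please correct me if this is wrong):

# append the parameters to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, e.g.
# GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432"
sudo nano /etc/default/grub
sudo update-grub
sudo reboot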

A lot of the articles I found are older, so what is the current best way to do this, and what should I set both via these settings and in the BIOS?

3) ROCM or Vulkan?

Do I use ROCM or Vulkan with Ollama / LM Studio / Lemonade etc?

And if so, what's the best way to install/configure it? E.g. for Ollama, which environment variables need setting?

Previous tests and issues

I initially installed Ubuntu 26.04 but had issues with ROCM drivers and found lots of posts about 24.04 being a better choice, so installed that

Running models in Ollama seemed to work with Vulkan after adding below:

Environment="OLLAMA_VULKAN=1"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="ROCR_VISIBLE_DEVICES="

But without the Environment="ROCR_VISIBLE_DEVICES=" entry I got errors trying to use models with ROCM:

ollama run llama3.3:70b
Error: 500 Internal Server Error: llama runner process has terminated: cudaMalloc failed: out of memory
error loading model: unable to allocate ROCm0 buffer
panic: unable to load model: /usr/share/ollama/.ollama/models/blobs/sha256-4824460d29f2058aaf6e1118a63a7a197a09bed509f0e7d4e2efb1ee273b447d

I then tried LM Studio and it worked fine with Vulkan set as the runtime, but with ROCM I just kept getting "Failed to load model" with no further error info

I also tried Lemonade-Server and that again works with Vulkan but not ROCM

Summary

So what was I doing wrong in my initial tests, and based on the answers to my questions, what is my best option to get the best model performance on this system?

Thanks for reading - sorry it is a long post, but wanted to give all detail possible

Anything else you need to know to help, then just ask

u/wingers999 — 1 day ago
▲ 195 r/StrixHalo+1 crossposts

Before and after of my homelab

  • Bosgame M5 (Strix Halo, 128 GB RAM; runs Proxmox, a bunch of LXCs and Docker containers, and Qwen 3.6 35B-A3B Q8 at 60 t/s with full context.)
  • 2x 4bay USB3 thingies with 6x 26TB and 2x 16TB
  • Tec Mojo 10" Rack with 12U
  • Flint 2 WLAN router (runs Docker with Portainer and PiHole)

It'll all end up in the same place the mess was before; this placement is just temporary for installation.

edit #1: 60 instead of 80 t/s.

u/tecneeq — 5 days ago

Strix Halo plus R9700 eGPU, Fedora 44. Best of both worlds.

I recently connected an R9700 to my Strix Halo. On Fedora 44 it was very easy. The iGPU is rendering the OS to save VRAM on the R9700. I am using the llama.cpp toolbox for the iGPU and HIP_VISIBLE_DEVICES to target the right GPU. The R9700 feels lightning fast; speed does fluctuate, but Qwen3.6 35B q4-k-m does PP 2100 and TG 87.

One possible use would be to have a big, slow 27B on the iGPU to create plans and perform reviews, and have the fast R9700 execute the plans. You could assign different agents to separate GPUs and work concurrently without any slowdown. If you need someone to talk to, you can still load a chat model on the NPU to keep you busy while your agents work.
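
Roughly what I mean, as a sketch; the device indices, ports, and model files here are just examples, and you need to check which index maps to which GPU on your box:

# iGPU: big, slow planner/reviewer model
HIP_VISIBLE_DEVICES=0 llama-server -m big-27b-q4_k_m.gguf --port 8081 &
# R9700: fast executor model
HIP_VISIBLE_DEVICES=1 llama-server -m qwen3.6-35b-q4_k_m.gguf --port 8082 &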

There aren't many options, as far as I know, for software to take advantage of this setup, but I'll start with Open-Notebook and see what else I can find. Send me any ideas you have for software or workflows.

u/I-will-allow-it — 3 hours ago
▲ 23 r/StrixHalo+2 crossposts

fine-tuning 27B hybrid models on strix halo (ryzen ai max+ 395 / gfx1151, 128 gb unified memory) — full guide, patches, orchestrator

Sharing a guide I just published for fine-tuning 27B+ LLMs on AMD Strix Halo (Ryzen AI MAX+ 395, Radeon 8060S / gfx1151, 128 GB unified memory). MIT licensed.

Repo: https://github.com/h34v3nzc0dex/strix-halo-llm-finetune-guide

None of the individual pieces are novel — kernel patches, ROCm 7.13 nightly, FLA, bitsandbytes, LoRA, llama.cpp. The intersection (Strix Halo + gfx1151 + FLA + Qwen3.5 hybrid at 27B) isn't documented anywhere I could find, and getting it stable took a lot of dead ends I'd rather other people skip.

Stack tested: kernel 6.19.14, PyTorch 2.11.0+rocm7.13.0a20260506, ROCm 7.13 nightly, FLA 0.5.1 patched, bitsandbytes 0.50.0.dev0 built from source for gfx1151, llama.cpp b867+. Hardware: Corsair AI Workstation 300 (Sixunited AXB35-02 board, BIOS 3.07).

Things the guide actually covers that I had to figure out the hard way:

  • PyPI bitsandbytes ships zero ROCm binaries. From-source build with -DROCM_VERSION=83, plus a runtime symlink libbitsandbytes_rocm83.so → libbitsandbytes_rocm713.so so bnb's HIP detection on PyTorch 2.10/2.11 stops complaining.
  • FLA's Triton kernels crash on gfx1151 (RDNA 3.5) with num_warps > 4 (Triton#5609) and a tl.cumsum + tl.sum codegen interaction (Triton#3017). Idempotent re-patch script included.
  • In-process Trainer eval at 27B / 8192 seq length is structurally broken on unified-memory APUs — either kernel TTM page allocation failure from fragmentation, or memory watchdog SIGKILL when free RAM drops under ~8 GB. Eval is moved out-of-process via a bash orchestrator aligned to save_steps, waiting for full GPU release between train and eval, with a JSONL trend log.
  • Mainline kernel .deb run-parts double-dir bug on Ubuntu 24.04+ leaves packages half-configured. Repack script included.
  • /srv perms regressing to 0750 mid-training breaks importlib.metadata path traversal and crashes TRL's create_model_card. Cron watchdog restoring 755.
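
The watchdog itself is tiny; a sketch of the crontab entry (the interval is arbitrary, /srv as in the bullet above):

# root's crontab (crontab -e): restore /srv perms every 5 minutes
*/5 * * * * chmod 755 /srv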

Verified result: in-progress production fine-tune of Qwen3.5-27B (hybrid, 16 full-attention + 48 GatedDeltaNet layers), bf16 LoRA r=128/α=256, eval rolling at 0.13 loss / 96.5% token accuracy, ~11 min/step, ~4-day total runtime.

Feedback and issues welcome, especially from people on different AXB35-02 boards or non-Corsair Strix Halo systems — I'd like to know what's board-specific vs. generic.

https://preview.redd.it/8i3ebs27h00h1.jpg?width=649&format=pjpg&auto=webp&s=1a4fe453e9e46c97b71a14b993b9536288169ca1

u/Outrageous_Bug_669 — 1 day ago
▲ 87 r/StrixHalo+1 crossposts

Some of you saw our post a couple weeks back about hitting 102 tok/s stable on Qwen3.5-35B on a DGX Spark. A lot of you asked "cool, where's the code?" Today's the day: Github

Atlas is open source. Pure Rust + CUDA, no PyTorch, no Python runtime, ~2.5 GB image, <2 minute cold start. We rewrote the whole stack from HTTP handler to kernel dispatch because the bottleneck on Spark wasn't the silicon, it was 20+ GB of generic Python machinery sitting between your prompt and the GPU. We need community support to keep elevating Atlas for developers.

Numbers on a single DGX Spark (GB10):

Qwen3.5-35B (NVFP4, MTP K=2): 130 tok/s peak, ~111 tok/s sustained → 3.0–3.3x vLLM at testing time

Qwen3.5-122B (NVFP4, EP=2): ~50 tok/s decode

Qwen3-Next-80B-A3B (NVFP4, MTP): ~87 tok/s

Nemotron-3 Nano 30B (FP8): ~88 tok/s

Full model matrix on the site (Minimax2.7, Qwen3.6, Gemma too!)

What's actually different:

Hand-tuned CUDA kernels for Blackwell SM120/121: attention, MoE, GDN, Mamba-2. No generic fallbacks.

Native NVFP4 + FP8 on tensor cores

MTP (Multi-Token Prediction) speculative decoding for up to 3x throughput on decode

OpenAI + Anthropic API on the same port, works with Claude Code, Cline, OpenCode, Open WebUI out of the box

Try it (two commands):

docker pull avarok/atlas-gb10:latest
sudo docker run -d --name atlas --network host --gpus all --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  avarok/atlas-gb10:latest serve Qwen/Qwen3.6-35B-A3B-FP8 \
  --port 8888 --speculative --enable-prefix-caching
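
Once it's up, anything that speaks the OpenAI API can point at it. Quick smoke test, assuming the standard /v1 chat completions path on the port above:

curl http://localhost:8888/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3.6-35B-A3B-FP8", "messages": [{"role": "user", "content": "hello"}]}'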

What's next especially for the non-Spark folks: we're working with Spectral Compute on a Strix Halo port, and AMD is giving us hardware to do it properly. RTX 6000 Pro Blackwell is also on the roadmap. Same kernel philosophy, adapted per chip, we'd rather do four chips well than twenty chips badly.

X/Twitter
Site
Discord

Will be in comments all day. Hit us with edge cases, weird models, broken configs. The roadmap is genuinely community-driven. MiniMax M2.7 landed because someone in Discord asked.

u/Live-Possession-6726 — 5 days ago

Looking for better model for debugging large code bases than Qwen3.5-122B-A10B-UD-Q5_K_XL on Strix Halo.

Has anyone found a stronger model than Qwen3.5-122B-A10B-UD-Q5_K_XL for difficult code debugging in a large project to run on Strix Halo?

My typical workflow with openclaw agents is:

  1. GLM 5.1 or Deepseek or other similar (Ollama:Cloud) plans my project, writes design docs, loads gitlab up with a hundred prioritized issues
  2. Qwen3.6-27B-UD-Q5_K_XL (Radeon 9700 AI Pro) implements issues and submits merge requests (have two of these agents running each with their own GPU)
  3. Qwen3.5-122B-A10B-UD-Q5_K_XL (Strix Halo 128GB) reviews, merges, debugs, polishes

Sometimes I run into issues that neither of my two local models can solve after ollama cloud usage limits are zeroed out. Is there a stronger model that can run on the strix halo than the 122B-A10B if I don't care at all about the speed? I can let it go overnight, just want to have a local way to solve harder problems when this comes up.

Appreciate any ideas. It can be hard to cut through the noise to find the right things.

--------------------------------------------------------------------------------------------
Also, here is my launch command (Fedora, llama.cpp, ROCm 7.2) in case any bored people can offer suggestions to improve it.

./build/bin/llama-server \
  --model "Qwen3.5-122B-A10B-UD-Q5_K_XL" \
  --alias "halo" --host 0.0.0.0 --port 5000 -ngl 999 \
  --flash-attn 'on' -dio \
  -c 131072 \
  -b 1024 \
  -ub 4096 \
  --parallel 1 \
  --cache-prompt \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --log-prefix --jinja --no-mmap --metrics

u/tracker_11 — 3 days ago

Just replaced the thermal compound on my EVO-X2.. results astonishing

Hey everyone,

I bought an EVO-X2 (128gb) about a month ago and have been running it hard since then (primarily LLMs and large software builds). I was noticing since day one that the CPU was spiking very quickly to 95C even with the fans at 100%. It rarely broke 97C so I assume it was throttling.

I decided to replace the thermal compound (with polar therm x10 fwiw though I'm sure anything is better than what they used) and temperatures plummeted. I just ran a benchmark (s-tui) that would normally drive it immediately to 95C (as in under a second), and it never broke 80C, even at 4.5GHz, 100% on all cores, for one minute.

I was nervous as hell doing this to a $5k machine, and it was a pain carefully cleaning up the old goop with 99% isopropyl alcohol, but man was it ever worth it. Took me about an hour when it was all said and done, but this is the first time I've done something like this.

I think GMKtec really needs to work on their application technique, or at least QA.

If you're seeing similar CPU temps under moderate load, just know that it's not normal. Or, at least not necessary.

edit: cut my compile time of llama.cpp from over 5 minutes down to 1:34, max temp 85C. Unreal.

u/tired514 — 6 days ago

Strix Halo Clustering experience (Bossgame M5)

Hey there!

I recently got into the local hardware game with the Strix Halo (Bosgame M5); ever since buying the hardware, it has gone up in price by some 10~20% in 2 weeks.

I'm now thinking that it would be good to buy another one and cluster the two nodes to run bigger models before prices go up further.

I am an enterprise user working on sensitive code so local hosting of the model is the only way to use LLMs in my field of work.

Does anybody have experience with clustering tools for running models across multiple nodes?

The real motivation that I see behind this approach is the fact that I would have 256 GB of RAM rather than 128 GB. Based on reading some bartowski quants on Hugging Face, the models I would be able to run would be:

128 GB:

- Minimax 2.7 high q3 quant with small context

- q1/q2 version of GLM 4.7 (NOT Flash)

- q3 ish qwen 3.5 ~400b

Meanwhile with two systems, potentially:

256gb:

- Minimax q4 2.7 with decent context

- q4 of GLM 4.7

- q1/2 of GLM 5.1 (maybe higher with some REAP version)

- q4 of Qwen 3.5 ~400b

Yes I get it, qwen 3.6 27b is good, yes gemma is good, but for real agentic work and actually getting things done, I was not that happy with just those models that are in the ~32/64gb range.

What I want to find out is:

  1. What methods can you use for clustering?

1.1) I have seen people using Thunderbolt networking, which would be a nice option, but the protocol itself has very high latency due to the wrapping of the data packets into the Thunderbolt layer, and as far as my understanding goes, there is still no option for RDMA over Thunderbolt on Strix Halo as there is with Mac Studios.

1.2) I have also seen people use M.2 NVMe adapters for networking/Oculink; this is a feasible approach, but I would need to run a high-speed network card at each of the Strix Halos.

1.2.1) Would 50Gig networking be good for the interconnect? Can I do 100Gig? Over those Nvidia DGX Spark connectors?

1.2.2) What is the achievable speed? And what's the latency? (I know it's limited by the M.2 slot, with something like PCIe Gen 4 x4 speeds), but is it slower in reality?

1.3) Have I missed any additional options?

  2. What clustering techniques would work well?

    2.1) I know tensor parallelism across two machines is nice for prefill acceleration (and the Strix Halo would benefit from higher prefill speed for agentic coding workloads to process the long context). How is the stack for this? I know of vLLM Strix Halo toolboxes; is it painful to install / has it been tried?

    2.2) Pipeline parallelism: does it offer any generation speed advantages in tokens/sec? I would preferably want to use something decently fast for my work.

    2.3) Would something like Exo work on the Strix Halo? I've only seen people use it with Mac clusters and I'm under the impression that it's a Mac-specific thing.

  3. To be more clear about my background: I am an embedded engineer, so I am OK with hacky solutions as long as someone else has done it before and made at least some documentation for it. I just figured out how to train my own models on Strix Halo using PyTorch; it was a mess but I managed using some configuration. What were your experiences? Is there another solution you can recommend? Distributed compute?

Would love to hear everyone's experience. Even if you got a setup like this running, I would love to jump on a quick call or something (I'm on the Local Llama Discord btw), so just PM me and let's find a time. All responses welcome!

u/Thanks-Suitable — 6 days ago

How to Fine-Tune LLMs on AMD Strix Halo

After the first general fine-tuning tutorial I posted (https://www.promptinjection.net/p/the-ultimate-llm-ai-fine-tuning-guide-tutorial), some people asked if I could make the same for AMD Strix Halo, because the approach here is quite different due to ROCm.

https://preview.redd.it/g63fjundxh0h1.jpg?width=1080&format=pjpg&auto=webp&s=4ea6efb97b7306646303adc9020f0a075e08865b

I listened and here it is now:
https://www.promptinjection.net/p/how-to-fine-tune-llms-on-amd-strix-halo-ryzen-ai-max-395-sft-lora

- Linux and pure Windows (no WSL!)
- Full SFT and LoRA

u/PromptInjection_ — 3 days ago

Seeking suggestions for building my AI workflow

Hello, I recently got the Asus ROG Flow Z13 128 GB and am trying to make the best use of it

I’m trying to design a local-first AI research/coding workflow and would like feedback from people who have built similar setups.

## Hardware

I have a compact AMD Strix Halo laptop/tablet with:

- Ryzen AI Max+ class APU
- 128GB unified memory
- 1TB internal SSD
- 1TB external NVMe SSD in a USB-C enclosure, usually attached
- Dual-boot planned: Windows + Linux

I want to use it as a portable local AI workstation.

## Main goals

I want a workflow for:

1. Local LLM use for private/sensitive projects
2. R/Quarto-style coding help
3. Research note-taking and literature synthesis
4. Building a searchable knowledge base from papers, notes, and scripts
5. Replacing Cursor-like features with VS Code-based tools
6. Avoiding vendor lock-in so I can switch between local models and cloud models as needed

## Privacy requirement

I want a strict two-lane setup:

### Private lane
Sensitive data, private scripts, private results, and internal project notes should only be accessed by local tools/models.

Possible tools:
- VS Code
- Continue.dev
- Ollama / LM Studio
- local models
- local RAG
- local Obsidian vault
- Git

No cloud LLMs should see this material.

### Cloud-safe lane
Public papers, sanitized code, general methods notes, and public/sanitized writing can use cloud models.

Possible tools:
- ChatGPT
- Claude
- Gemini
- Claude Code or Cline
- cloud-safe Obsidian vault
- cloud-safe RAG index

## Proposed folder structure

```text
~/research/
├── private_DO_NOT_CLOUD/
│   └── project/
│       ├── data_raw/
│       ├── data_derived/
│       ├── scripts/
│       ├── results/
│       ├── notes/
│       ├── obsidian_private/
│       ├── rag_private/
│       └── state/
│           ├── context_summary.md
│           ├── decisions.md
│           ├── next_steps.md
│           └── session_log.md
│
├── cloud_safe_OK/
│   ├── papers/
│   ├── paper_notes/
│   ├── sanitized_code/
│   ├── sanitized_manuscript/
│   ├── obsidian_cloudsafe/
│   └── rag_cloudsafe/
│
├── shared_ai_rules/
│   ├── AGENT_RULES.md
│   ├── PATTERNS.md
│   ├── LEARNINGS.md
│   └── SESSION_HANDOFF.md
│
└── model_bench/
    ├── benchmark_prompts/
    ├── results/
    ├── model_scores.csv
    └── current_models.md
```

## Tool plan

My current idea:

### Private lane

  • VS Code + Continue.dev
  • Ollama or LM Studio as local backend
  • local models only
  • Obsidian private vault
  • local RAG index
  • Git for audit trail

### Cloud-safe lane

  • VS Code + Continue.dev
  • optional Claude Code or Cline
  • cloud models only on sanitized/public files
  • Obsidian cloud-safe vault
  • separate RAG index

### Maybe later

  • Aider for stricter Git patch workflows
  • LiteLLM/OpenRouter if model routing becomes annoying
  • More advanced memory tools only if Markdown-based memory is not enough

## Model strategy

I don’t want to hard-code the workflow around one model.

I want roles like:

private_fast
private_coder
private_reasoner
private_writer
private_embeddings

Then I can periodically test new local models and replace the active model for a role if the new one performs better.

I plan to keep a small benchmark folder with prompts for:

  • R or Python coding
  • Quarto/notebook generation
  • data QC logic
  • methods writing
  • debugging
  • privacy compliance
  • RAG-based answering

## Questions

  1. Is this two-lane privacy architecture reasonable?
  2. Is VS Code + Continue.dev the best Cursor replacement for a vendor-agnostic workflow?
  3. Would you add Cline or Aider early, or wait until the basic workflow is stable?
  4. What is the cleanest way to prevent accidental cloud exposure from the private folder?
  5. Is Obsidian + Markdown state files enough for long-term memory, or should I use a dedicated memory layer?
  6. For local RAG, would you recommend DuckDB/FAISS/Chroma/ragnar/raghilda or something else?
  7. For a Strix Halo 128GB machine, what local model/backends would you test first?
  8. Any suggestions for keeping Git history clean when AI agents edit files?
  9. What would you simplify in this plan?
  10. What are the main failure modes I should watch out for?

I’m trying to keep the setup practical and not over-engineer it, while making sure private data stays local and the workflow remains model/vendor agnostic. I would love to have your opinion on this.

u/No_Cap_5982 — 5 days ago

MTP llama.cpp -- anyone run it yet?

Anyone run models like qwen3.6 27B with the MTP llama.cpp patch set yet? I am assuming it will help considerably, but was curious if anyone tried it out yet?

u/skibud2 — 6 days ago

As stated, I'm trying to see if any of you have pushed things that high (124 GB). I'm seeing what I can squeeze out of the box. I spent a day of tuning for this but can only get to ~110 GB; when I was on the beta I know I got 115 before OOM, I just can't remember what got me there.

Thanks team, Ryan.

u/IQReactor — 9 days ago

Hi, I am looking for some advice on the best local models I can use for coding via VS Code or similar, running on a Corsair AI Workstation 300 with AMD Ryzen AI Max+ 395 / 128 GB (96 GB shared for VRAM).

Please suggest best models I can use along with tips on best extensions for VS Code or similar to work with.

So far I have managed to get some models running okay on this device, notes below, but would appreciate comments / help from others using similar spec hardware.

qwen3.6:35b-a3b - works quickly but created code often has issues, regularly gets "stuck" and goes in circles trying to fix obvious errors

qwen3.6:27b - slower than above as expected, but produces better output, although there are still often a lot of coding issues to fix before projects will build / run

Have tested with VS Code (using Roo Code and Cline extensions), also tried OpenCode but kept getting errors

Found that VS Code extensions worked better when accessing models via Ollama rather than LM Studio, although in testing LM Studio does seem quicker on this system, but definitely not so reliable via Roo Code

In all cases have increased context size as much as VRAM allowed.

Any other suggestions on

a) Models I could try which would give better coding results than qwen3.6:27b ?

b) different tweaks / settings to try to improve performance and output

c) different apps / extensions / IDE's etc to try

u/wingers999 — 8 days ago

I like numbers myself so contributing.

FYI, below is formatted with AI :

Technical Benchmark: Nimo AI Mini PC - AMD Ryzen AI Max+ 395 (Strix Halo)

Sharing a comprehensive performance review of the Nimo AI Mini PC. This unit features the new Strix Halo architecture with 128GB of unified LPDDR5X memory and a 2TB SSD. Tests cover gaming (1440p), synthetic benchmarks (4K), and large-scale AI inference (128B Model).


System Specifications

  • Model: Nimo AI Mini PC
  • CPU: AMD Ryzen AI Max+ 395 (Strix Halo)
  • GPU: AMD Radeon 8060S
  • RAM: 128GB LPDDR5X (121Gi Visible)
  • Storage: 2TB NVMe SSD
  • OS: Linux Mint 22.3 / Ubuntu 24.04
  • Driver: Mesa 25.2.8 / ROCm 7.8.0

AI Inference Performance (Mistral-Medium 128B)

One of the standout features is the 128GB unified memory, allowing for ultra-large model offloading.

  • Model: Mistral-Medium-128B-Q4_K_M (~75GB)
  • Generation Speed: 1.57 tok/sec (Sustained)
  • VRAM Utilization: 79Gi (Unified Memory)
  • Peak Power: 145.0W (Prefill/Bursts)
  • Peak Noise: 46 dBA

Note: Successfully offloaded the entire 128B model to the iGPU with ~40Gi remaining for context.


Gaming & Graphics Benchmarks

DOOM Eternal (1440p Ultra Nightmare)

  • Resolution: 2560 x 1440 (1440p)
  • Preset: Ultra Nightmare (Maxed)
  • Framerate: 137 - 144 FPS (Stable / 144Hz Monitor Cap)

Unigine Superposition (4K Optimized)

  • Score: 7900
  • Average FPS: 59.1
  • Preset: 4K Optimized

Hardware Telemetry & Thermal Performance

Captured during sustained peak load (150W Power Envelope).

Idle Baseline:

  • System Power: 6.1W
  • Temperature: 40.9°C
  • Fan Noise: 27 dBA

Peak Load Performance:

  • Peak System Power: 154.1W
  • Peak GPU Temp: 88.0°C
  • Max GPU Clock: 2900 MHz
  • Peak CPU Temp: 88.5°C
  • Max CPU Load: 42.4% (Gaming)
  • Max VRAM Used: 79 GB (AI Inference)
  • Peak Fan Noise: 46 dBA

Technical Fixes Applied

To unlock the full potential of this Strix Halo unit:

  • RAM Carve-out: Adjusted BIOS UMA settings to unlock full 128GB (121Gi visible).
  • Driver Initialization: Removed amdgpu from the modprobe blacklist for ROCm support (sketch after this list).
  • Optimizations: Utilized HIPFIRE_MMQ=1 and HSA_OVERRIDE_GFX_VERSION=11.0.13.
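
The blacklist fix from the driver-initialization bullet, roughly (file names vary by distro, so treat this as a sketch):

# find the leftover entry, comment it out or delete it, then rebuild the initramfs
grep -Rn "blacklist amdgpu" /etc/modprobe.d/
sudo update-initramfs -u
sudo reboot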

u/westsunset — 10 days ago

Hey folks, I'm currently using the Q5_K_M version of Qwen3.5-122B-A10B on llama.cpp running on my 128 GB Strix Halo box running Ubuntu, and it is working well in chat mode.

However, when I connect with Claude Code, that's when the performance drops drastically.

Did anyone have any success connecting it with Claude Code? I am facing issues with responses being extremely slow in Claude Code mode and no agentic behaviour, i.e. the job doesn't spawn multiple agents.

Any help is appreciated.

u/DieHard028 — 11 days ago