u/Outrageous_Bug_669 — 1 day ago
▲ 24 r/AMDRyzen+2 crossposts

fine-tuning 27B hybrid models on strix halo (ryzen ai max+ 395 / gfx1151, 128 gb unified memory) — full guide, patches, orchestrator

Sharing a guide I just published for fine-tuning 27B+ LLMs on AMD Strix Halo (Ryzen AI MAX+ 395, Radeon 8060S / gfx1151, 128 GB unified memory). MIT licensed.

Repo: https://github.com/h34v3nzc0dex/strix-halo-llm-finetune-guide

None of the individual pieces are novel — kernel patches, ROCm 7.13 nightly, FLA, bitsandbytes, LoRA, llama.cpp. The intersection (Strix Halo + gfx1151 + FLA + Qwen3.5 hybrid at 27B) isn't documented anywhere I could find, and getting it stable took a lot of dead ends I'd rather other people skip.

Stack tested: kernel 6.19.14, PyTorch 2.11.0+rocm7.13.0a20260506, ROCm 7.13 nightly, FLA 0.5.1 patched, bitsandbytes 0.50.0.dev0 built from source for gfx1151, llama.cpp b867+. Hardware: Corsair AI Workstation 300 (Sixunited AXB35-02 board, BIOS 3.07).

Things the guide actually covers that I had to figure out the hard way:

  • PyPI bitsandbytes ships zero ROCm binaries. From-source build with -DROCM_VERSION=83, plus a runtime symlink libbitsandbytes_rocm83.so → libbitsandbytes_rocm713.so so bnb's HIP detection on PyTorch 2.10/2.11 stops complaining.
  • FLA's Triton kernels crash on gfx1151 (RDNA 3.5) with num_warps > 4 (Triton#5609) and a tl.cumsum + tl.sum codegen interaction (Triton#3017). Idempotent re-patch script included.
  • In-process Trainer eval at 27B / 8192 seq length is structurally broken on unified-memory APUs: you hit either a kernel TTM page-allocation failure from fragmentation, or a memory-watchdog SIGKILL once free RAM drops below ~8 GB. The guide moves eval out-of-process via a bash orchestrator aligned to save_steps, which waits for full GPU memory release between train and eval and keeps a JSONL trend log.
  • Mainline kernel .deb run-parts double-dir bug on Ubuntu 24.04+ leaves packages half-configured. Repack script included.
  • /srv perms regressing to 0750 mid-training breaks importlib.metadata path traversal and crashes TRL's create_model_card. A cron watchdog restores 755.
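The bitsandbytes symlink workaround from the first bullet boils down to something like this (a Python sketch; the actual library directory depends on where your bnb build lands, and in practice a one-line `ln -s` does the same job):

```python
from pathlib import Path

def ensure_bnb_symlink(libdir: Path) -> Path:
    """Create libbitsandbytes_rocm83.so -> libbitsandbytes_rocm713.so in
    bnb's library dir, so its HIP detection finds the filename it expects.
    Idempotent: safe to call on every startup."""
    link = libdir / "libbitsandbytes_rocm83.so"
    target = libdir / "libbitsandbytes_rocm713.so"
    if not link.exists():
        link.symlink_to(target.name)  # relative link inside the same dir
    return link
```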
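The FLA re-patch idea from the second bullet can be sketched as an idempotent text rewrite over the installed kernel sources. This is illustrative only, not the repo's actual patch script (which also handles the tl.cumsum/tl.sum interaction); it just shows why re-running after every `pip install` is safe:

```python
import re
from pathlib import Path

# Clamp any num_warps=8/16 kwarg in FLA's Triton kernels down to 4,
# which gfx1151 (RDNA 3.5) tolerates. Re-running is a no-op, so the
# patch survives reinstalls/upgrades of fla.
PATTERN = re.compile(r"num_warps\s*=\s*(?:8|16)\b")

def patch_file(path: Path) -> bool:
    """Rewrite one source file in place; True if anything changed."""
    src = path.read_text()
    patched = PATTERN.sub("num_warps=4", src)
    if patched != src:
        path.write_text(patched)
        return True
    return False
```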
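The out-of-process eval pattern from the third bullet looks roughly like this. The repo's real orchestrator is bash; this is a Python sketch of the same cycle, and `train_cmd`/`eval_cmd` are placeholders for whatever your entry points are (eval must print a JSON metrics dict to stdout):

```python
import json
import subprocess
import time

def run_cycle(step, train_cmd, eval_cmd, log_path, settle_s=30):
    """One train -> eval cycle. Training resumes from the last checkpoint
    and exits once it has saved checkpoint-{step} (step being a multiple
    of save_steps); eval then runs in a fresh process, so the GPU's share
    of unified memory is fully released in between."""
    subprocess.run(train_cmd, check=True)
    time.sleep(settle_s)  # give the kernel time to reclaim GTT/TTM pages
    out = subprocess.run(eval_cmd, check=True, capture_output=True, text=True)
    metrics = json.loads(out.stdout)
    with open(log_path, "a") as f:  # JSONL trend log, one row per eval
        f.write(json.dumps({"step": step, **metrics}) + "\n")
    return metrics
```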
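And the perms watchdog from the last bullet is a few lines (shown in Python here; the repo uses a cron job, and a `chmod` one-liner works equally well):

```python
import stat
from pathlib import Path

def restore_mode(path="/srv", want=0o755):
    """Re-assert permissions on a directory that keeps drifting back to
    0750 mid-run, breaking importlib.metadata path traversal. Returns
    True if the mode had drifted and was fixed; run from cron every few
    minutes during long training jobs."""
    p = Path(path)
    if stat.S_IMODE(p.stat().st_mode) != want:
        p.chmod(want)
        return True
    return False
```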

Verified result: in-progress production fine-tune of Qwen3.5-27B (hybrid, 16 full-attention + 48 GatedDeltaNet layers), bf16 LoRA r=128/α=256, eval rolling at 0.13 loss / 96.5% token accuracy, ~11 min/step, ~4-day total runtime.
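For reference, the r=128/α=256 setup maps onto a peft config roughly like this — only r and lora_alpha come from the post; the target modules and dropout are my guesses, and a hybrid GatedDeltaNet stack may well need different module names:

```python
from peft import LoraConfig

# r / lora_alpha are from the run above; everything else is an assumption.
lora_cfg = LoraConfig(
    r=128,
    lora_alpha=256,
    lora_dropout=0.05,                                        # assumption
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    task_type="CAUSAL_LM",
)
```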

Feedback and issues welcome, especially from people on different AXB35-02 boards or non-Corsair Strix Halo systems — I'd like to know what's board-specific vs. generic.

https://preview.redd.it/8i3ebs27h00h1.jpg?width=649&format=pjpg&auto=webp&s=1a4fe453e9e46c97b71a14b993b9536288169ca1
