
fine-tuning 27B hybrid models on strix halo (ryzen ai max+ 395 / gfx1151, 128 gb unified memory) — full guide, patches, orchestrator
Sharing a guide I just published for fine-tuning 27B+ LLMs on AMD Strix Halo (Ryzen AI MAX+ 395, Radeon 8060S / gfx1151, 128 GB unified memory). MIT licensed.
Repo: https://github.com/h34v3nzc0dex/strix-halo-llm-finetune-guide
None of the individual pieces are novel — kernel patches, ROCm 7.13 nightly, FLA, bitsandbytes, LoRA, llama.cpp. The intersection (Strix Halo + gfx1151 + FLA + Qwen3.5 hybrid at 27B) isn't documented anywhere I could find, and getting it stable took a lot of dead ends I'd rather other people skip.
Stack tested: kernel 6.19.14, PyTorch 2.11.0+rocm7.13.0a20260506, ROCm 7.13 nightly, FLA 0.5.1 patched, bitsandbytes 0.50.0.dev0 built from source for gfx1151, llama.cpp b867+. Hardware: Corsair AI Workstation 300 (Sixunited AXB35-02 board, BIOS 3.07).
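If you want to confirm you're on the same stack before starting, standard PyTorch/ROCm introspection is enough (nothing below is specific to the guide):

```bash
# Confirm the ROCm-enabled PyTorch build and the gfx1151 target are what's actually loaded.
python -c "import torch; print(torch.__version__, torch.version.hip)"
python -c "import torch; print(torch.cuda.get_device_name(0))"  # should report the Radeon 8060S
rocminfo | grep -m1 gfx                                         # should show gfx1151
```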
Things the guide actually covers that I had to figure out the hard way:
- PyPI bitsandbytes ships zero ROCm binaries. From-source build with `-DROCM_VERSION=83`, plus a runtime symlink `libbitsandbytes_rocm83.so` → `libbitsandbytes_rocm713.so` so bnb's HIP detection on PyTorch 2.10/2.11 stops complaining (build sketch below).
- FLA's Triton kernels crash on gfx1151 (RDNA 3.5) with `num_warps > 4` (Triton #5609) and a `tl.cumsum` + `tl.sum` codegen interaction (Triton #3017). Idempotent re-patch script included (sketch below).
- In-process Trainer eval at 27B / 8192 seq length is structurally broken on unified-memory APUs: either a kernel TTM page allocation failure from fragmentation, or a memory-watchdog SIGKILL when free RAM drops under ~8 GB. Eval is moved out-of-process via a bash orchestrator aligned to `save_steps`, waiting for full GPU release between train and eval, with a JSONL trend log (orchestrator sketch below).
- Mainline kernel `.deb` run-parts double-dir bug on Ubuntu 24.04+ leaves packages half-configured. Repack script included (sketch below).
- `/srv` perms regressing to 0750 mid-training breaks `importlib.metadata` path traversal and crashes TRL's `create_model_card`. Cron watchdog restoring 755 (sketch below).
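Rough shape of the bitsandbytes build and symlink from the first item. Only `-DROCM_VERSION=83` and the symlink names come from the guide; the remaining flags follow upstream's HIP build docs and are assumptions, so check the repo for the exact invocation:

```bash
# From-source bitsandbytes for gfx1151 (flags other than -DROCM_VERSION=83 are assumptions).
git clone https://github.com/bitsandbytes-foundation/bitsandbytes
cd bitsandbytes
cmake -DCOMPUTE_BACKEND=hip -DBNB_ROCM_ARCH=gfx1151 -DROCM_VERSION=83 -S .
make -j"$(nproc)"
pip install -e .

# Alias the built rocm83 library under the name bnb's HIP detection expects on
# PyTorch 2.10/2.11 + ROCm 7.13.
cd bitsandbytes   # the python package dir inside the checkout
ln -sf libbitsandbytes_rocm83.so libbitsandbytes_rocm713.so
```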
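For the FLA item, the part of the patch that caps `num_warps` can be sketched as a naturally idempotent search-and-replace over the installed package; the `tl.cumsum` + `tl.sum` codegen fix is separate and not shown here (the real re-patch script is in the repo):

```bash
# Cap num_warps at 4 in the installed fla package; re-running is a no-op once
# no larger literals remain. The fla file layout is assumed, not guaranteed.
FLA_DIR="$(python -c 'import fla, os; print(os.path.dirname(fla.__file__))')"
for n in 8 16 32; do
  grep -rl --include='*.py' "num_warps=${n}" "$FLA_DIR" \
    | xargs -r sed -i "s/num_warps=${n}/num_warps=4/g"
done
```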
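The out-of-process eval loop, reduced to its shape. `train_lora.py`, `eval_lora.py`, and their flags are hypothetical stand-ins; the repo's orchestrator adds free-memory checks and failure handling on top of this:

```bash
SAVE_STEPS=50
OUT=out/qwen3p5-27b-lora
step=0
while true; do
  step=$((step + SAVE_STEPS))
  # Train one chunk, then let the trainer exit so the GPU / unified memory is
  # fully released before eval starts in its own process.
  python train_lora.py --output_dir "$OUT" --max_steps "$step" --resume_from_checkpoint || break
  sleep 60
  ckpt="$(ls -d "$OUT"/checkpoint-* 2>/dev/null | sort -V | tail -n1)"
  python eval_lora.py --checkpoint "$ckpt" --max_seq_len 8192 >> "$OUT/eval_trend.jsonl"
done
```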
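The kernel `.deb` repack is just extract, fix, rebuild; the actual run-parts path correction lives in the repo's script, and the filename below is a placeholder:

```bash
dpkg-deb -R linux-image-6.19.14_amd64.deb pkg     # raw-extract, including DEBIAN/ scripts
"${EDITOR:-nano}" pkg/DEBIAN/postinst             # correct the doubled run-parts directory
dpkg-deb -b pkg linux-image-6.19.14_fixed_amd64.deb
sudo dpkg -i linux-image-6.19.14_fixed_amd64.deb
```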
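And the `/srv` watchdog is a one-line cron job; the interval and file name below are an example, not the repo's exact entry:

```bash
# Restore /srv to 0755 every 5 minutes so importlib.metadata path traversal
# (and TRL's create_model_card) keeps working mid-run.
cat <<'EOF' | sudo tee /etc/cron.d/srv-perms
*/5 * * * * root [ "$(stat -c '%a' /srv)" = "755" ] || chmod 0755 /srv
EOF
```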
Verified result so far: an in-progress production fine-tune of Qwen3.5-27B (hybrid, 16 full-attention + 48 GatedDeltaNet layers), bf16 LoRA r=128/α=256, with eval holding at 0.13 loss / 96.5% token accuracy, ~11 min/step, ~4-day total runtime.
Feedback and issues welcome, especially from people on different AXB35-02 boards or non-Corsair Strix Halo systems — I'd like to know what's board-specific vs. generic.