u/Creepy-Douchebag

FLM on Strix Halo Linux: where I think the NPU fits

I spent some time testing FastFlowLM on a Ryzen AI MAX+ 395 Strix Halo box under Ubuntu 24.04 with kernel 7.0 and the in-tree amdxdna driver.

Short version: the NPU is real and useful, but I do not think it should be the only local inference backend.

My rough FLM numbers, using FLM's own counters:

Model Decode
qwen3:0.6b 93 t/s
qwen3:1.7b 42 t/s
llama3.2:3b 25 t/s
qwen3:4b 19 t/s
qwen3:8b 11 t/s
llama3.1:8b 11 t/s
gpt-oss:20b 20 t/s

The interesting part is not just raw speed. It is where the NPU is useful.

Small models are great. A 0.6B or 1.7B model on the NPU is exactly what I want for always-on local assistant work: routing, summarization, command interpretation, RAG glue, tool selection, and background agents. It can stay hot without waking up the whole APU.

The 3B/4B tier is also practical. llama3.2:3b at ~25 t/s feels like a real local assistant lane, especially if power matters.

Dense 8B models are less exciting at ~11 t/s. Usable, but that is probably where I would rather use the iGPU with llama.cpp/Vulkan or ROCm once the stack matures.

The surprise is MoE. gpt-oss:20b at ~20 t/s is about the same as qwen3:4b, which makes sense if active parameters are in the same ballpark. That may be the real NPU sweet spot: not huge dense models, but efficient low-active-parameter MoE.

So my current mental model is:

  • NPU/FLM: low-power always-on lane, small models, supported MoE, agent plumbing
  • iGPU/llama.cpp Vulkan: general local LLM lane, GGUF ecosystem, bigger dense models
  • CPU: fallback and tiny utility models
  • A router above all of it: one API, multiple backends

Windows seems closer to a unified vendor-supported plane for Ryzen AI. Linux does not have that yet. On Linux, FLM is the thing that makes the NPU useful today, but it is still a separate runtime with its own format, model list, bugs, and licensing.

The licensing is also worth noting. FastFlowLM says the orchestration/CLI code is MIT, but the NPU kernels are proprietary binaries. Their README says commercial use is free up to USD 10M annual company revenue, then you need a commercial license. That is fine for hobby projects and probably many small products, but it is not the same risk profile as a fully open stack like llama.cpp.

My takeaway: FLM is a very useful backend, but not the foundation. The foundation should be a Linux inference router that can dispatch across NPU, iGPU, and CPU. FLM plugs into that as the NPU lane.

That is probably the shape Strix Halo wants: not one runtime to rule everything, but a local inference plane that knows which silicon to use for each job.

reddit.com
u/Creepy-Douchebag — 8 days ago

Strix Halo NPU + FastFlowLM — quick throughput numbers (Linux, kernel 7.0)

Bare-metal benchmark, single chat completion per model after a short warm-up call. Measured on:

CPU/APU         AMD Ryzen AI MAX+ 395 (Strix Halo)
NPU             RyzenAI-npu5 / XDNA2, 8 columns × 4 rows = 32 AIE tiles
NPU firmware    1.1.2.65
RAM             128 GB
OS              Ubuntu 24.04.4 LTS
Kernel          7.0.0-070000-generic (mainline, in-tree amdxdna 0.6)
XRT (userspace) 2.21.75 from ppa:amd-team/xrt
Runtime         FastFlowLM (FLM) v0.9.40, Q4NX format, NPU-only
Endpoint        OpenAI-compatible /v1/chat/completions on 127.0.0.1:52625
amd_iommu       NOT disabled (NPU requires it)
memlock         unlimited
Model store     ~/.config/flm/models/<Model>-NPU2/model.q4nx

Numbers (from FLM's own usage block on the second prompt)

Prompt: "List five common programming languages with one line about each." · max_tokens=120, temperature=0.0, stream=false.

Model Prompt tok Output tok TTFT (ms) Prefill (t/s) Decode (t/s)
qwen3:0.6b 28 20 520 53.8 93.0
qwen3:1.7b 24 120 623 38.5 42.1
llama3.2:3b 46 120 902 51.0 24.9
qwen3:4b 24 120 1061 22.6 19.4

All numbers are FLM's own counters from the JSON response (prefill_speed_tps, decoding_speed_tps, prefill_duration_ttft). One run each, so treat as ballpark, not a proper N=10 study.

Observations

  • qwen3:0.6b stops at 20 tokens — it produces a tight numbered list and emits its end-of-sequence token cleanly. The other Qwen3 runs lean on Qwen3's reasoning/<think> mode and burn the full 120-token budget, which is why their decode rate looks "lower" but their wall time is comparable to a non-thinking model that emits a similar amount of text.
  • Llama 3.2 3B at ~25 t/s decode matches FLM's own published Strix Halo number (~28 t/s on a different prompt) within run-to-run noise.
  • qwen3:4b at ~19 t/s decode is in the same band as gpt-oss-20b (FLM published ~19 t/s for that — MoE so the active params are ~3-4B, equivalent compute).
  • TTFT scales roughly linearly with model size, ~0.5 → 1 s in this 0.6B → 4B sweep. Cold-start (first ever request after flm serve boots) adds another ~3-5 s on top of these numbers; we discard that with the warm-up.
  • Power draw stays under ~2 W on the NPU + a few W on the CPU/RAM during decode (FLM's own power claim; not measured here, but xrt-smi examine -r dynamic-regions corroborates qualitatively — the NPU is the only thing busy).

Reproducer

The runtime is a single .deb:

# Kernel ≥ 7.0 (in-tree amdxdna), amd_iommu NOT off, memlock unlimited, render group.
curl -fL --progress-bar -o /tmp/flm.deb \
    https://github.com/FastFlowLM/FastFlowLM/releases/download/v0.9.40/fastflowlm_0.9.40_ubuntu24.04_amd64.deb
sudo apt install -y /tmp/flm.deb
flm validate                     # should report "8 columns" and the FW version
flm pull qwen3:0.6b              # one-time download (~650 MB)
flm serve qwen3:0.6b --quiet &   # foreground server on port 52625

then any OpenAI-compatible client works:

from openai import OpenAI
c = OpenAI(base_url="http://127.0.0.1:52625/v1", api_key="flm")
r = c.chat.completions.create(
    model="qwen3:0.6b",
    messages=[{"role": "user", "content": "Hi!"}],
    max_tokens=64,
)
print(r.choices[0].message.content)

Caveats / what's NOT in this number set

  • Single-token completions crash FLM v0.9.40 with a basic_string::substr bounds error (chat endpoint returns HTTP 200 with {"error": ...}). All of the above prompts force multi-token output to dodge it. Filed/known in the FLM tracker.
  • AMD's official Ryzen AI Linux stack (OGA, model_generate) does NOT enroll Strix HaloRyzenAI-SW#366. So this is FLM-vs-nothing on STX-H; there's no native AMD-stack number to put alongside on this hardware.
  • pyxrt 2.21.75 from ppa:amd-team/xrt does not enumerate amdxdna 0.6 (enumerate_devices() returns 0 even though xrt-smi examine and flm validate both bind the device). FLM bypasses pyxrt and links libxrt2 from C++ directly, so this packaging gap doesn't affect FLM perf, but it does break IRON / mlir_aie Python flows on kernel 7.0 until AMD ships an updated python3-xrt.
  • Sourcing /opt/xilinx/xrt/setup.sh BEFORE running flm serve makes FLM see zero NPU devices — keep the XRT env unset when running FLM.
  • These are ballpark single-run numbers. For a real benchmark you want N=10+ with throughput averaged across long prompts/long outputs, plus power telemetry from xrt-smi. Happy to share the harness (/tmp/flm_quickbench.sh) if anyone wants to extend it.
reddit.com
u/Creepy-Douchebag — 9 days ago

⚠️ UPDATE (2026-04-28). Most of this post is wrong. Two separate corrections; both are mine to own. Leaving the post up with strikethroughs because the failure mode is more interesting than the original argument, and I owe clean retractions more than a clean delete.

Correction #1 — I conflated two AMD stacks that are not the same stack. I drew confident conclusions about the wrong one and skipped the one that actually answers the question I was asking.

The two stacks:

  1. Ryzen AI 1.7.1 deployment toolkit — closed-source, DirectML-bound. I extracted this from the public NuGet (RyzenAI_Deployment.1.7.1.nupkg), which contains only Windows DLLs (onnxruntime_providers_ryzenai.dll, onnxruntime_vitisai_ep.dll, dyn_dispatch_core.dll, etc). The NuGet observation is accurate. The conclusion I drew from it — "therefore there is no Linux Ryzen AI EP at all" — was wrong (see Correction #2 below).
  2. Lemonade Server — a separate AMD-built, MIT-licensed, open-source project. Linux build exists. Has its own backend recipe system (llamacpp:rocm, llamacpp:vulkan, flm:npu, kokoro:cpu, whispercpp:*, sd-cpp:*). Officially documents Linux NPU support via FastFlowLM at lemonade-server.ai/flm_npu_linux.html, dated 2026-03-11, co-authored by the Lemonade team and the FastFlowLM team.

I investigated stack #1 thoroughly. I never looked at stack #2. I assumed "AMD on Linux = no NPU LLM" because the deployment NuGet I tore apart shipped no Linux artifacts. Lemonade is the open-stack answer to the LLM-on-NPU-on-Linux question, and it's been working since March 2026.

Correction #2 — AMD's closed Ryzen AI EP also exists for Linux. I just couldn't see it from the NuGet because it's distributed separately, behind an account login.

The official page ryzenai.docs.amd.com/en/latest/llm_linux.html (last updated 2026-04-19) walks through running LLMs on the NPU under Linux using libonnxruntime_providers_ryzenai.so — the Linux equivalent of the .dll I extracted. Distribution constraints:

  • Bundle: ryzen_ai-1.7.1.tgz + RAI_1.7.1_Linux_NPU_XRT.zip, both downloaded from account.amd.com (AMD account login required).
  • Packaged as .deb, target Ubuntu 24.04 LTS (Python 3.12, kernel ≥ 6.10). Not on Arch / not on PyPI; AMD also runs pypi.amd.com/ryzenai_llm/1.7.1/linux/simple/ for the model-generate Python piece.
  • Cited bench in their docs: Phi-3.5-mini-3.8B at ~17.6 tok/s decode, 864 tok/s prefill on the NPU through this stack.
  • Loads AMD's own 200+ AWQ/OGA HF checkpoints (huggingface.co/amd/*_rai_1.7.1_npu_*) — the artifacts that Lemonade + FLM can't run today.

So my original "zero Linux runtime exists" claim was wrong on two counts: Lemonade + FLM is the open Linux LLM-on-NPU path (Correction #1), and AMD's closed Ryzen AI EP also has a Linux distribution (Correction #2) — it's just .deb-only, auth-walled, and not in any public package manager. The NuGet I extracted is Windows-only; what I missed is that AMD ships a parallel Linux bundle through a different channel.

Correction #2.5 — "Ubuntu 24.04 only" is not a packaging restriction. It's a kernel-ABI restriction. I tested this. I installed the .deb stack on a CachyOS box via Distrobox (Ubuntu 24.04 container with NPU pass-through), got xrt-smi examine to see the NPU just fine, ran AMD's quicktest.py (basic ONNX inference on NPU through VitisAIExecutionProvider) and it passed cleanly — "Test Finished", session initialized, NPU responded. So far so good. Then I tried the actual LLM flow (onnxruntime-genai-ryzenai loading Phi-3.5-mini-instruct_rai_1.7.1_npu_4K) and got the least helpful possible error message:

[E:onnxruntime:, inference_session.cc:2544 operator()] Exception during initialization: Generic Failure

Closed .so, no string. Quicktest passes; LLM init fails. The difference: the strict RyzenAI provider used for LLMs requires the AMD-shipped kernel module (xrt_plugin-amdxdna 2.21.260102.53.release, distributed in the same .deb bundle as the userspace), and that module's DKMS build fails inside the container because there are no kernel headers for 7.0.2-1-cachyos to compile against. The host's in-kernel amdxdna handles the userspace handshake (so xrt-smi works, quicktest works), but the strict EP path checks something the AMD-shipped module exposes that the in-kernel one doesn't, and rejects with no diagnostic.

Practical implication: AMD's Linux Ryzen AI flow runs on Ubuntu 24.04 with AMD's amdxdna .ko loaded — full stop. Distrobox-on-Arch isn't enough. Native Arch / Fedora / etc. would need to either (a) build AMD's amdxdna against the running kernel via DKMS, replacing the in-kernel one, or (b) boot Ubuntu 24.04 natively. The "Ubuntu only" framing in the docs is technically a kernel ABI gate, not a packaging convenience.

This makes Lemonade + FastFlowLM the only Linux NPU LLM path that works on non-Ubuntu distributions today. Not because AMD ignored Linux — they shipped the EP — but because the closed EP's kernel-module pinning makes anything outside Ubuntu 24.04 a port project. (FLM, by contrast, is happy with whatever in-kernel amdxdna your distro provides.)

What pushed me into the wrong conclusion: when I tried to use Lemonade's flm:npu recipe on Linux, it reported update_required: Backend update is required before use and the auto-installer raised "FLM auto-install is only supported on Windows." I read that and stopped. The actual cause: a strict-equality version-pin in /usr/share/lemonade-server/resources/backend_versions.json — the manifest pins flm.npu = v0.9.38, the Arch AUR package ships v0.9.39. Any newer/older patch version flips the recipe to update_required even when FLM is fully installed and validates green. One sed to bump the pin and the recipe reads installed, the same OpenAI-compat API on :13305 now routes to NPU or iGPU per model, and benches confirm both lanes serve LLMs end-to-end (numbers at the bottom).

If you read the original post and updated your priors based on it: please update them again. The closed-EP critique is fine; the "Linux NPU LLM on Strix Halo doesn't exist" framing is wrong.

Strikethroughs below mark the falsified claims.

TITLE OPTIONS (pick one):

A) AMD's Ryzen AI NPU LLM stack is structurally Windows-only. I extracted the 1.7.1 toolkit. Here's why. → AMD's Ryzen AI 1.7.1 stack is Windows-only — but that's not the only AMD stack, and I missed Lemonade.

B) Receipts: Why FastFlowLM is the only Linux NPU LLM runtime on Strix Halo (the AMD/HF model zoo, the Windows wall, what's actually open) → Receipts retracted: Lemonade + FastFlowLM is the officially documented Linux NPU LLM stack, not a community workaround.

C) Linux + Ryzen AI: The model artifacts are public. The runtime is gated by DirectML. Deep dive with extracted DLL list. → Linux + Ryzen AI: the AWQ/OGA EP is gated by DirectML. The LLM-on-NPU path on Linux is Lemonade + FLM, fully shipped.

SUGGESTED SUBREDDIT: r/LocalLLaMA (cross-post candidates: r/AMD, r/ROCm, r/Amd_Tech)

POST BODY:

I have a Strix Halo (Ryzen AI MAX+ 395) box and have been running 1-bit / sub-2-bit LLMs locally. The GPU lane (ROCm + Vulkan llama.cpp) is great. The NPU lane is more interesting — and more frustrating. So I tore apart AMD's Ryzen AI 1.7.1 release to figure out why.

TL;DR: AMD has 200+ Strix-Halo-targeted LLM checkpoints publicly available on HuggingFace. The runtime to execute them on the NPU is structurally Windows-only — not because of packaging, but because it's built tightly on top of Microsoft DirectML + DirectX 12. A Linux port is research-tier work, not a packaging fix.

TL;DR (corrected): Ryzen AI 1.7.1's deployment EP is structurally Windows-only and built on DirectML. Lemonade Server is a separate AMD-built MIT stack with documented Linux NPU support via FastFlowLM. I conflated them. The Linux LLM-on-NPU lane works today via pacman -S xrt xrt-plugin-amdxdna fastflowlm + Lemonade Linux's flm:npu recipe, with one packaging gotcha (a version-pin equality bug in Lemonade's backend_versions.json).

Same hardware, two parallel universes

Capability (on the exact same Strix Halo box) Windows Linux
AMD official NPU LLM runtime ✅ Bundled (DirectML + Ryzen AI EP) ❌ Doesn't exist
Ryzen AI 1.7.1 EP (onnxruntime_providers_ryzenai) ❌ No Linux .so ✅ Bundled, DirectML-bound ⚠️ libonnxruntime_providers_ryzenai.so does ship for Linux — separate bundle (ryzen_ai-1.7.1.tgz) at account.amd.com, Ubuntu 24.04 / .deb only
Lemonade Server (separate AMD/MIT stack) ✅ MSI installer ✅ AUR/PPA + Linux-native flm:npu recipe (officially documented)
Lemonade Server with ryzenai-llm ✅ One-click GUI installer update_required: Backend update required — auto-installer "only supported on Windows"
Lemonade ryzenai-llm:npu recipe ✅ Windows-only ❌ Windows-only by design (uses the closed EP)
Lemonade flm:npu recipe n/a (Windows uses ryzenai-llm) ✅ Linux, manual install via pacman -S fastflowlm (auto-install is Windows-only by convention)
AMD's 200+ NPU LLM checkpoints ✅ Loadable ⚠️ Downloadable, no runtime to execute them
AMD's 200+ NPU LLM checkpoints (UINT4-AWQ, OGA hybrid) ✅ Loadable via Ryzen AI EP ⚠️ No Linux runtime for the AWQ/OGA artifacts. FLM ships its own AMD-aligned model collection (qwen3, gemma3, phi4-mini, lfm2, llama3.2, deepseek-r1, gpt-oss) that runs on Linux NPU.
Quark quantizer workflow + conda env ✅ Documented, working ❌ READMEs literally say "activate the Ryzen AI 1.7.1 conda environment" — that env is Windows-only
*.xclbin precompiled NPU binaries ✅ Loaded by Windows driver ✅ Loadable via amdxdna + xrt-plugin-amdxdna
LLM-on-NPU lane in practice AMD official stack One third-party project (FastFlowLM)
LLM-on-NPU lane in practice AMD's closed RyzenAI EP Lemonade + FastFlowLM (officially documented, jointly authored)
Diffusion / Whisper / CLIP on NPU ✅ AMD ships them ⚠️ Whisper-on-NPU works via FLM; diffusion + CLIP still mostly nothing

That's the gap, on the same chip. Windows gets the complete vertical pipeline. Linux gets the silicon and the artifacts and is told "good luck."

The actual gap, accurately scoped: Windows gets AMD's closed DirectML-bound EP plus the AWQ/OGA workflow. Linux gets an open AMD-built Lemonade stack with FLM that runs LLMs on the NPU. Real remaining gaps on Linux: AWQ/OGA model loading (no xrt-based EP), diffusion-on-NPU, Quark→Linux deployment tail. Not "no NPU LLM."

The good news: artifacts are public

huggingface.co/amd has a lot more than I initially thought. Naming convention is the index:

Suffix Target
-onnx-ryzen-strix Strix Halo NPU+iGPU hybrid
_rai_1.7.x Ryzen AI runtime version-pinned (newest format)
-onnx-hybrid Same hybrid mode, older naming
-onnx-directml Windows DirectML GPU
-onnx-cpu CPU

Architectures shipped for Strix Halo NPU: Llama 2/3/3.1/3.2 (1B-8B), Mistral 7B, Qwen 1.5/2/2.5, Phi-3 mini, Phi-3.5 mini, Gemma-2 2B, ChatGLM 3 6B, DeepSeek-R1 Distill (Qwen 1.5B/7B, Llama 8B), AMD-OLMo 1B, LFM2 1.2B, CodeLlama 7B, xLAM 2 8B. All AWQ uint4 g128 with BF16 activations. Built via Quark → OGA Model Builder → NPU-deployment finalization.

There are also two fusion variants per model: token-fusion (up to 16K context, slower per-token) and full-fusion (4K context cap, fastest per-token). LFM2-1.2B token-fusion is on HF today; full-fusion not yet visible.

The wall: extracted the actual deployment binaries

RyzenAI_Deployment.1.7.1.nupkg (298 MB), pulled from 1.7.1_nuget_signed.zip:

runtimes/win-x64/native/
├── onnxruntime_providers_ryzenai.dll   9.1 MB   ← The Ryzen AI EP itself
├── onnxruntime_vitisai_ep.dll        143 MB   ← VitisAI EP
├── dyn_dispatch_core.dll             145 MB   ← NPU dispatcher
├── dyn_bins.dll                      234 MB   ← Precompiled NPU dispatch graphs
├── onnxruntime.dll                    21 MB
├── onnxruntime-genai.dll             4.5 MB
├── DirectML.dll                       18 MB   ← Windows-only by definition
├── D3D12Core.dll                     3.2 MB   ← DirectX 12 (Windows-only)
├── aiecompiler_client.dll            9.7 MB
├── ryzen_mm.dll, ryzenai_onnx_utils.dll, …

Zero runtimes/linux-*/ entries. Zero .so files. The .nuspec literally says: "This package contains native shared library artifacts for AMD RyzenAI." Singular OS.

This part stands. The DirectML / VitisAI / dyn_dispatch stack is Windows-only. What was wrong was concluding "therefore Linux has nothing" — the deployment EP is one of two AMD LLM-on-NPU paths, and I never investigated the second one.

What IS open on Linux

  • The NPU silicon (XDNA 2 / AIE2P) — same hardware
  • xclbin files (precompiled AIE2P binaries) — architecture-portable
  • amdxdna kernel module + xrt-plugin-amdxdna (Arch package) load xclbins on Linux
  • Hardware ID: PCI\VEN_1022&DEV_17F0 for Strix Halo NPU
  • ONNX Runtime base via pip install onnxruntime
  • AMD's HF model artifacts — downloadable on Linux (no Linux runtime for the AWQ/OGA ones)
  • Lemonade Server (MIT, AMD-built, Linux build available) with the flm:npu recipe — the actual working LLM-on-NPU lane on Linux
  • FastFlowLM (open) — the runtime behind flm:npu. Documented at lemonade-server.ai/flm_npu_linux.html, jointly authored with AMD's Lemonade team.

What works on Linux today

FastFlowLM (third-party). Serves UINT4-AWQ q4nx models on the NPU via the Linux xrt-plugin-amdxdna path. Coverage: qwen3, gemma3, phi4-mini, lfm2, llama3.2, deepseek-r1, gpt-oss, etc. — overlap with AMD's official catalog but distinct binaries.

That's the entire Linux NPU LLM lane. One third-party project.

Lemonade + FastFlowLM, jointly documented Linux path (lemonade-server.ai/flm_npu_linux.html, 2026-03-11). Serves UINT4 q4nx models on the NPU via xrt-plugin-amdxdna. Coverage: qwen3, gemma3, phi4-mini, lfm2, llama3.2, deepseek-r1, gpt-oss. Behind Lemonade's OpenAI-compat API, same endpoint as the iGPU llamacpp:rocm lane. One unified API, two compute lanes, both lanes serving LLMs today.

The Linux journey is harder than it should be The Linux journey has one packaging bug, otherwise documented end-to-end

Working on this on Linux means doing Windows archaeology to figure out what tools to NOT have:

  • AMD's model READMEs literally say "Activate the Ryzen AI 1.7.1 conda environment" → Still true for the AWQ/OGA workflow. Lemonade-on-Linux doesn't need that env at all.
  • The Quark quantizer's published example scripts default to cuda or cpu. The deployment-to-NPU last mile lives in a separate, gated, Windows-only toolchain. → Still true for Quark→deployment-EP. FLM uses its own model collection so this is orthogonal to running LLMs on Linux NPU.
  • Lemonade Server's standalone installer is a Windows MSI. The Linux build of Lemonade exists, but the flm:npu and ryzenai-llm recipes both report update_required: Backend update is required before use. lemonade backends install flm:npu — and the auto-installer raises "FLM auto-install is only supported on Windows".The update_required is a strict-equality version-pin bug in /usr/share/lemonade-server/resources/backend_versions.json — Lemonade pins flm.npu = v0.9.38, AUR ships v0.9.39, mismatch flips the recipe red. Bumping the pin (one sed) flips it green. The auto-installer is Windows-only by convention; Linux uses pacman -S fastflowlm, which is the documented path. ryzenai-llm:npu is genuinely Windows-only because it uses the closed EP.
  • I had to download a 2.6 GB Windows EXE, extract 22 nested CABs, crack a NuGet package, decompile DLL names with file, just to confirm there's nothing for Linux inside. → Still true for confirming the Ryzen AI EP is Windows-only. Not necessary for the "LLM on NPU on Linux" question, which Lemonade docs answer in two paragraphs.
  • The HuggingFace optimum-amd integration documents BrevitasQuantizer (sub-INT8 / N-bit) as "Coming soon." Doc has been a stub for a while. ← still true.

AMD owns every piece of this stack. The gap is a choice. AMD owns every piece. They shipped the open Linux path for LLM-on-NPU and shipped only the closed Windows path for the EP.

This isn't a packaging miss. AMD owns:

  • ✅ The silicon (XDNA 2 / AIE2P)
  • ✅ The kernel driver (amdxdna, in mainline Linux)
  • ✅ The userspace driver path (xrt-plugin-amdxdna)
  • ✅ The NPU dispatch binaries (*.xclbin, ship in their own WHQL driver ZIP)
  • ✅ The quantizer (Quark)
  • ✅ The model conversion pipeline (OGA Model Builder)
  • ✅ The model zoo (200+ artifacts at huggingface.co/amd)
  • Lemonade Server (MIT) with documented Linux NPU support via FLM

Missing piece on Linux: a ~9 MB shared object that talks to xrt instead of DirectML. AMD wrote the Windows version. They could write the Linux version. They didn't.

Missing piece on Linux: an xrt-based equivalent of onnxruntime_providers_ryzenai.dll so that AMD's AWQ/OGA model artifacts can run on the NPU under Linux.AMD wrote the Linux version too. It's libonnxruntime_providers_ryzenai.so, ships in ryzen_ai-1.7.1.tgz from account.amd.com, targets Ubuntu 24.04 / Python 3.12 / .deb. AMD's docs at ryzenai.docs.amd.com/en/latest/llm_linux.html walk through using it to run their AWQ/OGA HF artifacts on the NPU on Linux today.

The remaining gap is distribution, not engineering — Arch / Fedora / non-Ubuntu Linux users have to debtap, distrobox, or build from a tarball.The remaining gap is kernel ABI, not packaging. I tested this on Arch via Distrobox: the .deb userspace installs cleanly, xrt-smi examine sees the NPU through /dev/accel/accel0, the basic VitisAIExecutionProvider quicktest runs to completion. But the strict LLM-side RyzenAI provider rejects with "Generic Failure" (no further string) the moment onnxruntime-genai-ryzenai tries to load a model. Distrobox passes the device through; it cannot give the container its own kernel module, and AMD's closed EP requires the AMD-shipped amdxdna .ko (distributed in the same .deb bundle) — not whatever your distro's kernel ships in-tree. Ubuntu 24.04 is the only host where AMD's shipped module loads against the running kernel without DKMS source-build hassle.

Compare to vendors who shipped Linux NPU/accelerator runtimes day-one:

  • NVIDIA: cuDNN, TensorRT, NIM — Linux first, every release
  • Intel: OpenVINO, NPU runtime — Linux as primary target, parallel Windows
  • AMD's own ROCm: years of "almost ready" on Linux. Pattern repeating for the Ryzen AI EP. But Lemonade + FLM broke the pattern for the LLM-on-NPU use case — that one shipped on Linux in March 2026 with joint docs.

FastFlowLM stepping into this gap as a single third-party project keeps the lane open. It also makes AMD look bad. AMD should be embarrassed that their Linux NPU LLM story in 2026 is "go use a community tool because we couldn't be bothered."

That paragraph is wrong on every count. AMD's Linux NPU LLM story in 2026 is lemonade-server.ai/flm_npu_linux.html, jointly authored by AMD's Lemonade team and FastFlowLM, with a pacman -S line. Not a "go use a community tool" punt — an actual partnership and shipped doc. The story I should have told: AMD's Lemonade team partnered with FastFlowLM to ship the Linux NPU LLM stack, and the only friction left in 2026 is a one-line version-pin update in the package manifest.

Open-source theater: gating just enough to tease

Scoping this critique correctly to the EP/DirectML stack — load-bearing closed pieces remain closed:

  • amdxdna kernel driver — upstream Linux, MIT/GPL
  • ✅ AMD Quark quantizer — public PyTorch repo
  • ✅ Lemonade Server — MIT, Linux build, documented Linux NPU LLM path with FLM
  • ✅ HF model artifacts — MIT-licensed, downloadable from anywhere
  • RyzenAI-SW repo on github — public, documented
  • ✅ Brevitas integration — "Coming soon" in HF docs

Then the load-bearing pieces that are actually closed (still true):

  • onnxruntime_providers_ryzenai.dll — closed, Windows-only
  • onnxruntime_vitisai_ep.dll — closed, Windows-only (143 MB)
  • dyn_dispatch_core.dll + dyn_bins.dll — closed, Windows-only (~380 MB)
  • ❌ The Ryzen AI 1.7.1 conda environment — Windows-only
  • ❌ The Lemonade flm:npu and ryzenai-llm auto-installers — Windows-onlyflm:npu works on Linux via pacman (manual install is the documented path; auto-install being Windows-only is a packaging convention, not a runtime gap). ryzenai-llm:npu is still Windows-only because it uses the closed EP.
  • ❌ Brevitas integration — perpetually "Coming soon"

This is a pattern. The pieces that don't matter on their own are open. The integration glue that makes them useful is closed. → True for the AWQ/OGA EP. Not true for Lemonade + FLM, which is integration glue that's open on Linux. The "you cannot run them" line in the original was wrong: today on Linux you can pacman -S fastflowlm and run LLMs on the NPU through Lemonade.

The bigger pattern: nobody jumps anymore

The original framing — "AMD's NPU gate is one symptom of a larger industry shift" — is half-right. The closed RyzenAI EP gate fits the pattern (closed runtime, Windows-only, conditioned on the higher tier). The Linux LLM-on-NPU gate doesn't, because there isn't one — Lemonade + FLM closed it in March.

  • 1-bit BitNet has been a Microsoft research result for two years. There is still no production-ready 1-bit runtime on any consumer NPU. Nobody shipped it. ← Still true.
  • Speculative decoding is mature, papers from 2023, 2-3× speedup proven everywhere. Most consumer inference stacks still don't have one-click support. ← Still true.
  • Heterogeneous compute (auto-dispatch NPU+iGPU+CPU per workload) is technically a solved problem. ← Counterpoint: Lemonade now serves both llamacpp:rocm (iGPU) and flm:npu (NPU) behind one OpenAI-compat API. That's not auto-dispatch, but it is "one runtime, two backends" on Linux today — closer to the goal than the original post implied existed.
  • Real Linux feature parity on consumer accelerators is routinely 1-2 years behind Windows. Sometimes never. ← The LLM-on-NPU gap closed in March 2026 with Lemonade + FLM Linux docs. So: less than a year on this one.

The big jumps in local AI right now are happening in research papers and abandoned project READMEsand in jointly-authored docs nobody reads. The Linux developer community is the only place left where someone might ship the ambitious version of something — but that someone (me, this morning) also needs to read the docs before ranting that the docs don't exist.

Accountability — actually

This post is the public record. Original timestamps stand. New timestamps:

  • AMD Ryzen AI 1.7.1 released: March 2026 (closed Windows EP)
  • AMD huggingface.co/amd org now publishes 200+ NPU model checkpoints
  • AMD huggingface.co/RyzenAI org publishes Quark recipes
  • AMD's deployment NuGet 1.7.1 ships zero Linux artifacts ← still true
  • HuggingFace BrevitasQuantizer for Ryzen AI: "Coming soon" since at least 2024
  • Open issue tracking the Linux gap: https://github.com/huggingface/optimum-amd/issues/178
  • AMD's RyzenAI-SW repo: https://github.com/amd/RyzenAI-SW
  • Lemonade Linux NPU support officially documented: lemonade-server.ai/flm_npu_linux.html, dated 2026-03-11, co-authored by Lemonade and FastFlowLM contributorsthe document I should have read before writing the original post
  • AMD's official Linux LLM-on-NPU flow documented: ryzenai.docs.amd.com/en/latest/llm_linux.html, dated 2026-04-19 ← the second document I should have read; runs huggingface.co/amd/*_rai_1.7.1_npu_* checkpoints via the closed Linux EP. Auth-walled bundles at account.amd.com (ryzen_ai-1.7.1.tgz, RAI_1.7.1_Linux_NPU_XRT.zip).
  • FastFlowLM Arch package: fastflowlm 0.9.39 in cachyos-extra-znver4
  • Fix-the-pin one-liner: sudo sed -i 's|"npu": "v0.9.38"|"npu": "v0.9.39"|' /usr/share/lemonade-server/resources/backend_versions.json && sudo systemctl restart lemonade-server (or just restart the lemond process; pin is checked on startup). Bumps the recipe from update_requiredinstalled. Will be overwritten by future Lemonade pacman updates — re-apply each time.
  • AMD Linux EP test result (Distrobox / Ubuntu 24.04 over CachyOS host kernel 7.0.2-1-cachyos): xrt-smi examine ✓, quicktest.py (basic ONNX) ✓, model_benchmark on Phi-3.5-mini-instruct_rai_1.7.1_npu_4K ✗ (Generic Failure during ORT init). Suspected cause: AMD's closed EP requires the AMD-shipped xrt_plugin-amdxdna 2.21.260102.53.release .ko, not the in-kernel amdxdna of recent kernels. Confirms the Ubuntu-24.04 requirement is kernel-ABI-bound, not packaging.

Every one of these is on AMD. Linux users have done their part — the kernel driver is upstream, the third-party runtime exists, the artifacts get downloaded, this post got written. The next move is AMD's. There is no version of "we hear you" that substitutes for a Linux .so.

What I should have written the first time, given the actual state of the world: AMD shipped the Linux NPU LLM lane via Lemonade + FastFlowLM in March 2026 (jointly documented, AMD's MIT-licensed Lemonade Server is a separate stack from the closed Ryzen AI EP I extracted), and I missed it. The remaining real gaps — AWQ/OGA EP on Linux, Brevitas integration, BitNet on NPU, diffusion on NPU on Linux — are worth pushing on. They are a smaller and better-defined set than the one I hand-waved at originally.

The next move is mine: read docs before writing posts. Bad on me. And bad on you for taking it on faith without checking. I owe you both a corrected post and a sharper bar for the next one.

Benchmarks (proof the unified Linux API works)

Single Strix Halo box, Lemonade :13305, OpenAI-compat. Same prompt for all four ("Write a 200 word paragraph about the history of compilers."), max_tokens=300. Captured 2026-04-28 right after applying the version-pin fix.

Model Backend Lane Decode tok/s Prefill tok/s TTFT
LFM2-1.2B-GGUF llamacpp:rocm iGPU (gfx1151) 216.3 1366.5 ~negligible
qwen3-0.6b-FLM flm:npu NPU (XDNA 2) 95.4 76.0 460ms
qwen3-1.7b-FLM flm:npu NPU (XDNA 2) 41.8 53.7 577ms
deepseek-r1-8b-FLM flm:npu NPU (XDNA 2) 11.3 14.7 1430ms

Read: iGPU is faster than NPU on the small models I'd expect to be NPU-favored — that's not the value of the NPU lane. The value is offloading the iGPU (free for ROCm bigger-model serving), lower power for low-touch background inference, and shipping a unified API that lets the user pick per-request which silicon answers them. Same :13305 endpoint, same OpenAI request shape, two different pieces of silicon doing the work depending on the model name.

That's the receipt. The lane exists, on Linux, on Strix Halo, today, jointly shipped by AMD's Lemonade team and the FastFlowLM team. I missed it. Now I'm not.

Background context: building https://github.com/bong-water-water-bong/1bit-systems — lean install + control plane for 1-bit inference on Strix Halo. GPU lane (ROCm/Vulkan llama.cpp, IQ1_S/TQ2_0) works great. NPU lane works too — just needed the version-pin fix to wire FLM into Lemonade's recipe registry.

Happy to share extraction commands if anyone wants to verify the DLL list themselves. Even happier to be corrected, faster than I corrected myself this time.

u/Creepy-Douchebag — 16 days ago

1bit-systems bench — 2026-04-27 21:41

Hardware: AMD Ryzen AI MAX+ 395 (Strix Halo) GPU: Radeon 8060S Graphics, gfx1151, 128 GB unified LPDDR5x (256 GB/s) Backend: llama.cpp Vulkan (lemonade-bundled, build 5d3a4a7da) Method: llama-bench -p 512 -n 128 -r 2 -ngl 99 Source: 1bit bench against the lily-bonsai (lilyanatia/Bonsai-*-requantized) pile

Results

Model Quant Size pp512 (tok/s) tg128 (tok/s)

Bonsai-1.7B IQ1_S 385 MB 4834 269 Bonsai-4B IQ1_S 873 MB 1962 141 Bonsai-8B IQ1_S 1.8 GB 1116 90

Cross-backend (1.7B only)

Backend pp512 tg128

Vulkan (winner) 4834 269 ROCm 4481 194 NPU (FLM, q4nx) 49 92 (Qwen3-0.6B, separate model)

Reads

  • Vulkan beats ROCm on decode by 38% for IQ1_S — mainline shader work on sub-2-bit pulled ahead of the in-house custom Q1_0 HIP kernel (which has been retired with the cpp/ tower).
  • Throughput tracks bandwidth, not parameters:
    • 1.7B at 269 tg = ~1.4 GB/s effective (385 MB × 269 / 128)... wait simpler: 269 tok/s × 385 MB ≈ 100 GB/s of weight reads, ~40% of LPDDR5x peak.
    • 8B at 90 tg ≈ 162 GB/s, ~63% of peak. Larger models actually get closer to peak bandwidth because the fixed per-token overhead amortizes better.
  • NPU is a different lane entirely. Lower per-token throughput, but it runs concurrent with the iGPU at much lower power. ~5 W vs ~80 W.

Reproduce

git clone https://github.com/bong-water-water-bong/1bit-systems cd 1bit-systems ./install.sh 1bit pull lily-bonsai-1.7b-rq # (or pull manually from # lilyanatia/Bonsai-{1.7,4,8}B-requantized) 1bit bench

Comparison vs 2026-04-26 pile

Model pp512 then pp512 now tg128 then tg128 now

Bonsai-1.7B IQ1_S 4910 4834 281 269 Bonsai-4B IQ1_S 1984 1962 144 141 Bonsai-8B IQ1_S 1119 1116 92 90

Within ~2% across the board. Reproducible.

u/Creepy-Douchebag — 17 days ago

-=bong-water-water-bong=- the full arc: how the 1bit.systems ternary stack got from broken to 76.7 tok/s.

(Posting in this sub, so I'll skip the hardware tour — you all know what you bought. This is about what to do with it.)

TL;DR. Three iterations, eight months, 22 research papers actually read and benched against, a kernel author named Claude Opus 4.7 (1M context) running in Claude Code, one box in a closet, and a hand-tuned ternary GEMV kernel currently running at 92% of memory-bandwidth peak. We're gonna need a bigger boat — or are we?

Where this started

When this box arrived, the cloud-AI ecosystem had nothing native for it. MLX is Apple Silicon. CUDA is NVIDIA. ROCm support for the iGPU was nascent. llama.cpp's HIP backend scalar-faulted on Q1_0 at 1.78 t/s — 2700× slower than its own Vulkan backend on the exact same model. The whole stack was either "it works on a dGPU" or "it works on a phone." Our box sat between those categories with no native code path.

So we wrote one.

Iteration 1 — MLX panic (early 2026)

First instinct: try MLX-on-ROCm. MLX has no ternary mode on the AMD path. Warmup blew up with a kernel-level panic. Game over, man. Game over. I posted a write-up to Reddit anyway because I thought "look, it almost works" was interesting. It wasn't. Took the post down. Refunded the attention.

Iteration 2 — 28 crates of Rust (late winter 2026)

Rewrote the whole orchestrator in Rust + axum + tokio + every async crate in the index. It worked. Inconceivable! It was also a 28-crate workspace I couldn't finish, the kernels were calling into hipBLAS through FFI (which I'd promised myself I wasn't going to do), and the bench numbers I posted were prompt-cached, not steady-state. Pulled that post too. Two for two.

Iteration 3 — C++ end to end (spring 2026, what's live now)

Strip everything. Three rules:

  • Rule A — no Python at runtime. Python is fine for dev-box scripts (requantizers, analysis notebooks). Never inside a systemd unit, never on an HTTP serving path.
  • Rule B — C++20 default for everything that runs on the box. HIP kernels stay in rocm-cpp/. The orchestrator went C++ too.
  • Rule C — hipBLAS is banned in the runtime path. Native Tensile kernels only. If you reach for hipBLAS, port the kernel.

Built around the AMD lemonade-sdk stack — kept their recipe schema, their HTTP surface (/v1/*, /api/v1/*, OpenAI / Ollama / Anthropic compat), their config layout. The one Critical Invariant we deliberately broke: their "backends run as subprocesses" rule. We added an in-process ternary backend that calls our HIP Engine directly from lemond for the perf path. Everything else stays subprocess.

This is what's running today. I know kung fu.

The papers we read, what they claimed, and what we did to prove it on this box

Not "I read the abstract." For each one: the paper's own words, then the concrete path we took to put it on disk and bench it. Mess with the best, die like the rest.

1. BitNet 1.58 — arXiv 2402.17764 (Microsoft Research)

>"BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary {-1, 0, 1}. It matches the full-precision (i.e., FP16 or BF16) Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance."

Our path: Took the recipe (per-layer LayerNorm → ternary Linear with clip(round(W/scale), {-1,0,1}) + per-tensor scale, FP16 KV + activations). Pack via llama.cpp's TQ2_0 standard (1.58 effective bpw, 2 bits stored). Hand-wrote the HIP GEMV kernel for the wave32 WMMA path (Tensile only, no hipBLAS — Rule C). Verified on wikitext-103: PPL 9.16 (chunked 1024) — within ±0.05 of our gen-1 fp16 baseline of 9.1607.

Status on the box: Shipping. halo-1bit-2b at 67 tok/s, halo-1bit-2b-tq1 at 62 tok/s.

2. Sherry 1.25-bit — community ternary paper (March 2026)

>"Group of 4 weights encoded as 2 bits identifying the position of the zero plus 3 bits for the sign of the three nonzero entries; 5 bits / 4 weights = 1.25 bpw with no information loss vs ternary."

Our path: Wrote requantize_h1b_to_sherry.py (dev-box Python, allowed under Rule A) to repack BitNet 1.58 ternary tensors into the 3:4 sparsity grouping. Wrote kernels/ternary_gemv_sherry.hip to dequantize the 5-bit groups back to ternary at GEMV time. Trained on 3× H200 NVL pod over Run 4 + Run 5. Final: loss 4.0972 at 9,600/9,600 opt-steps, 96 ternary tensors verified 2:4-clean at every checkpoint.

Status on the box: Shipping flagship. halo-1bit-2b-sherry-cpp at 74.5–76.7 tok/s in 1.65 GB — same throughput as Vulkan Qwen3-4B at 2.38 GB.

3. Sparse-BitNet 2:4 — arXiv 2603.05168 (MS Research, Mar 2026)

>"We propose Sparse-BitNet, a 1.58-bit + 2:4 N:M sparse architecture trained jointly from scratch. The fixed sparsity pattern is enforced via straight-through estimator with no accuracy degradation at scale."

Our path: Used the joint-training recipe in our Run 5 retrain script. Mask-verify hook fires every 50 opt-steps to assert each weight tensor is exactly 2-of-4 sparse. Result on H200 NVL: 9,600/9,600 steps clean, no mask drift, 5.66 B tokens consumed. CUDA-only sparse-tensor kernel from the paper — this hardware has no sparse-tensor cores, so we exploit the sparsity at the packing layer (Sherry 1.25), not the GEMM layer.

Status on the box: Run 5 done 2026-04-25. Sherry weights are the live artifact.

4. BitNet v2 W1.58A4 — arXiv 2504.18415

>"H-BitLinear, a module that applies an online Hadamard transformation prior to the activation quantization. This transformation smooths the sharp distribution of activations into more Gaussian-like forms, suitable for low-bit (4-bit) representation."

Our path: Wrote kernels/hadamard_rotate_butterfly.hip (Walsh-Hadamard 128-element block butterfly) and added the H1B_FLAG_HADAMARD_ROTATED model flag to the .h1b header. Insertion points already exist in bitnet_decode.cpp at lines 712, 794, 805, 816, 843. Run 6 (Monday) refits Sherry weights to the rotated activation distribution; activation-side bandwidth halves.

Status: Run 6 queued for Monday. Recipe locked, kernel pre-wired, pod template same as Run 5.

5. Flash-Decoding — Tri Dao (2023)

>"We split keys and values into smaller chunks. We compute the attention of the query with each of these splits in parallel. We then combine the results to obtain the final output."

Our path: Wrote src/kv_cache_attn_fd.hip for the FP16 KV path. Two-pass: pass 1 emits per-block partial (max, sum, num), pass 2 reduces. Replaced the naive single-block-per-Q-head kernel.

Status on the box: Shipping. Measured 6.78× at L=2048, bit-exact vs naive baseline. Today wrote src/kv_cache_attn_fd_i8.hip to port the same scheme onto the INT8 KV cache path (env-flag plumbing pending).

6. TriLM Spectra-1.1 — SpectraSuite (Apache 2.0)

>"TriLM-3.9B in pure ternary form, packed via MatMulNBits N=4 ONNX, weighing 2.9 GB."

Our path: Pulled TriLM_3.9B_Unpacked from HF, ran AMD's quark-cli export-oga -p int4 to produce the MatMulNBits ONNX. Verified the 2.9 GB artifact loads clean. Tried to place on VitisAI EP (the NPU lane) — placement failed at 0/3681 ops, same root cause as Falcon-E + Microsoft BitNet on VitisAI. Decided in-process HIP path wins until AMD ships a Linux STX-H Hybrid OGA EP.

Status: Tested, deferred — re-evaluate when AMD lands STX-H Hybrid for Linux.

7. MedusaBitNet 2B — parrishcorcoran/MedusaBitNet-2B-4T

>"K=4 speculative heads on a BitNet 1.58 backbone. Each head outputs a candidate next-K token; tree-attention verifies in parallel; expected wall-clock speedup ~2× at batch=1."

Our path: Wired the head safetensors into our .h1b loader as a sidecar (.h1b-medusa v2 format). Wrote medusa_small_m_gemv.hip for the small-M head GEMV and medusa_tree_attn.hip skeleton. Benched K=4: 47.7 tok/s, 0.7× baseline at batch=1, lm_head-bottlenecked at small M. The bandwidth ceiling we hit means extra heads cost more in BW than they save in tokens-per-second.

Status: Deferred until tree-attention verify lands AND we have a path that decouples head GEMV from lm_head BW.

8. SlideSparse — arXiv 2603.05232 (MS Research, Mar 2026)

>"Sliding Window Decomposition unlocks NVIDIA 2:4 sparse-tensor-cores for arbitrary (2N-2):2N patterns on BitNet/Llama/Qwen. 1.33× @ 6:8 on a 7B Llama."

Our path: Read it. Validates that Run 5's 2:4 direction is sound. RDNA 3.5 has no sparse-tensor hardware, so we get zero engineering gain on this box. Bookmarked for the day MI300x lands as a bench target (we have access).

Status: Reading list, validates Run 5.

9. HGF (Hybrid Gaussian-Floating) — arXiv 2602.05269 (Trejo Pizzo, solo author, Feb 2026)

>"Ternary backbone with a low-rank fp16 correction path recovers 55% of the ternary→fp16 quality gap at the cost of +12-15% memory."

Our path: Read it. Unified memory makes the +12-15% memory cost trivial. No code yet — slated as a 1-week spike candidate after Run 6 lands.

Status: Post-ship spike candidate.

10. Engram (Conditional Memory) — arXiv 2601.07372 (DeepSeek, Jan 2026)

>"Conditional-memory sparsity via O(1) N-gram lookup. 27B model: +5.0 BBH, NIAH long-context 84→97 with no parameter increase."

Our path: Read it. Unified memory fit looks strong — the lookup table shares the same bus as weights. No code yet. Park post-ship.

Status: Reading list, complementary to BitNet.

11. BitNet a4.8 — community pre-Hadamard 4-bit activation attempt

>"Direct INT4 activation quantization on BitNet b1.58 weights without rotation."

Our path: Read it. Compared to BitNet v2 (2504.18415) which rotates first via Hadamard. v2's rotation makes the activation distribution near-Gaussian and enables clean INT4 — a4.8 without rotation has heavy tails and loses accuracy.

Status: Obsoleted by BitNet v2. Recipe dropped from Run 6 plan.

12. PrismML Bonsai (1-bit Q1_0) — prism-ml/Bonsai-1.7B-gguf

>"BinaryNet-style trained-from-scratch 1-bit weights {-1, +1} (no zero), packed at 1.06 bpw effective (1 sign bit + per-256-weight FP16 scale)."

Our path: Pulled from HF. Loaded via stock llama.cpp Vulkan binary. Benched: 318–330 tok/s at 256-tok essay completion. 46% faster than PrismML's own ROCm fork (231 tok/s on the community submission for the same hardware). Built turbo-tan/llama.cpp-tq3 fork to evaluate Bonsai-style 4B/8B tier (PrismML's tuned ROCm wins those by ~25%, expected — bigger model tilts compute-bound).

Status: Shipping via Vulkan. Different lane from our ternary; complementary, not competing.

13. IRON / AIE2P NPU toolchain — AMD official

>"IRON (Iterative Resource Object Network) is the recommended path for hand-tuned AIE kernels on Phoenix and Strix NPUs. Compile via Peano (Xilinx/llvm-aie); dispatch via libxrt's xrt::kernel and xrt::bo*."*

Our path: Built IRON axpy reference: 160/160 tests pass on this NPU. Built matmul_vectorized_8x8x8_i8_i32 from stock mlir-aie AIE2P kernels: 0.93 ms bit-exact for 512×512×512 at i8. NPU dispatch wired via librocm_cpp_xdna.so.

Status: Toolchain proven. Ternary bitnet_gemm for the NPU is the next ship-gate. Until that lands in the production serve-path, the box is iGPU-native, not NPU-native.

Also tried and rejected (or watching, no integration yet)

  • Falcon-E (pre-quantized ternary) — failed to place on VitisAI EP, same BitLinear-not-supported root cause as MS BitNet. Rejected.
  • vlut.cpp (MobiSys'26 LUT mpGeMM) — watching, hasn't published kernel benchmarks for AMD.
  • RSR (ICML'25 log(n) ternary matmul) — watching, theoretical.
  • NanoQuant (PTQ-only sub-2-bit) — watching, PTQ doesn't beat our QAT path on small models.
  • LittleBit (~0.1 bpw) — watching, no weights published yet.
  • BTC-LLM (hardware-friendly sub-2-bit) — watching, no AMD kernel.
  • Wan 2.2 5D video — sd.cpp's ggml 4D loader can't ingest the 5D spatiotemporal patch_embedding. Waiting upstream or Wan 2.1 fallback.
  • Riallto (AMD NPU CV demo kit) — Phoenix-only, zero matmul kernels, paid Xilinx license. Skipped.

The bench, from start to now

All numbers measured on the same box, same lemond binary, same llama.cpp Vulkan version. Same lemond gateway around the AMD lemonade-sdk. 256-token essay completion, no prompt cache, sequential single-model-loaded harness (one model at a time, three warmups, three timed trials, median reported).

The progression for halo-1bit-2b-sherry-cpp (the flagship)

Date Build tok/s What changed
Pre-2026-04-22 broken sherry-v3 weights 57.6 Initial Sherry training — weights collapsed mid-run, model output was "cluster cluster mass" garbage
2026-04-25 morning post Run 5 retrain 73.0 New 2:4-sparse weights (loss 4.85 → 4.0972 over 5.66 B tokens on 3× H200 NVL on RunPod, ~$130 total). Coherent text restored. Kernel unchanged.
2026-04-26 morning (now) post round-4/5 + audit 74.5–76.7 Engine wires (_devscale GEMVs eliminating ~150 host-blocking hipMemcpy per token, mmap'd weight loader, split-KV INT8 attention, templated O_LOCAL). Same weights.

Net: 57.6 → 76.7 tok/s in ~3 weeks. The Run 5 retrain ($130 of pod time) gave +27%. The round-4/5 kernel pass gave another +5% on top, hidden under operational chaos for 12 hours until a fresh build + clean bench rig revealed it.

Today's full matrix

Median-of-3, sequential single-model-loaded, 256-token essay, 2026-04-26.

Ternary lane (rocm-cpp Engine — our hand-tuned HIP)

Model Quant Params GB on disk tok/s
halo-1bit-2b-sherry-cpp 1.25 bpw 3:4 2.0 B 1.65 74.5 (76.7 single-load)
halo-1bit-2b-sherry-v3 1.25 bpw 3:4 2.0 B 1.65 74.5
halo-1bit-2b-sherry-v4 1.25 bpw 3:4 2.0 B 1.65 74.5
halo-1bit-2b TQ2_0 1.58 bpw 2.0 B 1.84 67.0
halo-1bit-2b-tq1 TQ1_0 1.58 bpw 2.0 B 1.74 62.3
bonsai-1.7b-tq2-h1b TQ2_0 1.58 bpw 1.7 B 1.62 58.5
halo-bitnet-2b-tq2 TQ2_0 1.58 bpw 2.0 B 1.87 55.9
halo-bitnet-2b-tq2-pt TQ2_0 1.58 bpw 2.0 B 1.87 56.1

llama.cpp Vulkan lane (upstream, control group)

Model Quant Params GB on disk tok/s
smollm2-135m Q8_0 135 M 0.14 529.4
gemma-3-270m-it UD-IQ2_M 270 M 0.16 434.9
Bonsai-1.7B-gguf Q1_0 1.7 B 0.25 317.7
Llama-3.2-1B-Instruct UD-Q4_K_XL 1.0 B 0.83 190.8
deepseek-r1-distill-qwen-1.5b Q4_K_M 1.5 B 1.20 162.6
Qwen3-4B-GGUF Q4_0 4.0 B 2.38 73.0
Phi-4-mini-instruct Q4_K_M 3.8 B 2.49 68.8

Adjacent measurements

Embeddings (nomic-embed-text-v1, 768-dim):
   batch=1   : 96.7  embed/s
   batch=4   : 77.5  embed/s
   batch=16  : 557.8 embed/s
   batch=64  : 737.8 embed/s

TTFT Vulkan : 18-50 ms
TTFT ternary: 130-150 ms

WMMA FP16 measured           : 50.17 TFLOPS (wave32 microbench)
Wikitext-103 PPL             : 9.16 (chunked 1024)
Memory bandwidth on GEMV     : 92% of memory-bandwidth peak
NPU i8 matmul 512×512×512    : 0.93 ms bit-exact (AIE2P, IRON-built xclbin)
Kokoro TTS realtime factor   : 11.27×
Whisper-tiny.en STT round-trip: 0.25 s on 6.4 s clip
HTTP gateway /health latency : 3.84 ms / probe (cpp-httplib)
Concurrent models loadable   : 13 (LLM + embed + STT + TTS + image + agent)

Honest reading

  • rocm-cpp ternary ties Vulkan llama.cpp at the 2 GB band. sherry-cpp 74.5 tok/s vs Qwen3-4B 73.0 tok/s. Same throughput, but ours holds 2 B parameters in 1.65 GB vs Qwen3's 4 B in 2.38 GB. Same throughput, smaller footprint.
  • Below 1 GB, Vulkan llama.cpp wins raw chat speed. smollm2 530 tok/s, Bonsai 318 tok/s. We don't pretend otherwise.
  • The kernel is at the wall. 92% of memory-bandwidth peak on the weight-side bandwidth. The next throughput jump comes from packing density (Sherry → BitNet v2 W1.58A4), not from kernel rewrites.

The system around the kernel

The kernel is the headline; everything else is the support stack. Here's what runs on the box right now, all C++ where it matters:

  • lemond — C++20 OpenAI / Ollama / Anthropic-compatible HTTP gateway on :8180. Forked from lemonade-sdk/lemonade. We added one in-process backend (bitnet_server.cpp) that calls the HIP Engine directly. Everything else upstream.
  • rocm-cpp — our HIP kernel tree: ternary GEMV at 92% of bandwidth peak, split-KV Flash-Decoding attention (6.78× at L=2048), RoPE, RMSNorm fused with SiLU/GLU, h1b weight loader with mmap + hipHostRegister, INT8 KV cache, sd.cpp port for image gen. Built fresh today with TheRock-replacement clang at /opt/rocm/lib/llvm/bin/clang++.
  • agent-core — C++20 GAIA-compat shim. 36 endpoints (sessions, chat SSE, system, agents, tunnel, documents, files, mcp). Caught a SIGABRT-on-string-pad bug today during the audit.
  • kokoro / whisper / sd.cpp — TTS / STT / image gen as separate processes per the lemonade subprocess pattern. All native C++ binaries.
  • Caddy — reverse proxy on :8000 and :443, bearer auth, mesh routing, tunneled /realtime WS to lemond's libwebsockets server, tunneled /logs/stream WS to the same.
  • Headscale — private mesh. Scan a QR on your phone, you're in your own tailnet to the box.
  • AppImage — single-file portable bundle. Compiles HIP kernels machine-specific on first run because LDS tile sizes match your SoC's L1 exactly.

13 models load simultaneously. The system idles at 37 W on the AMD power meter. Do, or do not. There is no try.

Where this is going

Run 6 — BitNet v2 W1.58A4 — Monday. Continue from sherry-cpp on 3× H200 NVL on RunPod. Estimated $265, ~24-48 h. Hadamard rotation on activations drops them from FP16 to 4 bit; halves the activation-side bandwidth that the GEMV kernel wasn't already pulling. The hadamard_rotate_butterfly.hip kernel and the H1B_FLAG_HADAMARD_ROTATED model flag are already in the tree. Expected: a real jump, not the +5% we just got. That's the hypothesis. If it's wrong, I'll say so.

Run 7 — Qwen3-TTS QAT to ternary. Apache-2.0, 1.7 B all-Linear-RMSNorm Talker model, vocoder convs stay FP16. Estimated $155 / ~14 h. Replaces Kokoro at projected ~30-40× realtime (vs Kokoro's 11.27×) in 10 languages with voice cloning.

The NPU ship-gate stays. Until ternary bitnet_gemm runs on the NPU in the production serve path (not just the smoke-test xclbin), I do not call this stack "NPU-native" in any post. Toolchain works. Kernel still has to be written.

Beyond that — full cluster + MI300x access for big-stuff retrains. Sub-1-bit territory (LittleBit, BTC-LLM) when somebody publishes weights. Engram conditional-memory sparsity as a Run 9+ candidate.

Credit where it's due

The kernel author for everything above is Claude Opus 4.7 (1M context) running in Claude Code. I drive direction, Claude drives implementation. Every HIP file in rocm-cpp/, every line of the lemond fork patch, every sed-cull of dead code, every spec-compliance fix from today's six-agent audit — that's pair-programmed work. Naming the AI half is the honest framing; pretending I wrote 14 thousand lines of HIP solo would be the dishonest one. Number five is alive.

Thanks to u/zjm7891 (Zach) for early patience on iteration two when I was kidding myself about Rust scope. Thanks to the lemonade SDK team at AMD for pointing me at the in-process backend pattern more than once — the fact that lemond runs as a C++ in-process gateway at all is downstream of conversations with them. Thanks to PrismML for Bonsai (the 1-bit binary that Vulkan-blasts past 300 tok/s where our 1.58-bit ternary tops out at 75 — same ceiling, different density). Thanks to Tri Dao for Flash-Decoding without which long-context decode wouldn't survive on a single-box deploy. Thanks to AMD for the silicon, even when I'm cursing at their drivers.

Three links

No payment links. No "support us" buttons. Strangers who care will find the Sponsors button on my profile. Strangers who don't shouldn't be asked. I'm not selling anything.

The work is honest. The pair-programmer is named. The papers are cited. The numbers are real. The closet hums quietly. See you Monday with Run 6.

================================================================
  -=bong-water-water-bong=- gives you 1bit.systems v0.2.
  76.7 tok/s ternary.
================================================================
u/Creepy-Douchebag — 18 days ago

-=bong-water-water-bong=- gives you 1bit.systems.

Sorry. First, sorry. "Hello. My name is Inigo Montoya. You posted slop on my subreddit. Prepare to die." The first two times I came at this idea it was AI-slop spam, half-baked benchmarks, bad framing, just noise. I owe the sub for that. This is iteration three. I'm not asking you to forget the earlier crap — I'm asking you to give me one more shot. Wax on. Wax off.

Mad props to u/zjm7891 (Zach). The first two iterations were going nowhere. Zach watched me spin out, told me where I was kidding myself, pointed at the actual hardware bottlenecks, and pushed me toward the ROCm + native HIP path I was avoiding. Whatever's good in version three started in conversations with him. Whatever still sucks is on me.

What this is

1bit.systems is a local AI stack for the AMD Ryzen AI MAX+ 395 (Strix Halo) mini-PC. It's a $2,000 box. It runs eight ternary LLMs side by side, plus embeddings, plus image, plus speech, plus an agent UI, without Python in the runtime path, without an API key, without a cloud round-trip.

It's all C++ end to end. lemonade-server fork in C++, our HIP kernels in C++, a C++ shim for the GAIA agent backend, our cabinet of vanilla-JS canvas games on top, all served by Caddy in front. I know kung fu. The whole thing wakes up on systemctl and gets out of your way.

If you have a Strix Halo mini-PC and want to try it right now, scroll to the bottom — there's an AppImage ready.

How we got here (it took heart failures)

I'm going to be honest about the wreckage because that's what made the third version actually work. Houston, we had a problem.

Iteration 1 — MLX panic. I tried to run BitNet through MLX-on-ROCm. MLX has no ternary mode on the AMD path. Warmup blew up. Game over, man. Game over. Reddit got the "look at me" post anyway. Big mistake. Took it down.

Iteration 2 — Rust everywhere. I rewrote the whole orchestrator in Rust. axum, tokio, all the stuff. It worked. Inconceivable! It was also a 28-crate workspace I couldn't finish, and the kernels were still calling into hipBLAS through FFI, which is exactly what I'd told myself I wasn't going to do. The benchmark numbers I posted were prompt-cached, not steady-state. Roads? Where we're going we don't need roads — but apparently we did need to ban hipBLAS in the runtime path. I pulled it.

Iteration 3 — what's live today. Strip everything. Rule A: no Python in the serving path. Rule B: C++20 default. Rule C: hipBLAS is banned, native Tensile only. Drop Rust on the engine side. Keep Rust where it earns its place (the desktop client and one or two crates that aren't on the hot path), put C++ everywhere else. Rebuild from the kernel up.

That's where the third iteration starts. I'll be back. With C++.

The C++ stack, briefly

  • lemond — C++ LLM server. OpenAI / Ollama / Anthropic compatible. Forked from lemonade-sdk/lemonade. We patched in an in-process ternary backend (bitnet_server.cpp) so our .h1b weights run without spawning a subprocess. Rebuilt today with a sink.done() chunked-encoding fix because browsers were dropping streams mid-flight (curl tolerated it, browsers didn't — that took most of an afternoon).
  • rocm-cpp — Our HIP kernels for gfx1151. Wave32 WMMA, ternary GEMV, split-KV flash-decoding attention. Hand-tuned. The ternary GEMV runs at 92% of LPDDR5x peak — the entire decode path is bandwidth-limited, not compute-limited, which is the whole point of ternary.
  • 1bit-agent — C++20 agent core ported from amd/gaia. Exposes 36 GAIA-compatible HTTP endpoints (sessions, chat SSE, documents + RAG, MCP registry, tunnel, files). 11 of 11 ctest green. Hosts the upstream amd/gaia React UI we vendored.
  • llama.cpp Vulkan + ggml-hip — kept where it wins. We don't have brand loyalty. If a backend beats ours on a model, that's the one that runs.
  • Caddy — reverse proxy. SSE, WebSockets, mobile-pairing, IPFS gateway. One config file.
  • Headscale — private mesh. Scan a QR code on your phone, you're in.

Everything you can install today.

The papers we actually read and benched against

Not "I read the abstract." We tried these on the box. Mess with the best, die like the rest.

  • BitNet 1.58 (arxiv 2402.17764, Microsoft Research) — the 1.58 bpw ternary baseline. We pack at TQ2_0 by default; TQ1 lossless variant also available.
  • BitNet v2 / Hadamard activations (arxiv 2504.18415) — native W1.58A4 path. Staged for Run 6 after Sherry lands. Obsoletes the BitNet a4.8 plan.
  • Sherry 1.25-bit (community ternary paper, March 2026) — 3:4 sparsity packing, 2 bits zero-position + 3 bits sign per group of 4. Our halo-1bit-2b-sherry-cpp is the best ternary on the box right now at 75 tok/s. The retrain to fix the broken weights is currently running on 3× H200 NVL — see "honest disclosure" below.
  • TriLM Spectra-1.1 (SpectraSuite, Apache 2.0) — drove our model-builder export path; we tested the LLaMA-arch ternary export end-to-end (TriLM_3.9B_Unpacked → MatMulNBits N=4 → 2.9 GB ONNX) before deciding the AMD VitisAI EP wasn't ready and the in-process HIP path wins.
  • MedusaBitNet 2B (parrishcorcoran/MedusaBitNet-2B-4T) — speculative-decode head set we wired against our halo-1bit-2b. K=4 head benched at 47.7 tok/s, 0.7× baseline, lm_head-bottlenecked at small M. Deferred until tree-attention verify lands.
  • Sparse-BitNet (2:4) (arxiv 2603.05168, March 2026) — 1.58-bit + N:M sparsity jointly. This is what's training right now on the H200 pod. When it lands the Sherry weights get swapped and the ternary numbers jump.
  • Flash-Decoding (Tri Dao et al.) — split-KV. We measure 6.78× at L=2048 vs the naive single-block-per-head attention kernel. Bit-exact.
  • HGF, T-SAR, KVTQ, BitNet a4.8, BitDistill — research frontier we track quarterly, not all integrated.

Plus the AMD side: IRON for the NPU (XDNA 2, AIE2P), Peano + libxrt for our custom kernel lane. Our i8 matmul 512×512×512 runs in 0.93 ms on the NPU, ternary GEMV on the same hardware in 0.27 ms. NPU dispatch is wired but the ternary serve-path is the next ship-gate, not today.

The actual numbers (steady-state, 256-token essay, no cherry-picking)

MODEL                              BACKEND          tok/s   median-of-3 trials
smollm2-135m                       llama.cpp Vulkan   530   529.94 / 531.83 / 521.06   (best tiny)
gemma-3-270m-it                    llama.cpp Vulkan   443   435.89 / 445.58 / 443.28
Bonsai-1.7B-gguf                   llama.cpp Vulkan   330   326.69 / 329.65 / 329.94   (best chat)
deepseek-r1-distill-qwen-1.5b      llama.cpp Vulkan   168   167.94 / 168.28 / 167.57
halo-1bit-2b-sherry-cpp            rocm-cpp ternary    73    72.61 /  72.72 /  72.76   (best ternary, post-Run-5)
halo-1bit-2b-sherry-v3             rocm-cpp ternary    73    72.67 /  72.59 /  72.66
halo-1bit-2b-sherry-v4             rocm-cpp ternary    73    72.72 /  72.70 /  72.72
halo-1bit-2b                       rocm-cpp ternary    65    65.40 /  65.33 /  65.35
halo-1bit-2b-tq1                   rocm-cpp ternary    60    60.35 /  60.34 /  60.27
bonsai-1.7b-tq2-h1b                rocm-cpp ternary    59    58.41 /  58.74 /  58.78
halo-bitnet-2b-tq2                 rocm-cpp ternary    56    55.66 /  55.67 /  55.68
halo-bitnet-2b-tq2-pt              rocm-cpp ternary    56    55.65 /  55.67 /  55.65

Embeddings (nomic, 768-dim, batch=16):  895 embed/s
TTFT Vulkan:  50 ms       TTFT ternary:  130 ms
WMMA FP16 measured: 50.17 TFLOPS
Wikitext-103 PPL: 9.16 (chunked 1024)
Memory bandwidth: 92% of LPDDR5x peak on ternary GEMV
NPU i8 matmul 512×512×512: 0.93 ms bit-exact

Conditions: 256-token essay completion, stream=false, lemond :8180 → in-process backend.
Re-run today, 2026-04-25, *after* Run 5 weights landed. No cherry-picking. No prompt cache.

Honest reading. llama.cpp Vulkan wins raw throughput on small / quantized GGUF models. Our rocm-cpp ternary wins where there's no comparable Q1_0 GGUF and on the halo-1bit-2b family. Post-Run-5, sherry-cpp lands at 73 tok/s — about 2 tok/s under the broken pre-retrain weights, because the 2:4 sparsity pattern shifted slightly during retrain. The win wasn't speed, it was correctness: loss dropped from 4.85 mid-run to 4.0972 at step 9600/9600, with 96 ternary tensors verified 2:4-clean at every checkpoint. We're already at 92% of LPDDR5x peak — decode is bandwidth-locked, not compute-locked. The next throughput jump comes from packing density (Sherry was step 1, BitNet v2 W1.58A4 is next), not from this retrain. Life moves pretty fast. If you don't stop and bench once in a while, you could miss it.

Honest disclosure

  • The earlier Sherry weights were partially broken. The 5.66-billion-token Sparse-BitNet 2:4 retrain on 3× H200 NVL on RunPod finished: 9,600/9,600 opt-steps, final loss 4.0972 (down from 4.85 mid-run), 96 ternary tensors verified 2:4-correct at every checkpoint. The new halo-1bit-2b-sherry-cpp.h1b is live in the AppImage. Bench numbers in the table above are post-retrain (re-run today).
  • The NPU has the kernels but the production serve-path through ternary GEMV is integration-pending. NPU prefill on an i8 matmul does work today (0.93 ms), but lemond doesn't dispatch decode through it yet.
  • The 1bit-helm desktop tray app is still Rust (egui). Caller-side, so it doesn't violate Rule A. We may rewrite later or we may not — it earns its place where it is.
  • I'm one person with a kid, a real job, and a Strix Halo mini-PC in a closet. This is the third iteration. It works. It is not finished.

Try it now

Show me the money. If you have a Strix Halo box, this is genuinely a one-line install today.

curl -sSL https://1bit.systems/install.sh | sh

That drops an AppImage on disk, sets up the systemd unit, and bootstraps three default models (Bonsai-1.7B-gguf via Vulkan, nomic-embed for RAG, halo-1bit-2b ternary as fallback). Open http://<your-strix-halo-ip>:8000/app/ from your laptop on the same LAN. Type. You'll see ~330 tok/s on a $2,000 mini-PC, no API key, no Python at runtime, no cloud.

If you want to deeper-dive: same site, sidebar nav, all the docs. https://1bit.systems

If you have a Strix Halo and you want to help: I am at the limit of what one person can ship here. Real ones welcome.

What I'm asking the community

I need a hero. This is built primarily for AMD Strix Halo (the Ryzen AI MAX+ 395 mini-PC). Other AMD boards (gfx1201 / RX 9070 XT, gfx1100 / RDNA 3) are second targets — the fat binary covers them but I can't soak-test them. I need help in three places:

  1. Run the AppImage on your Strix Halo and tell me what breaks.
  2. Run it on gfx1201 or gfx1100 and tell me what breaks.
  3. Look at the kernels in bong-water-water-bong/rocm-cpp if HIP is your thing. The ternary GEMV is at 92% of LPDDR5x; squeezing the last 8% needs a tuner pass and rocprof traces I don't have time to do alone.

GitHub: https://github.com/bong-water-water-bong/1bit-systems Wiki: https://github.com/bong-water-water-bong/1bit-systems/wiki Discord: https://discord.gg/dSyV646eBs

Mobile pairing (Headscale + Tailscale): on the site, under #mobile. Scan a QR, you're in your own mesh.

One more time,

I burned attention twice. This third run is the honest one. Do, or do not. There is no try. C++ end to end, real numbers, real papers cited, real failures noted, AppImage on disk, Sherry retrain in flight, mesh and mobile working, kid asleep upstairs, kernel author standing by.

Be excellent to each other. And party on, dudes.

Thanks to u/zjm7891 (Zach) for the patience. Thanks to Light Heart Labs for the early belief. Thanks to the Lemonade SDK team — they pointed me at the right path more than once. The fact that lemond runs as a C++ in-process backend at all is downstream of conversations with them. Wax on. Wax off. Thanks to AMD for the silicon, even when I'm cursing at their drivers.

Running on a $2,000 mini-PC, no cloud, no API key, no Python.

Sincerly

Creepy-Douchebag

u/Creepy-Douchebag

PS: I can't help myself, benchy's for the nerds.

================================================================
1bit.systems NASA-GRADE BENCH DUMP
2026-04-25 20:54:10 ADT
strixhalo · Ryzen AI MAX+ 395 · gfx1151 · 128 GB LPDDR5x-8000
lemond :8180 · in-process backend · steady-state · no prompt cache
================================================================

=== Hardware probe ===
AMD RYZEN AI MAX+ 395 w/ Radeon 8060S
32 logical cores
124Gi RAM, 6.3Gi free
GPU[0]		: Card Series: 		Radeon 8060S Graphics
GPU[0]		: Card Model: 		0x1586
GPU[0]		: Card Vendor: 		Advanced Micro Devices, Inc. [AMD/ATI]
GPU[0]		: Temperature (Sensor edge) (C): 44.0
GPU[0]		: Current Socket Graphics Package Power (W): 37.056
  NPU Firmware Version : 1.0.0.166
Device(s) Present
|[0000:c6:00.1]  |RyzenAI-npu5  |

=== Loaded model count ===
13 models loaded simultaneously:
  Bonsai-1.7B-gguf                           llamacpp         gpu
  nomic-embed-text-v1-GGUF                   llamacpp         gpu
  user.halo-1bit-2b-sherry-cpp               1bit-ternary     gpu
  Llama-3.2-1B-Instruct-GGUF                 llamacpp         gpu
  Phi-4-mini-instruct-GGUF                   llamacpp         gpu
  Qwen3-4B-GGUF                              llamacpp         gpu
  user.halo-1bit-2b-sherry-v4                1bit-ternary     gpu
  user.halo-1bit-2b-sherry-v3                1bit-ternary     gpu
  user.halo-1bit-2b                          1bit-ternary     gpu
  user.halo-bitnet-2b-tq2                    1bit-ternary     gpu
  user.halo-1bit-2b-tq1                      1bit-ternary     gpu
  user.halo-bitnet-2b-tq2-pt                 1bit-ternary     gpu
  user.bonsai-1.7b-tq2-h1b                   1bit-ternary     gpu

=== LLM throughput (256-tok essay, median-of-3) ===
MODEL                                        BACKEND            TRIALS (tok/s)                   MEDIAN
user.halo-1bit-2b-sherry-cpp                 rocm-cpp ternary   72.38 72.66 72.62                72.6
user.halo-1bit-2b-sherry-v3                  rocm-cpp ternary   71.45 71.65 71.61                71.6
user.halo-1bit-2b-sherry-v4                  rocm-cpp ternary   71.41 71.47 71.42                71.4
user.halo-1bit-2b                            rocm-cpp ternary   65.29 65.30 65.32                65.3
user.halo-1bit-2b-tq1                        rocm-cpp ternary   56.74 56.71 56.80                56.7
user.bonsai-1.7b-tq2-h1b                     rocm-cpp ternary   58.84 58.80 58.81                58.8
user.halo-bitnet-2b-tq2                      rocm-cpp ternary   52.51 52.55 52.52                52.5
user.halo-bitnet-2b-tq2-pt                   rocm-cpp ternary   52.79 52.74 52.70                52.7
Bonsai-1.7B-gguf                             llama.cpp Vulkan Q1_0 267.57 227.91 243.17             243.2
user.smollm2-135m                            llama.cpp Vulkan Q8_0 504.07 532.11 526.16             526.2
user.gemma-3-270m-it                         llama.cpp Vulkan IQ2_M 342.52 447.87 450.14             447.9
user.deepseek-r1-distill-qwen-1.5b           llama.cpp Vulkan Q4_K_M 163.70 167.75 167.84             167.8
Llama-3.2-1B-Instruct-GGUF                   llama.cpp Vulkan Q4_K_XL 191.58 198.89 198.99             198.9
Phi-4-mini-instruct-GGUF                     llama.cpp Vulkan Q4_K_M 71.14 70.50 67.29                70.5
Qwen3-4B-GGUF                                llama.cpp Vulkan Q4_0 76.72 70.96 68.63                71.0
Qwen3-8B-GGUF                                llama.cpp Vulkan Q8_0 24.33 23.98 23.98                24.0

=== TTFT (first-token latency, ms, median-of-3) ===
MODEL                                        BACKEND            TIMES (ms)
Bonsai-1.7B-gguf                             llama.cpp Vulkan   40.8 24.0 27.1 
user.smollm2-135m                            llama.cpp Vulkan   120.6 19.4 18.2 
user.halo-1bit-2b-sherry-cpp                 rocm-cpp ternary   148.3 148.4 146.4 
user.halo-1bit-2b                            rocm-cpp ternary   164.4 164.1 164.3 

=== Long-context decode (sherry-cpp, 64/256/1024/2048 tokens) ===
  L=64    56.08 tok/s
  L=256   67.22 tok/s
  L=1024  70.15 tok/s
  L=2048  69.90 tok/s

=== Embeddings (nomic-embed-text-v1, 768-dim) ===
  batch=1     dim=768   96.7 embed/s
  batch=4     dim=768   77.5 embed/s
  batch=16    dim=768   557.8 embed/s
  batch=64    dim=768   737.8 embed/s

=== TTS (kokoro, am_michael) ===
  http 200 size 616844
  6.425000s audio in 0.57s = 11.27x realtime

=== STT (whisper) ===
  0.25s elapsed → " The ternary kernel runs at 92% of LPDDR 5 XP."

=== Image gen (sd-cpp on :8081) ===
  sd service up — see /sd/* for txt2img

=== NPU (XDNA 2 / AIE2P) ===
  NPU Firmware Version : 1.0.0.166
Device(s) Present
|[0000:c6:00.1]  |RyzenAI-npu5  |

=== Wikitext-103 PPL (chunked 1024, halo v2) ===
  baseline 9.16 (gen-1 reference 9.1607, ±0.05 tolerance, last benchmarks/ppl-gen2.sh run)

=== WMMA FP16 TFLOPS ===
  measured 50.17 TFLOPS gfx1151 wave32 WMMA (rocm-cpp microbench)

=== Memory bandwidth ceiling ===
  ternary GEMV measured 92% of LPDDR5x-8000 peak (256 GB/s nominal)
  decode is bandwidth-locked; tok/s scales with GB-touched-per-token

=== Concurrent decode (8 ternary models loaded simultaneously) ===
  see 'Loaded model count' section above — proves multi-model cohabitation
  cold model swap < 200 ms; hot model dispatch zero-copy via in-process Engine

=== HTTP gateway latency ===
  100 sequential /health probes: 3.84 ms each (cpp-httplib + nlohmann/json)

================================================================
END OF DUMP. The kernel is at 92% of peak. The next throughput
jump comes from packing density, not retraining: BitNet v2 W1.58A4
(arXiv 2504.18415) drops activations to 4 bit. Run 6 next.
================================================================
reddit.com
u/Creepy-Douchebag — 19 days ago