FLM on Strix Halo Linux: where I think the NPU fits
I spent some time testing FastFlowLM (FLM) on a Ryzen AI MAX+ 395 Strix Halo box under Ubuntu 24.04, running a kernel new enough to carry the in-tree amdxdna driver (6.14 or later).
Short version: the NPU is real and useful, but I do not think it should be the only local inference backend.
My rough FLM numbers, using FLM's own counters:
| Model | Decode (t/s) |
|---|---|
| qwen3:0.6b | 93 |
| qwen3:1.7b | 42 |
| llama3.2:3b | 25 |
| qwen3:4b | 19 |
| qwen3:8b | 11 |
| llama3.1:8b | 11 |
| gpt-oss:20b | 20 |
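To put those decode rates in wall-clock terms, here is a trivial conversion. The ~300-token reply length is just an illustrative assumption, and prefill time is ignored:

```python
# Illustrative only: convert the decode rates above into wall-clock time
# for a ~300-token reply. Prefill (prompt processing) is not included.
decode_rates = {
    "qwen3:0.6b": 93, "qwen3:1.7b": 42, "llama3.2:3b": 25,
    "qwen3:4b": 19, "qwen3:8b": 11, "llama3.1:8b": 11, "gpt-oss:20b": 20,
}

REPLY_TOKENS = 300  # assumed typical assistant reply length

for model, tps in decode_rates.items():
    print(f"{model:12s} {REPLY_TOKENS / tps:5.1f} s per ~{REPLY_TOKENS}-token reply")
```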
The interesting part is not just raw speed. It is where the NPU is useful.
Small models are great. A 0.6B or 1.7B model on the NPU is exactly what I want for always-on local assistant work: routing, summarization, command interpretation, RAG glue, tool selection, and background agents. It can stay hot without waking up the whole APU.
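As a concrete sketch of that lane, here is roughly what the "routing" job looks like against a local OpenAI-compatible endpoint. FLM advertises an OpenAI-compatible server mode, but the base URL, port, model name, and route labels below are assumptions for illustration, not documented defaults; point it at whatever your server actually exposes:

```python
# Minimal sketch of the always-on routing lane: ask a small NPU-hosted model
# to classify an incoming request. Assumes the backend serves an
# OpenAI-compatible API at BASE_URL (adjust URL/port/model to your setup).
import requests

BASE_URL = "http://localhost:11434/v1"  # assumption, not a documented FLM default
MODEL = "qwen3:1.7b"                    # small always-on model

def route(user_message: str) -> str:
    """Ask the small model to pick one coarse route for the request."""
    resp = requests.post(
        f"{BASE_URL}/chat/completions",
        json={
            "model": MODEL,
            "messages": [
                {"role": "system",
                 "content": "Reply with exactly one word: search, summarize, code, or chat."},
                {"role": "user", "content": user_message},
            ],
            "max_tokens": 5,
            "temperature": 0,
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"].strip().lower()

# route("Condense this meeting transcript into five bullets")  ->  "summarize"
```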
The 3B/4B tier is also practical. llama3.2:3b at ~25 t/s feels like a real local assistant lane, especially if power matters.
Dense 8B models are less exciting at ~11 t/s. Usable, but that is probably where I would rather use the iGPU with llama.cpp/Vulkan or ROCm once the stack matures.
The surprise is MoE. gpt-oss:20b at ~20 t/s is about the same as qwen3:4b, which makes sense: the model is ~20B parameters in total, but only a few billion are active per token, so the per-token weight traffic is close to that of a dense 4B model. That may be the real NPU sweet spot: not huge dense models, but efficient low-active-parameter MoE.
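A quick back-of-envelope model makes the point: decode is roughly memory-bandwidth bound, so tokens per second track the bytes of weights read per token, which scale with active parameters rather than total parameters. The quantization width and effective bandwidth below are illustrative assumptions, not measurements of this hardware:

```python
# Back-of-envelope: decode is roughly memory-bandwidth bound, so tokens/s
# scale with *active* parameters. All inputs are rough assumptions.
def est_decode_tps(active_params_b: float, bytes_per_param: float,
                   effective_bw_gbs: float) -> float:
    """tokens/s ~= effective bandwidth / (weight bytes read per token)."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return effective_bw_gbs * 1e9 / bytes_per_token

# Dense ~4B model vs. a ~20B MoE with a few billion active parameters:
# both read a similar number of bytes per token, hence similar decode speed.
BW = 40.0          # assumed effective bandwidth seen by the NPU path, GB/s
BYTES = 0.5        # assumed ~4-bit quantization
print(est_decode_tps(4.0, BYTES, BW))   # dense ~4B
print(est_decode_tps(3.6, BYTES, BW))   # MoE with ~3.6B active params
```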
So my current mental model is:
- NPU/FLM: low-power always-on lane, small models, supported MoE, agent plumbing
- iGPU/llama.cpp Vulkan: general local LLM lane, GGUF ecosystem, bigger dense models
- CPU: fallback and tiny utility models
- A router above all of it: one API, multiple backends (sketched below)
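Here is a minimal sketch of what that router could look like. The backend names, URLs, ports, and the model-to-lane table are placeholders for illustration, not real endpoints or an existing project:

```python
# Sketch of the "router above all of it": one entry point that picks a
# backend per model. Every URL and mapping here is a placeholder.
from dataclasses import dataclass

@dataclass
class Backend:
    name: str
    base_url: str  # assume each backend speaks an OpenAI-compatible API

NPU  = Backend("flm-npu",       "http://localhost:11434/v1")  # placeholder
IGPU = Backend("llamacpp-igpu", "http://localhost:8080/v1")   # placeholder
CPU  = Backend("llamacpp-cpu",  "http://localhost:8081/v1")   # placeholder

# Small models and supported MoE go to the NPU lane, bigger dense models
# to the iGPU lane, with the CPU lane as the fallback.
MODEL_LANES = {
    "qwen3:0.6b":  NPU,
    "qwen3:1.7b":  NPU,
    "gpt-oss:20b": NPU,
    "llama3.1:8b": IGPU,
}

def pick_backend(model: str) -> Backend:
    """Dispatch a request to the lane configured for this model."""
    return MODEL_LANES.get(model, CPU)  # CPU as the fallback lane

print(pick_backend("gpt-oss:20b").name)  # -> flm-npu
```

The point is the shape, not the code: the caller sees one API, and which silicon serves the request is a configuration detail.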
Windows seems closer to a unified vendor-supported plane for Ryzen AI. Linux does not have that yet. On Linux, FLM is the thing that makes the NPU useful today, but it is still a separate runtime with its own format, model list, bugs, and licensing.
The licensing is also worth noting. FastFlowLM says the orchestration/CLI code is MIT, but the NPU kernels are proprietary binaries. Their README says commercial use is free for companies under USD 10M in annual revenue; above that you need a commercial license. That is fine for hobby projects and probably many small products, but it is not the same risk profile as a fully open stack like llama.cpp.
My takeaway: FLM is a very useful backend, but not the foundation. The foundation should be a Linux inference router that can dispatch across NPU, iGPU, and CPU. FLM plugs into that as the NPU lane.
That is probably the shape Strix Halo wants: not one runtime to rule everything, but a local inference plane that knows which silicon to use for each job.