u/tecneeq

Images 1 & 2 — Before and after of my homelab
▲ 195 r/StrixHalo+1 crossposts

Before and after of my homelab

  • Bosgame M5 (Strix Halo, 128 GB RAM; runs Proxmox, a bunch of LXCs and Docker containers, and Qwen3.6-35B-A3B Q8 at 60 t/s at full context)
  • 2x 4-bay USB3 thingies with 6x 26 TB and 2x 16 TB drives
  • Tec Mojo 10" Rack with 12U
  • Flint 2 WLAN router (runs Docker with Portainer and PiHole)

It'll all end up in the same place the mess was before; this placement is just temporary during installation.

edit #1: 60 instead of 80 t/s.

u/tecneeq — 6 days ago
▲ 87 r/StrixHalo+1 crossposts

Some of you saw our post a couple of weeks back about hitting a stable 102 tok/s on Qwen3.5-35B on a DGX Spark. A lot of you asked "cool, where's the code?" Today's the day: GitHub

Atlas is open source. Pure Rust + CUDA: no PyTorch, no Python runtime, ~2.5 GB image, <2 minute cold start. We rewrote the whole stack, from HTTP handler to kernel dispatch, because the bottleneck on Spark wasn't the silicon; it was 20+ GB of generic Python machinery sitting between your prompt and the GPU. We need community support to keep improving Atlas for developers.

Numbers on a single DGX Spark (GB10):

  • Qwen3.5-35B (NVFP4, MTP K=2): 130 tok/s peak, ~111 tok/s sustained → 3.0–3.3x vLLM at time of testing
  • Qwen3.5-122B (NVFP4, EP=2): ~50 tok/s decode
  • Qwen3-Next-80B-A3B (NVFP4, MTP): ~87 tok/s
  • Nemotron-3 Nano 30B (FP8): ~88 tok/s

Full model matrix on the site (MiniMax M2.7, Qwen3.6, Gemma too!)

What's actually different:

  • Hand-tuned CUDA kernels for Blackwell SM120/121 covering attention, MoE, GDN, and Mamba-2. No generic fallbacks.
  • Native NVFP4 + FP8 on tensor cores
  • MTP (Multi-Token Prediction) speculative decoding for up to 3x decode throughput
  • OpenAI + Anthropic APIs on the same port; works with Claude Code, Cline, OpenCode, and Open WebUI out of the box
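For anyone unfamiliar with why MTP helps decode throughput: it's the usual speculative-decoding loop, where a cheap head drafts K tokens ahead and the main model verifies them in a single pass, so each big-model step can emit several tokens instead of one. A toy greedy sketch of that accept/verify logic (the `draft`/`verify` "models" here are stand-in arithmetic rules, not Atlas's actual MTP head):

```python
# Toy sketch of greedy speculative decoding with a K-token draft head.
# Both "models" are trivial next-token rules so the loop is runnable;
# in Atlas the draft comes from the MTP head and verification from the
# full model's forward pass.

K = 2  # tokens drafted per step (matches the MTP K=2 setting above)

def draft(prefix, k):
    # Cheap draft: propose k tokens ahead (toy rule: last token + 1).
    out = list(prefix)
    for _ in range(k):
        out.append(out[-1] + 1)
    return out[len(prefix):]

def verify(prefix, proposed):
    # Big model checks each drafted token greedily; accept until the
    # first disagreement, then emit the corrected token for free.
    accepted = []
    ctx = list(prefix)
    for tok in proposed:
        target_tok = ctx[-1] + 1 if ctx[-1] < 5 else 0  # toy target model
        if tok == target_tok:
            accepted.append(tok)
            ctx.append(tok)
        else:
            accepted.append(target_tok)  # mismatch: keep the correction
            break
    return accepted

def generate(prompt, n_tokens):
    seq = list(prompt)
    while len(seq) < len(prompt) + n_tokens:
        seq.extend(verify(seq, draft(seq, K)))
    return seq[len(prompt):len(prompt) + n_tokens]

print(generate([1], 6))
```

When the draft agrees (most steps, for a well-trained head), each verification pass yields K tokens; when it disagrees you still get one correct token, which is where the "up to 3x" on decode comes from at K=2.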

Try it (two commands):

docker pull avarok/atlas-gb10:latest
sudo docker run -d --name atlas --network host --gpus all --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  avarok/atlas-gb10:latest serve Qwen/Qwen3.6-35B-A3B-FP8 \
  --port 8888 --speculative --enable-prefix-caching
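Once the container is up, any OpenAI-compatible client should work against it. A minimal stdlib-only sketch, assuming the standard `/v1/chat/completions` path on the port and model name from the command above:

```python
# Minimal client for the server's OpenAI-compatible endpoint.
# Assumption: the container above is serving on localhost:8888 and the
# model id matches the `serve` argument. Payload is the standard
# OpenAI chat-completions shape.
import json
import urllib.request

def build_request(prompt, base_url="http://localhost:8888"):
    payload = {
        "model": "Qwen/Qwen3.6-35B-A3B-FP8",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_request("Say hi in one word.")
# To actually send it (server must be running):
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp)["choices"][0]["message"]["content"])
print(req.full_url)
```

The same port also speaks the Anthropic API per the post, so tools like Claude Code can point at it directly.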

What's next, especially for the non-Spark folks: we're working with Spectral Compute on a Strix Halo port, and AMD is giving us hardware to do it properly. RTX 6000 Pro Blackwell is also on the roadmap. Same kernel philosophy, adapted per chip: we'd rather do four chips well than twenty chips badly.

X/Twitter
Site
Discord

Will be in comments all day. Hit us with edge cases, weird models, broken configs. The roadmap is genuinely community-driven. MiniMax M2.7 landed because someone in Discord asked.

u/Live-Possession-6726 — 6 days ago