u/Diligent-End-2711

VLA RL based on π0.5

🚀 I’ve implemented the RL pipeline introduced in the π0.6 RECAP paper and brought VLA RL fully onto the π0.5 stack.

Our current pipeline now supports:

• End-to-end VLA RL training & inference
• RECAP-style advantage-conditioned policy training
• QLoRA fine-tuning optimization
• Unified PyTorch + JAX execution paths
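Of these, the QLoRA piece is the easiest to illustrate. Here is a minimal numpy sketch of the LoRA adapter arithmetic only — the class name and init scheme are illustrative, not FlashRT's actual code, and real QLoRA additionally stores the frozen base weight in 4-bit:

```python
import numpy as np

class LoRALinear:
    """Frozen base weight plus a trainable low-rank (LoRA) update.

    QLoRA additionally quantizes the frozen base to 4-bit; the base is
    kept in float here to keep the sketch dependency-free — only the
    adapter arithmetic is shown.
    """
    def __init__(self, in_f, out_f, rank=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((out_f, in_f))  # frozen pretrained weight
        self.A = np.zeros((rank, in_f))              # trainable adapter (down-proj)
        self.B = np.zeros((out_f, rank))             # trainable adapter (up-proj)
        self.scale = alpha / rank

    def __call__(self, x):
        # y = x W^T + scale * x A^T B^T ; only A and B receive gradients
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(32, 16)
x = np.ones((1, 32))
# with B zero-initialized, the output equals the frozen base output
assert np.allclose(layer(x), x @ layer.W.T)
trainable = layer.A.size + layer.B.size
print(trainable)  # 8*32 + 16*8 = 384 adapter params vs 512 frozen
```

The point of the design: the optimizer state only covers the 384 adapter parameters, which is where most of the VRAM savings come from.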

On the systems side, I also optimized the full RL runtime stack:

⚡ Up to 5× faster RL inference
⚡ Up to 2.2× faster QLoRA fine-tuning
⚡ Full pipeline running in only ~10GB VRAM

This includes:
• value function training
• ACP annotation
• RL policy fine-tuning
• CFG-guided inference
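For intuition, the value-function + advantage-annotation steps can be sketched like this — a toy numpy sketch of my understanding of advantage conditioning, where the function names and the simple binarization rule are illustrative, not the repo's or the paper's actual code:

```python
import numpy as np

def annotate_advantages(rewards, values, gamma=0.99):
    """Score each step of a trajectory offline: Monte-Carlo return
    minus the learned value estimate (the annotation step)."""
    returns = np.zeros_like(rewards, dtype=float)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns - values  # advantage estimate per step

def advantage_tokens(advantages):
    """Binarize advantages into a conditioning token (1 = better than
    the value baseline). The policy is trained on (obs, token) -> action,
    then conditioned on token=1 at inference to imitate the good behavior."""
    return (advantages > 0).astype(int)

# toy 3-step trajectory with a terminal reward
rewards = np.array([0.0, 0.0, 1.0])
values = np.array([0.2, 0.5, 0.8])
adv = annotate_advantages(rewards, values, gamma=1.0)
print(advantage_tokens(adv))  # [1 1 1] — all returns beat the baseline here
```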

This makes real VLA RL experimentation practical on consumer GPUs instead of requiring multi-H100 setups.

Would love for more people in the VLA / robotics community to try it out and give feedback.

https://github.com/LiangSu8899/FlashRT


u/Diligent-End-2711 — 7 days ago

Hi everyone,

I’m an independent developer with a background in algorithms, HPC, and robotics infrastructure. Recently I’ve been working on a lightweight inference engine built around hand-written CUDA kernels, focusing on small-batch and real-time performance (especially for VLA and robotics workloads).

Here are some recent results on Thor and Blackwell:

  • Pi0.5 — Jetson AGX Thor (SM110): 44 ms (23 Hz)
  • Pi0 — Jetson AGX Thor (SM110): 46 ms (22 Hz)
  • Pi0.5 — RTX 5090 (SM120): 17.58 ms (57 Hz)
  • Pi0 — RTX 5090 (SM120): 18.43 / 21.16 / 24.48 ms (54 / 47 / 41 Hz)
  • GROOT N1.6 — Jetson AGX Thor: 45 ms (T=50) / 41 ms (T=16) → 22 / 24 Hz
  • GROOT N1.6 — RTX 5090: 13.08 ms (T=50) / 12.53 ms (T=16) → 76 / 80 Hz
  • Pi0-FAST (token)
    • Thor: 8.1 ms/token (123 tok/s)
    • RTX 5090: 2.39 ms/token (418 tok/s)
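(For reference, the Hz figures above are just the reciprocal of the per-call latency, rounded:)

```python
def hz(latency_ms):
    """Control rate implied by a single inference call of the given latency."""
    return 1000.0 / latency_ms

for name, ms in [("Pi0.5 @ RTX 5090", 17.58), ("Pi0 @ Thor", 46.0)]:
    print(f"{name}: {hz(ms):.0f} Hz")  # 57 Hz and 22 Hz, matching the table
```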

The focus is on pushing true real-time inference under small-batch settings, which tends to be underserved by typical large-batch optimized stacks.

Still early, but happy to share more details or discuss if anyone is working on similar workloads 🙂

Feedback welcome! https://github.com/LiangSu8899/FlashRT

u/Diligent-End-2711 — 4 days ago