u/AppropriateGuava6262


SenseNova-U1 Technical Report: VAE-free Pixel-level Flow Matching with 32x Compression

If you've worked with SD or FLUX, you've probably been frustrated by the detail loss and blurred text that VAEs introduce. SenseNova-U1 ditches VAEs and visual encoders entirely. SenseTime recently released a technical report on the model, so let's dissect its core methodology.

The Methodology:

  1. VAE-Free Visual Interface: Uses a 2-layer conv (32x compression) to encode images, with an MLP head predicting pixels directly. Features Dynamic Noise Scale (DNS) to keep SNR consistent from 512px to 2048px.

  2. Native MoT (Mixture-of-Transformers): A unified backbone where Understanding and Generation streams share Self-Attention but use decoupled FFN/Norm layers, routed dynamically by token type.

  3. Joint Training & Deployment: Optimized via combined Auto-regressive and Flow Matching losses. Uses a 6-stage training pipeline (Warm-up → SFT → 8-step Distillation). Deployed via LightLLM/LightX2V for independent parallel scheduling.
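The Dynamic Noise Scale in step 1 addresses a known issue: at higher resolutions, the same nominal noise level yields a higher effective per-token SNR, so the noise schedule must adapt. The report's exact formula isn't quoted in this post; below is a minimal sketch using the resolution-dependent timestep shift popularized by rectified-flow models like SD3, purely as an illustration of the idea.

```python
def dns_scale(t: float, res: int, base_res: int = 512) -> float:
    """Illustrative Dynamic Noise Scale: shift the flow-matching time t
    so the effective SNR at resolution `res` matches the base resolution.
    This uses the common resolution-dependent timestep shift; the
    report's actual DNS formula may differ.
    """
    s = res / base_res              # linear scale factor, e.g. 2048/512 = 4
    # shifted time stays in [0, 1] and equals t when res == base_res
    return s * t / (1 + (s - 1) * t)
```

At the base resolution the shift is the identity; at 2048px (s = 4) a mid-schedule t = 0.5 is pushed to 0.8, i.e. more noise is applied to compensate for the higher-resolution signal.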
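The MoT routing in step 2 can be sketched as follows: one attention over the mixed sequence, with per-token dispatch to stream-specific norm/FFN parameters. All layer sizes and names here are illustrative assumptions, not taken from the report.

```python
import torch
import torch.nn as nn

class MoTBlock(nn.Module):
    """Sketch of a Mixture-of-Transformers block: understanding and
    generation tokens share one self-attention, but each stream has its
    own LayerNorm/FFN, selected per token by a token-type mask.
    (Dimensions and structure are illustrative, not from the report.)
    """
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # decoupled per-stream parameters: 0 = understanding, 1 = generation
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(2))
        self.ffns = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                          nn.Linear(4 * dim, dim))
            for _ in range(2))

    def forward(self, x: torch.Tensor, token_type: torch.Tensor) -> torch.Tensor:
        # shared self-attention over the full mixed sequence
        h, _ = self.attn(x, x, x)
        x = x + h
        # route each token to its stream's decoupled Norm + FFN
        out = torch.zeros_like(x)
        for i in range(2):
            mask = token_type == i
            out[mask] = self.ffns[i](self.norms[i](x[mask]))
        return x + out
```

The key property is that attention still mixes information across both streams, while the FFN capacity stays specialized per modality.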
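The combined objective in step 3 pairs a standard cross-entropy loss on text tokens with a flow-matching regression loss on image tokens. A minimal sketch, assuming the standard rectified-flow velocity target (noise − x0) and a hypothetical weighting term `lam` (the report's actual weighting is not quoted here):

```python
import torch
import torch.nn.functional as F

def joint_loss(text_logits, text_targets, v_pred, x0, noise, lam=1.0):
    """Sketch of the joint objective: auto-regressive cross-entropy on
    text tokens plus a flow-matching MSE on image tokens.
    Assumes the rectified-flow velocity target v = noise - x0 along the
    path x_t = (1 - t) * x0 + t * noise; `lam` is a hypothetical weight.
    """
    # auto-regressive loss: flatten (batch, seq, vocab) -> (batch*seq, vocab)
    ar = F.cross_entropy(text_logits.flatten(0, 1), text_targets.flatten())
    # flow-matching loss: regress the predicted velocity onto its target
    fm = F.mse_loss(v_pred, noise - x0)
    return ar + lam * fm
```

Training both losses through the same backbone is what lets one model serve understanding (AR decoding) and generation (flow sampling) without separate networks.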

Variants:

  - 8B-MoT: Dense 8B dual-stream model.
  - A3B-MoT: MoE version (30B total parameters, 3B active).

SenseNova-U1 demonstrates that pixel-level native unification is feasible without a VAE. Its ability to recover fine detail at a 32x compression ratio may become the standard paradigm for next-generation vision models.

Discord: https://discord.com/invite/BuTXPHmQub

Technical Report: https://github.com/OpenSenseNova/SenseNova-U1/blob/main/docs/pdf/SenseNOVA_U1.pdf
