u/AppropriateGuava6262


SenseNova-U1 Technical Report: VAE-free Pixel-level Flow Matching with 32x Compression

If you've worked with SD or FLUX, you've probably been frustrated by the detail loss and blurred text that VAEs introduce. SenseNova-U1 ditches VAEs and visual encoders entirely. SenseTime recently released a technical report on the model, so let's dissect its core methodology.

The Methodology:

  1. VAE-Free Visual Interface: Uses a 2-layer conv (32x compression) to encode images, with an MLP head predicting pixels directly. Features Dynamic Noise Scale (DNS) to keep SNR consistent from 512px to 2048px.

  2. Native MoT (Mixture-of-Transformers): A unified backbone where Understanding and Generation streams share Self-Attention but use decoupled FFN/Norm layers, routed dynamically by token type.

  3. Joint Training & Deployment: Optimized via combined Auto-regressive and Flow Matching losses. Uses a 6-stage training pipeline (Warm-up → SFT → 8-step Distillation). Deployed via LightLLM/LightX2V for independent parallel scheduling.
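The Dynamic Noise Scale in step 1 addresses a known issue: at higher resolutions, the same nominal noise level yields a higher effective per-token SNR, so the noise schedule must adapt. The report's exact formula isn't quoted in this post; below is a minimal sketch using the resolution-dependent timestep shift popularized by rectified-flow models like SD3, purely as an illustration of the idea.

```python
def dns_scale(t: float, res: int, base_res: int = 512) -> float:
    """Illustrative Dynamic Noise Scale: shift the flow-matching time t
    so the effective SNR at resolution `res` matches the base resolution.
    This uses the common resolution-dependent timestep shift; the
    report's actual DNS formula may differ.
    """
    s = res / base_res              # linear scale factor, e.g. 2048/512 = 4
    # shifted time stays in [0, 1] and equals t when res == base_res
    return s * t / (1 + (s - 1) * t)
```

At the base resolution the shift is the identity; at 2048px (s = 4) a mid-schedule t = 0.5 is pushed to 0.8, i.e. more noise is applied to compensate for the higher-resolution signal.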
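The MoT routing in step 2 can be sketched as follows: one attention over the mixed sequence, with per-token dispatch to stream-specific norm/FFN parameters. All layer sizes and names here are illustrative assumptions, not taken from the report.

```python
import torch
import torch.nn as nn

class MoTBlock(nn.Module):
    """Sketch of a Mixture-of-Transformers block: understanding and
    generation tokens share one self-attention, but each stream has its
    own LayerNorm/FFN, selected per token by a token-type mask.
    (Dimensions and structure are illustrative, not from the report.)
    """
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # decoupled per-stream parameters: 0 = understanding, 1 = generation
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(2))
        self.ffns = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                          nn.Linear(4 * dim, dim))
            for _ in range(2))

    def forward(self, x: torch.Tensor, token_type: torch.Tensor) -> torch.Tensor:
        # shared self-attention over the full mixed sequence
        h, _ = self.attn(x, x, x)
        x = x + h
        # route each token to its stream's decoupled Norm + FFN
        out = torch.zeros_like(x)
        for i in range(2):
            mask = token_type == i
            out[mask] = self.ffns[i](self.norms[i](x[mask]))
        return x + out
```

The key property is that attention still mixes information across both streams, while the FFN capacity stays specialized per modality.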
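The combined objective in step 3 pairs a standard cross-entropy loss on text tokens with a flow-matching regression loss on image tokens. A minimal sketch, assuming the standard rectified-flow velocity target (noise − x0) and a hypothetical weighting term `lam` (the report's actual weighting is not quoted here):

```python
import torch
import torch.nn.functional as F

def joint_loss(text_logits, text_targets, v_pred, x0, noise, lam=1.0):
    """Sketch of the joint objective: auto-regressive cross-entropy on
    text tokens plus a flow-matching MSE on image tokens.
    Assumes the rectified-flow velocity target v = noise - x0 along the
    path x_t = (1 - t) * x0 + t * noise; `lam` is a hypothetical weight.
    """
    # auto-regressive loss: flatten (batch, seq, vocab) -> (batch*seq, vocab)
    ar = F.cross_entropy(text_logits.flatten(0, 1), text_targets.flatten())
    # flow-matching loss: regress the predicted velocity onto its target
    fm = F.mse_loss(v_pred, noise - x0)
    return ar + lam * fm
```

Training both losses through the same backbone is what lets one model serve understanding (AR decoding) and generation (flow sampling) without separate networks.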

Variants:

  - 8B-MoT: Dense 8B dual-stream model.
  - A3B-MoT: MoE version (30B total parameters, 3B active).

SenseNova-U1 demonstrates that pixel-level native unification is feasible without a VAE. Its ability to recover fine detail at a 32x compression ratio may become the standard paradigm for next-generation vision models.

Discord: https://discord.com/invite/BuTXPHmQub

Technical Report: https://github.com/OpenSenseNova/SenseNova-U1/blob/main/docs/pdf/SenseNOVA_U1.pdf
