u/44seconds

Elastic Attention Cores for Scalable Vision Transformers [R]

Wanted to share our latest paper on an alternative building block for Vision Transformers.

Illustration of our model's accuracy and dense features

Traditional ViTs use dense O(N^2) self-attention, which becomes quite costly at higher resolutions. In this work, we propose an alternative backbone with a core-periphery block-sparse attention structure whose cost scales as O(2NC + C^2) for C core tokens.
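The post doesn't include the exact formulation, but the core-periphery idea can be sketched roughly as follows: a query/key pair is only allowed when at least one of the two tokens is a core token, so periphery-periphery attention is masked out (function name and details are my own illustration, not the paper's code):

```python
import numpy as np

def core_periphery_attention(q, k, v, core_idx):
    """Block-sparse attention sketch: entry (i, j) is allowed only if
    token i or token j is a core token, so periphery-periphery pairs
    are masked out. The number of unmasked entries grows like O(N*C)
    instead of O(N^2)."""
    N, d = q.shape
    # Dense scores for clarity; a real implementation would only
    # compute the core rows/columns (the sparse blocks).
    scores = q @ k.T / np.sqrt(d)
    is_core = np.zeros(N, dtype=bool)
    is_core[core_idx] = True
    allowed = is_core[:, None] | is_core[None, :]
    scores = np.where(allowed, scores, -np.inf)
    # Softmax over the allowed entries only.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Every periphery row still has C valid keys (the cores), so the softmax stays well defined as long as C >= 1.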

We further train this with nested dropout, which enables elastic test-time adjustment of the inference cost. The full model achieves very competitive dense and classification accuracy compared with DINOv3, and is stable across resolutions (256 all the way to 1024).
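For readers unfamiliar with nested dropout: during training you sample a cutoff and drop every unit past it, so each prefix learns to work on its own. A minimal sketch of applying that to an ordered set of core tokens (my own illustration, not the paper's training code):

```python
import numpy as np

def nested_core_dropout(core_tokens, rng):
    """Nested dropout over ordered core tokens: sample a cutoff k and
    zero out every core token after index k. Because each prefix
    [0..k) is trained to function on its own, the number of active
    cores can be chosen freely at test time (elastic inference)."""
    C = core_tokens.shape[0]
    k = int(rng.integers(1, C + 1))  # keep a random prefix of length k
    dropped = core_tokens.copy()
    dropped[k:] = 0.0
    return dropped, k
```

At inference you simply truncate to the first k core tokens, trading accuracy for compute without retraining.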

Interestingly, the core-dense attention patterns exhibit strong emergent behavior: in early layers of the network the attention maps are isotropic (roughly spherical), but they become increasingly semantically aligned deeper into the network.

Visual Elastic Core Attention paper abstract

The number of core tokens also shapes the attention patterns: with fewer cores, the patterns become more diffuse and cover a spatially larger region; with more cores, they become smaller and more concentrated.

Paper: https://arxiv.org/abs/2605.12491

Project page with code (still in progress): https://github.com/alansong1322/VECA

Happy to answer any questions about our research.
