
Elastic Attention Cores for Scalable Vision Transformers

Alan Z. Song

Abstract

Vision Transformers (ViTs) achieve strong data-driven scaling by leveraging all-to-all self-attention. However, this flexibility incurs a computational cost that scales quadratically with image resolution, limiting ViTs in high-resolution domains. Underlying this approach is the assumption that pairwise token interactions are necessary for learning rich visual-semantic representations. In this work, we challenge this assumption, demonstrating that effective visual representations can be learned without any direct patch-to-patch interaction. We propose VECA (Visual Elastic Core Attention), a vision transformer architecture that uses efficient linear-time core-periphery structured attention enabled by a small set of learned cores. In VECA, these cores act as a communication interface: patch tokens exchange information exclusively through the core tokens, which are initialized from scratch and propagated across layers. Because the N image patches interact directly only with a resolution-invariant set of C learned "core" embeddings, the architecture achieves linear complexity O(N) for a predetermined C, bypassing quadratic scaling. Compared to prior cross-attention architectures, VECA maintains and iteratively updates the full set of N input tokens, avoiding a narrow C-way bottleneck. Combined with nested training along the core axis, the model can elastically trade off compute and accuracy at inference time. Across classification and dense prediction tasks, VECA achieves performance competitive with the latest vision foundation models while reducing computational cost. Our results establish elastic core-periphery attention as a scalable alternative building block for Vision Transformers.

Submitted: May 13, 2026
Subjects: Machine Learning; Data Science

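The abstract describes patch tokens that communicate only through a small set of C learned core tokens, giving O(N) cost for fixed C. Below is a minimal, hypothetical PyTorch sketch of what one such core-periphery attention layer could look like, assuming a read step (cores attend to patches) followed by a write step (patches attend back to cores). The names (CoreAttentionBlock, read, write) and the exact placement of norms and MLPs are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of a VECA-style core-periphery attention layer,
# based only on the abstract above; not the authors' code.
import torch
import torch.nn as nn

class CoreAttentionBlock(nn.Module):
    """One layer of core-periphery attention.

    Patch tokens never attend to each other. Instead:
      1. cores read from patches  (cross-attention, cost ~ C * N)
      2. patches read from cores  (cross-attention, cost ~ N * C)
    For a fixed core count C, both steps are linear in the patch count N.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.read = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.write = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_cores = nn.LayerNorm(dim)
        self.norm_patches = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(self, patches, cores):
        # Step 1: cores aggregate information from all N patches.
        pooled, _ = self.read(self.norm_cores(cores), patches, patches)
        cores = cores + pooled
        # Step 2: every patch reads back from the small core set, so
        # patches exchange information only through the cores.
        broadcast, _ = self.write(self.norm_patches(patches), cores, cores)
        patches = patches + broadcast
        patches = patches + self.mlp(patches)
        return patches, cores

# Elastic inference (assumed interpretation of "nested training along the
# core axis"): a model trained with nested core subsets could be run with
# only the first k cores to trade accuracy for compute.
dim, N, C = 256, 196, 16
cores = nn.Parameter(torch.randn(1, C, dim))   # learned, resolution-invariant
patches = torch.randn(2, N, dim)               # N image patch tokens
block = CoreAttentionBlock(dim)
out_patches, out_cores = block(patches, cores.expand(2, -1, -1))
k = 8                                          # truncated core set
out_small, _ = block(patches, cores[:, :k].expand(2, -1, -1))
```

For fixed C, both cross-attention steps cost O(N * C) = O(N), consistent with the linear complexity claimed in the abstract; the final lines illustrate the elastic compute-accuracy trade-off by running the same block with a truncated core set.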


Source: arXiv:2605.12491v1 (http://arxiv.org/abs/2605.12491v1)
PDF: https://arxiv.org/pdf/2605.12491v1
