Research Paper
Researchia:202604.21002

Sessa: Selective State Space Attention

Liubomyr Horbatko

Abstract

Modern sequence models are dominated by Transformers, where self-attention mixes information from the visible context in an input-dependent way. However, when retrieval is not sharp and attention remains diffuse over an effective support $S_{\mathrm{eff}}(t)$, the influence of any individual token is diluted, typically scaling as $O(1/S_{\mathrm{eff}}(t))$ and reaching $O(1/\ell)$ for old tokens in full-prefix settings. Structured state-space models process sequences recurrently through an explicit feedback path; selective variants such as Mamba make this feedback input-dependent, yet when freeze time cannot be sustained over long intervals, their long-range sensitivity decays exponentially with lag. Existing architectures therefore either retrieve from the past in a single read or propagate information through a single feedback chain. We introduce Sessa, a decoder that places attention inside a feedback path, enabling recurrent many-path aggregation within a layer. Under stated assumptions, Sessa admits regimes with a power-law memory tail in lag $\ell$ of order $O(\ell^{-\beta})$ for $0<\beta<1$, which is asymptotically slower than $1/\ell$; moreover, this rate is tight in an explicit diffuse uniform-routing setting where the influence is $\Theta(\ell^{-\beta})$. Under the same conditions, only Sessa among the compared model classes realizes flexible selective retrieval, including non-decaying profiles. Empirically, under matched architectures and training budgets, Sessa achieves the strongest performance on our long-context benchmarks while remaining competitive with Transformer- and Mamba-style baselines on short-context language modeling.

Submitted: April 21, 2026
Subjects: AI; Artificial Intelligence
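The three memory profiles contrasted in the abstract can be illustrated numerically. The sketch below is not from the paper; the functions and constants (decay factor, $\beta$, context length) are illustrative assumptions chosen only to show why a power-law tail $\ell^{-\beta}$ with $0<\beta<1$ eventually dominates both uniform-attention dilution $1/L$ and exponential recurrent decay.

```python
def influence_uniform_attention(lag, context_len):
    # Diffuse attention over the full prefix: every token's weight is ~1/L,
    # independent of lag (the O(1/ell) full-prefix regime for L = ell).
    return 1.0 / context_len

def influence_ssm(lag, decay=0.9):
    # A single recurrent feedback chain with contraction factor < 1:
    # sensitivity to a token `lag` steps back shrinks exponentially.
    return decay ** lag

def influence_power_law(lag, beta=0.5):
    # Power-law tail of order lag^{-beta} with 0 < beta < 1, the regime the
    # abstract attributes to Sessa's diffuse uniform-routing setting.
    return lag ** -beta

# At long lags the power-law tail dominates both alternatives.
for lag in (10, 100, 1000):
    print(f"lag={lag:5d}  uniform={influence_uniform_attention(lag, 1000):.2e}"
          f"  ssm={influence_ssm(lag):.2e}"
          f"  power-law={influence_power_law(lag):.2e}")
```

At lag 1000 with a context of 1000 tokens, the uniform-attention influence is $10^{-3}$, the exponential term has vanished to roundoff, while the power-law term is still about $3\times 10^{-2}$, matching the abstract's claim that $\ell^{-\beta}$ decays asymptotically slower than $1/\ell$.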


Source: arXiv:2604.18580v1 - http://arxiv.org/abs/2604.18580v1
PDF: https://arxiv.org/pdf/2604.18580v1

