Research Paper | Researchia:202603.16026

Influence Malleability in Linearized Attention: Dual Implications of Non-Convergent NTK Dynamics

Jose Marie Antonio Miñoza

Submitted: March 16, 2026
Subjects: Statistics; Data Science

Description / Details

Understanding the theoretical foundations of attention mechanisms remains challenging due to their complex, non-linear dynamics. This work reveals a fundamental trade-off in the learning dynamics of linearized attention. Using a linearized attention mechanism with exact correspondence to a data-dependent Gram-induced kernel, both empirical and theoretical analysis through the Neural Tangent Kernel (NTK) framework shows that linearized attention does not converge to its infinite-width NTK limit, even at large widths. A spectral amplification result establishes this formally: the attention transformation cubes the Gram matrix's condition number, requiring width $m = \Omega(\kappa^6)$ for convergence, a threshold that exceeds any practical width for natural image datasets. This non-convergence is characterized through influence malleability, the capacity to dynamically alter reliance on training examples. Attention exhibits 6--9$\times$ higher malleability than ReLU networks, with dual implications: its data-dependent kernel can reduce approximation error by aligning with task structure, but this same sensitivity increases susceptibility to adversarial manipulation of training data. These findings suggest that attention's power and vulnerability share a common origin in its departure from the kernel regime.
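
The spectral amplification claim can be illustrated on a toy example. The sketch below is not the paper's construction: it assumes, purely for illustration, that the linearized attention features take the form A = (X X^T) X, so that their Gram matrix is G^3 with G = X X^T. The data `X` and the helper names `matmul` and `cond_sym_2x2` are hypothetical. For a symmetric positive-definite G, the condition number of G^3 is the cube of that of G, and the stated width requirement then scales with the sixth power of the original condition number (the constants hidden by the Omega notation are not modeled here).

```python
import math

def matmul(a, b):
    """Multiply two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def cond_sym_2x2(g):
    """Condition number of a symmetric positive-definite 2x2 matrix,
    using the closed-form eigenvalues of [[a, b], [b, d]]."""
    a, b, d = g[0][0], g[0][1], g[1][1]
    mean = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    return (mean + radius) / (mean - radius)

# Hypothetical toy data: two unit-norm training points in R^2 (rows of X).
X = [[1.0, 0.0],
     [0.8, 0.6]]
Xt = [list(col) for col in zip(*X)]

G = matmul(X, Xt)                # Gram matrix G = X X^T

# Assumed form of the linearized attention features: A = G X,
# which gives A A^T = G (X X^T) G = G^3.
A = matmul(G, X)
At = [list(col) for col in zip(*A)]
K_attn = matmul(A, At)

kappa = cond_sym_2x2(G)          # condition number of the Gram matrix
kappa_attn = cond_sym_2x2(K_attn)  # condition number after the attention map
width_scale = kappa ** 6         # the kappa^6 factor in m = Omega(kappa^6)

print(kappa, kappa_attn, width_scale)
```

In this toy case the Gram matrix has condition number close to 9, the transformed kernel close to 9^3 = 729, and the width factor is close to 9^6 = 531441. With a condition number of 100, the same factor would already be 10^12, which gives a sense of why such a threshold can exceed practical widths.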


Source: arXiv:2603.13085v1 - http://arxiv.org/abs/2603.13085v1
PDF: https://arxiv.org/pdf/2603.13085v1
