Quantifying Concentration Phenomena of Mean-Field Transformers in the Low-Temperature Regime
Abstract
Transformers, with self-attention modules as their core components, have become an integral architecture in modern large language and foundation models. In this paper, we study the evolution of tokens in deep encoder-only transformers at inference time, which is described in the large-token limit by a mean-field continuity equation. Leveraging ideas from the convergence analysis of interacting multi-particle systems, with particles corresponding to tokens, we prove that the token distribution rapidly concentrates onto the push-forward of the initial distribution under a projection map induced by the key, query, and value matrices, and remains metastable for moderate times. Specifically, we quantify how the Wasserstein distance between the two distributions scales in terms of the temperature parameter and the inference time. For the proof, we establish Lyapunov-type estimates for the zero-temperature equation, identify its long-time limit, and employ a stability estimate in Wasserstein space together with a quantitative Laplace principle to couple the two equations. Our result implies that, on time scales of an order set by the temperature parameter, the token distribution concentrates at the identified limiting distribution. Numerical experiments confirm this and, beyond that, complement our theory by showing that at finite temperature and large inference times the dynamics enter a different terminal phase, dominated by the spectrum of the value matrix.
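The display equations in the abstract did not survive extraction. For orientation, the token dynamics studied in this line of work are usually written as below; this is a sketch in the standard notation of the mean-field transformer literature, and the symbols Q, K, V (query, key, value matrices), the inverse temperature β, and the token measure μ_t are assumptions, not quotations from this paper.

```latex
% Sketch (standard form in the mean-field transformer literature;
% notation assumed, not quoted from the paper).
% Finite system of n interacting tokens x_1, ..., x_n in R^d:
\dot{x}_i(t) \;=\; \sum_{j=1}^{n}
  \frac{e^{\beta \langle Q x_i(t),\, K x_j(t)\rangle}}
       {\sum_{k=1}^{n} e^{\beta \langle Q x_i(t),\, K x_k(t)\rangle}}
  \, V x_j(t).
% Large-token (mean-field) limit: the token distribution \mu_t solves
% the continuity equation
\partial_t \mu_t + \nabla \cdot \big(\mu_t\, \mathcal{X}_\beta[\mu_t]\big) = 0,
\qquad
\mathcal{X}_\beta[\mu](x) =
  \frac{\int e^{\beta \langle Q x,\, K y\rangle}\, V y \,\mathrm{d}\mu(y)}
       {\int e^{\beta \langle Q x,\, K y\rangle}\,\mathrm{d}\mu(y)}.
% The coupling to the zero-temperature equation rests on a quantitative
% version of the classical Laplace principle:
\lim_{\beta \to \infty} \frac{1}{\beta}
  \log \int e^{\beta f(y)}\,\mathrm{d}\mu(y)
  \;=\; \operatorname*{ess\,sup}_{y \in \operatorname{supp}\mu} f(y).
```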
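To make the concentration claim concrete, here is a minimal numerical sketch of the finite-token system that the mean-field equation approximates. Every choice below (the random Q, K, V, the inverse temperature beta, the step size and horizon, and the concentration diagnostic) is an illustrative assumption, not the paper's experimental setup.

```python
# Minimal sketch of the interacting-token dynamics behind the mean-field
# equation. All choices (random Q, K, V, beta, step size, horizon) are
# illustrative assumptions, not the paper's experimental setup.
import numpy as np

rng = np.random.default_rng(0)
n, d = 256, 2                          # number of tokens, embedding dimension
beta = 50.0                            # inverse temperature (low-temperature regime)
Q = rng.standard_normal((d, d)) / np.sqrt(d)   # query matrix
K = rng.standard_normal((d, d)) / np.sqrt(d)   # key matrix
V = rng.standard_normal((d, d)) / np.sqrt(d)   # value matrix

x = rng.standard_normal((n, d))        # initial token cloud, a sample from mu_0

def attention_drift(x):
    """Softmax self-attention vector field: row-weighted average of V x_j."""
    logits = beta * (x @ Q.T) @ (x @ K.T).T      # entries beta * <Q x_i, K x_j>
    logits -= logits.max(axis=1, keepdims=True)  # shift for numerical stability
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)            # attention weights, rows sum to 1
    return w @ (x @ V.T)                         # sum_j w_ij * V x_j

def relative_spread(x):
    """Mean pairwise distance, normalized by the cloud's overall scale."""
    pd = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    return pd.mean() / (np.linalg.norm(x, axis=1).mean() + 1e-12)

dt, steps = 1e-2, 500
for s in range(steps + 1):
    if s % 100 == 0:
        print(f"t = {s * dt:5.2f}   relative spread = {relative_spread(x):.4f}")
    x = x + dt * attention_drift(x)              # explicit Euler step

# At low temperature the spread collapses quickly (tokens concentrate onto a
# few clusters) and then stays nearly constant for a while, consistent with
# the rapid-concentration-plus-metastability picture described above.
```

Running the same loop for longer horizons, or with V chosen to have a dominant eigenvalue, is one way to probe the separate terminal phase that the abstract attributes to the spectrum of the value matrix.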
Source: arXiv:2605.10931v1 - http://arxiv.org/abs/2605.10931v1 (PDF: https://arxiv.org/pdf/2605.10931v1)
May 12, 2026
Data Science
Machine Learning