Research Paper (Researchia:202605.12055)

Quantifying Concentration Phenomena of Mean-Field Transformers in the Low-Temperature Regime

Albert Alcalde

Submitted: May 12, 2026
Subjects: Machine Learning; Data Science

Description / Details

Transformers with self-attention modules as their core components have become an integral architecture in modern large language and foundation models. In this paper, we study the evolution of tokens in deep encoder-only transformers at inference time which is described in the large-token limit by a mean-field continuity equation. Leveraging ideas from the convergence analysis of interacting multi-particle systems, with particles corresponding to tokens, we prove that the token distribution rapidly concentrates onto the push-forward of the initial distribution under a projection map induced by the key, query, and value matrices, and remains metastable for moderate times. Specifically, we show that the Wasserstein distance of the two distributions scales like $\sqrt{\log(\beta+1)/\beta}\,\exp(Ct)+\exp(-ct)$ in terms of the temperature parameter $\beta^{-1}\to 0$ and inference time $t\geq 0$. For the proof, we establish Lyapunov-type estimates for the zero-temperature equation, identify its limit as $t\to\infty$, and employ a stability estimate in Wasserstein space together with a quantitative Laplace principle to couple the two equations. Our result implies that for time scales of order $\log\beta$ the token distribution concentrates at the identified limiting distribution. Numerical experiments confirm this and, beyond that, complement our theory by showing that for finite $\beta$ and large $t$ the dynamics enter a different terminal phase, dominated by the spectrum of the value matrix.
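To see why the bound above singles out time scales of order $\log\beta$, it may help to evaluate it numerically. The sketch below is illustrative only: the abstract does not give the constants $C$ and $c$, so the values `C=1.0` and `c=1.0` are placeholder assumptions, and the function names `bound` and `best_time` are hypothetical. Writing $a=\sqrt{\log(\beta+1)/\beta}$, setting the derivative of $a\,e^{Ct}+e^{-ct}$ to zero gives the minimizer $t^\ast=\log\bigl(c/(aC)\bigr)/(C+c)$, which grows like $\log\beta/(2(C+c))$ for large $\beta$.

```python
import math

def bound(beta, t, C=1.0, c=1.0):
    """Evaluate sqrt(log(beta+1)/beta) * exp(C t) + exp(-c t).

    C and c are placeholder constants (assumed, not taken from the paper).
    """
    a = math.sqrt(math.log(beta + 1.0) / beta)
    return a * math.exp(C * t) + math.exp(-c * t)

def best_time(beta, C=1.0, c=1.0):
    """Time t >= 0 minimizing the bound, from the first-order condition."""
    a = math.sqrt(math.log(beta + 1.0) / beta)
    t_star = math.log(c / (a * C)) / (C + c)
    return max(0.0, t_star)  # the bound is only stated for t >= 0

if __name__ == "__main__":
    for beta in (1e2, 1e4, 1e6, 1e8):
        t_star = best_time(beta)
        print(f"beta={beta:.0e}  t*={t_star:.3f}  "
              f"t*/log(beta)={t_star / math.log(beta):.3f}  "
              f"bound(t*)={bound(beta, t_star):.4f}")
```

Under these placeholder constants the minimized bound shrinks as $\beta$ grows while the minimizing time increases roughly in proportion to $\log\beta$, which is consistent with the time scale stated in the abstract. For $t$ much larger than $t^\ast$ the first term grows exponentially and the bound stops being informative.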


Source: arXiv:2605.10931v1 (http://arxiv.org/abs/2605.10931v1)
PDF: https://arxiv.org/pdf/2605.10931v1
