The State-Prediction Separation Hypothesis
Abstract
Transformers use the same forward computation stream both to predict the next token and to store state that is useful for future token predictions. We formulate the state-prediction separation hypothesis: disentangling the two roles yields better language modeling performance. We design a Transformer variant that uses two computation streams to separate the two functions, and conduct pretraining experiments across various scales. Our experiments show that state-prediction separation consistently offers better data and compute efficiency, improving validation loss and outperforming standard Transformers by 2 to 3 percentage points on average on downstream tasks. We also conduct extensive empirical analysis that rules out potential confounders and demonstrates the fundamental difference in the gradients our design entails.
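The abstract does not say how the two streams are wired, so the following is only a rough illustration of what separating state from prediction could look like, not the paper's architecture. It assumes a single attention head, no MLP or normalization, a state stream that attends only to itself, and a prediction stream that takes queries from itself and keys and values from the state stream. All names (`two_stream_layer`, `init_params`, the `sq`/`pk`-style weight keys) are hypothetical.

```python
import numpy as np

def causal_attention(q, k, v):
    """Single-head causal attention; q, k, v have shape (T, d)."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -np.inf, scores)          # hide future positions
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def two_stream_layer(state, pred, p):
    """One hypothetical layer: the state stream never reads the prediction stream."""
    # State stream (assumption): ordinary causal self-attention over itself.
    new_state = state + causal_attention(state @ p["sq"], state @ p["sk"], state @ p["sv"])
    # Prediction stream (assumption): queries from pred, keys/values from state.
    new_pred = pred + causal_attention(pred @ p["pq"], state @ p["pk"], state @ p["pv"])
    return new_state, new_pred

def init_params(rng, d, n_layers):
    """Random weights for illustration only."""
    names = ["sq", "sk", "sv", "pq", "pk", "pv"]
    return [{n: rng.normal(scale=d ** -0.5, size=(d, d)) for n in names}
            for _ in range(n_layers)]

def forward(x, layers, w_out):
    """x: (T, d) token embeddings. Returns the final state and next-token logits."""
    state, pred = x, x  # both streams start from the embeddings (assumption)
    for p in layers:
        state, pred = two_stream_layer(state, pred, p)
    return state, pred @ w_out  # only the prediction stream feeds the output head
```

In this sketch, changing any prediction-stream weight alters the logits but leaves the state stream untouched, which is one simple way to read "separation". In a standard Transformer the same residual stream serves both roles, so no such split exists.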
Source: arXiv:2607.01218v1 (http://arxiv.org/abs/2607.01218v1)
PDF: https://arxiv.org/pdf/2607.01218v1
Jul 2, 2026
Artificial Intelligence