Research Paper
Researchia:202607.02048

The State-Prediction Separation Hypothesis

Giovanni Monea


Submitted: July 2, 2026
Subjects: AI; Artificial Intelligence

Description / Details

Transformers use the same forward computation stream both to predict the next token and to store useful state for future token predictions. We formulate the state-prediction separation hypothesis: disentangling the two roles yields better language modeling performance. We design a Transformer variant that uses two computation streams to separate the two functions, and conduct pretraining experiments across various scales. Our experiments show that state-prediction separation consistently offers better data and compute efficiency, improving validation loss and outperforming standard Transformers by 2 to 3 percentage points on average on downstream tasks. We also conduct extensive empirical analysis that rules out potential confounders and demonstrates the fundamental difference in the gradients our design entails.


Source: arXiv:2607.01218v1 - http://arxiv.org/abs/2607.01218v1
PDF: https://arxiv.org/pdf/2607.01218v1
