A Hierarchical Language Model with Predictable Scaling Laws and Provable Benefits of Reasoning
Abstract
We introduce a family of synthetic languages with hierarchical structure -- generated by a broadcast process on trees -- for which the role of context length and reasoning in autoregressive generation can be analyzed precisely. At the heart of our analytic approach is an \emph{exact $k$-gram ansatz} in place of transformers with context length $k$, a substitution we then validate empirically. Using this ansatz we derive explicit asymptotic predictions for distributional statistics of the sequences produced by a trained model, instantiated in two settings. For the \emph{Ising broadcast process} (a soft-constrained language), we prove that the variance of the generated sum scales log-linearly in the context depth and its kurtosis converges to that of a Gaussian -- both deviating from the true language for any sublinear context. For the \emph{coloring broadcast process} (a hard-constrained language) in the freezing regime, bounded-context autoregression produces sequences that, with high probability, are inconsistent with \emph{any} valid coloring of the underlying tree. Together these results imply an $\Omega(n)$ lower bound on the context length required to faithfully sample length-$n$ sequences. In contrast, we prove that an autoregressive \emph{reasoning} model with only $O(\log n)$ working memory can sample exactly from the true language -- an exponential improvement. We confirm both the lower-bound predictions and the reasoning-based upper bound empirically with transformers trained on the synthetic language; the trained models track our asymptotic predictions quantitatively across a wide range of context sizes.
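The Ising broadcast process at the heart of the soft-constrained language can be sketched in a few lines. This is a minimal illustration under common assumptions (a complete binary tree, each child independently flipping its parent's ±1 spin with probability eps, leaves read left to right); the function and parameter names are ours, not the paper's.

```python
import random

def ising_broadcast(depth, eps=0.1, rng=random):
    """Broadcast a root spin down a complete binary tree of the given
    depth: each child copies its parent's spin, flipping it with
    probability eps. Returns the leaf spins (+1/-1), left to right."""
    spins = [rng.choice([-1, 1])]  # root spin, uniform
    for _ in range(depth):
        nxt = []
        for s in spins:
            for _ in range(2):  # two children per node
                nxt.append(-s if rng.random() < eps else s)
        spins = nxt
    return spins

# A depth-10 tree yields a length-1024 spin sequence; its sum is the
# statistic whose variance and kurtosis the paper analyzes.
leaves = ising_broadcast(depth=10, eps=0.1, rng=random.Random(0))
print(len(leaves), sum(leaves))
```

Correlations between leaves decay with their tree distance, so a model restricted to a bounded context window sees only nearby leaves and, per the abstract's argument, cannot reproduce the long-range statistics of the sum.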
Source: arXiv:2605.13687v1 -- http://arxiv.org/abs/2605.13687v1 (PDF: https://arxiv.org/pdf/2605.13687v1)
May 14, 2026
Data Science
Statistics