Research Paper (Researchia:202606.03013)

q0: Primitives for Hyper-Epoch Pretraining

Bishwas Mandal


Submitted: June 3, 2026
Subjects: AI; Artificial Intelligence

Description / Details

Multi-epoch training is becoming the standard now that compute is growing faster than the supply of high-quality text. But pretraining a single model saturates within a few passes, long before the compute budget is exhausted. We argue this calls for a conceptual shift from training a single model toward exploring a population of models and aggregating their predictions. We introduce hyper-epoch pretraining (q0), which turns a multi-epoch budget into a population of diverse models whose combined predictions reach a lower validation loss than a single refined model. q0 reduces to three core primitives. A cyclic schedule with anti-correlated learning rate and weight decay collects diverse models from a few parallel trajectories. Chain distillation trains each model against its predecessor so that model quality compounds across the population. A learned prior, fit on a held-out set, selects and weights members for any inference budget. On a 1.8B-parameter model trained on 100M FineWeb tokens, q0 matches a strong 256-epoch ensemble baseline using only ~56 epochs (~4.6× fewer), or ~67 epochs (~3.8× fewer) when matched to the baseline's ensemble size, and continues to improve beyond it. These gains reach a cumulative ~12.9× data efficiency under the Slowrun setting and transfer to downstream benchmarks. Crucially, the optimal allocation shifts with the budget, so we give prescriptive recipes for how to spend a given epoch budget to maximize generalization, from a single epoch up to the largest budgets.
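
To make two of the primitives concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not give the schedule shape or the aggregation rule, so the cosine cycle, the probability-mixture ensemble, and the names cyclic_lr_wd and mixture_nll are all assumptions chosen for illustration. Chain distillation and the fitting of the learned prior are not shown.

```python
import math


def cyclic_lr_wd(step, cycle_len, lr_min, lr_max, wd_min, wd_max):
    """Return (learning_rate, weight_decay) for a training step.

    Assumption: a cosine cycle that restarts every `cycle_len` steps.
    Within a cycle the learning rate falls from lr_max toward lr_min
    while weight decay rises from wd_min toward wd_max, so the two
    are anti-correlated.
    """
    phase = (step % cycle_len) / cycle_len          # position in cycle, [0, 1)
    high = 0.5 * (1.0 + math.cos(math.pi * phase))  # 1 at cycle start, near 0 at end
    lr = lr_min + (lr_max - lr_min) * high
    wd = wd_max - (wd_max - wd_min) * high
    return lr, wd


def mixture_nll(member_probs, weights, targets):
    """Average negative log-likelihood of a weighted ensemble.

    member_probs[m][i][c] is member m's probability of class c on example i.
    weights[m] is a non-negative member weight (hypothetically the output of
    a learned prior); weights are normalized to sum to 1 here.
    Assumption: members are combined by averaging probabilities.
    """
    total = sum(weights)
    norm = [w / total for w in weights]
    nll = 0.0
    for i, target in enumerate(targets):
        p = sum(w * probs[i][target] for w, probs in zip(norm, member_probs))
        nll -= math.log(p)
    return nll / len(targets)


# Toy usage with made-up numbers (not results from the paper).
member_a = [[0.9, 0.1], [0.2, 0.8]]
member_b = [[0.6, 0.4], [0.7, 0.3]]
toy_targets = [0, 1]
toy_loss = mixture_nll([member_a, member_b], [1.0, 1.0], toy_targets)
```

In a training loop under these assumptions, a snapshot would be saved at the end of each cycle, where the learning rate is lowest, and the saved members would later be scored with mixture_nll on a held-out set. As a quick check of the reported ratios, 256/56 ≈ 4.6 and 256/67 ≈ 3.8, which agrees with the figures in the abstract.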


Source: arXiv:2606.03938v1 (http://arxiv.org/abs/2606.03938v1)
PDF: https://arxiv.org/pdf/2606.03938v1
