How Width and Data Shape Generalization Scaling Laws in Quadratic Neural Networks
Abstract
Understanding how performance scales jointly with model size and data is a central problem in modern machine learning. Existing theoretical works on scaling laws typically describe generalization as a function of data or compute, often in fixed-feature or infinite-width regimes and for online SGD. Here, we instead study how generalization scales with the number of trainable parameters and the number of samples in a feature-learning model. We analyze $\ell_2$-regularized empirical risk mini...
Description / Details
Understanding how performance scales jointly with model size and data is a central problem in modern machine learning. Existing theoretical works on scaling laws typically describe generalization as a function of data or compute, often in fixed-feature or infinite-width regimes and for online SGD. Here, we instead study how generalization scales with the number of trainable parameters and the number of samples in a feature-learning model. We analyze $\ell_2$-regularized empirical risk minimization in a quadratic two-layer network in a finite-sample setting with structured data. This setting allows for an explicit characterization of the generalization error as a function of the number of samples, model width, and regularization strength. Our results reveal a phase diagram with distinct scaling regimes as the number of parameters varies. In particular, the generalization error follows data-dependent power laws controlled by the spectral structure of the target. We further characterize the transitions between regimes, including the onset of interpolation, and their impact on generalization.
Source: arXiv:2606.28242v1 - http://arxiv.org/abs/2606.28242v1 PDF: https://arxiv.org/pdf/2606.28242v1
Jun 29, 2026
Data Science
Statistics