Research Paper | Researchia:202601.29198 | [Statistics & ML > Statistics]

Why Adam Works Better with $\beta_1 = \beta_2$: The Missing Gradient Scale Invariance Principle

Alberto Fernández-Hernández

Abstract

Adam has been at the core of large-scale training for almost a decade, yet a simple empirical fact remains unaccounted for: both validation scores and the qualitative behaviour of training runs improve when the momentum parameters satisfy $\beta_1 = \beta_2$. Some recent studies have reported this pattern, but there is still no explanation for why this choice helps. We show that it is closely tied to a structural property that we call gradient scale invariance. We formalize this notion and prove that Adam becomes gradient scale invariant to first order if and only if $\beta_1 = \beta_2$. This perspective places the balanced regime of Adam in direct alignment with the design principles underlying several recent optimizers that explicitly enforce scale-robust updates. The theory is supported by experiments across vision and language tasks, and across different architectural families, in which rescaling the gradient has a markedly smoother effect on the update when $\beta_1 = \beta_2$. Overall, our results offer a coherent explanation for an open question in the behavior of Adam and provide a simple principle that helps guide the design of future optimizers.
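The first-order claim can be checked numerically. This is a sketch, not the paper's formal proof: after Adam's moment estimates reach steady state under a constant gradient (so $m \approx g$, $v \approx g^2$), the derivative of the update with respect to a rescaling of the current gradient works out to $\beta_2 - \beta_1$, which vanishes exactly when the betas are equal. The hyperparameter values and the steady-state setup below are illustrative assumptions.

```python
def adam_update(c, beta1, beta2, steps=50_000, g=1.0, eps=1e-12):
    """Run Adam's moment recursions on a constant gradient g until they
    reach steady state, then apply one more step whose gradient is
    rescaled by c; return that final update direction m / (sqrt(v) + eps)."""
    m = v = 0.0
    for _ in range(steps):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
    gc = c * g  # rescale only the current gradient
    m = beta1 * m + (1 - beta1) * gc
    v = beta2 * v + (1 - beta2) * gc * gc
    return m / (v ** 0.5 + eps)

def scale_sensitivity(beta1, beta2, delta=1e-4):
    """Central-difference estimate of d(update)/dc at c = 1."""
    return (adam_update(1 + delta, beta1, beta2)
            - adam_update(1 - delta, beta1, beta2)) / (2 * delta)

balanced = scale_sensitivity(0.95, 0.95)   # beta1 == beta2
default = scale_sensitivity(0.9, 0.999)    # common Adam defaults
print(f"balanced: {balanced:.6f}")  # ~0: first-order scale invariant
print(f"default : {default:.6f}")   # ~beta2 - beta1 = 0.099
```

In this idealized setting the balanced run is insensitive to first order, while the common default betas leave a residual sensitivity of about $\beta_2 - \beta_1$, consistent with the abstract's claim that rescaling the gradient acts more smoothly when $\beta_1 = \beta_2$.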


Source: arXiv:2601.21739v1 (http://arxiv.org/abs/2601.21739v1)
PDF: https://arxiv.org/pdf/2601.21739v1

Submission: 1/29/2026
Subjects: Statistics; Statistics & ML
