ExplorerMathematicsMathematics
Research PaperResearchia:202603.04032

Adam Converges Without Any Modification On Update Rules

Yushun Zhang

Abstract

Adam is the default algorithm for training neural networks, including large language models (LLMs). However, \citet{reddi2019convergence} provided an example that Adam diverges, raising concerns for its deployment in AI model training. We identify a key mismatch between the divergence example and practice: \citet{reddi2019convergence} pick the problem after picking the hyperparameters of Adam, i.e., $(β_1,β_2)$; while practical applications often fix the problem first and then tune $(β_1,β_2)$. ...

Submitted: March 4, 2026Subjects: Mathematics; Mathematics

Description / Details

Adam is the default algorithm for training neural networks, including large language models (LLMs). However, \citet{reddi2019convergence} provided an example that Adam diverges, raising concerns for its deployment in AI model training. We identify a key mismatch between the divergence example and practice: \citet{reddi2019convergence} pick the problem after picking the hyperparameters of Adam, i.e., (β1,β2)(β_1,β_2); while practical applications often fix the problem first and then tune (β1,β2)(β_1,β_2). In this work, we prove that Adam converges with proper problem-dependent hyperparameters. First, we prove that Adam converges when β2β_2 is large and β1<β2β_1 < \sqrt{β_2}. Second, when β2β_2 is small, we point out a region of (β1,β2)(β_1,β_2) combinations where Adam can diverge to infinity. Our results indicate a phase transition for Adam from divergence to convergence when changing the (β1,β2)(β_1, β_2) combination. To our knowledge, this is the first phase transition in (β1,β2)(β_1,β_2) 2D-plane reported in the literature, providing rigorous theoretical guarantees for Adam optimizer. We further point out that the critical boundary (β1,β2)(β_1^*, β_2^*) is problem-dependent, and particularly, dependent on batch size. This provides suggestions on how to tune β1β_1 and β2β_2: when Adam does not work well, we suggest tuning up β2β_2 inversely with batch size to surpass the threshold β2β_2^*, and then trying β1<β2β_1< \sqrt{β_2}. Our suggestions are supported by reports from several empirical studies, which observe improved LLM training performance when applying them.


Source: arXiv:2603.02092v1 - http://arxiv.org/abs/2603.02092v1 PDF: https://arxiv.org/pdf/2603.02092v1 Original Link: http://arxiv.org/abs/2603.02092v1

Please sign in to join the discussion.

No comments yet. Be the first to share your thoughts!

Access Paper
View Source PDF
Submission Info
Date:
Mar 4, 2026
Topic:
Mathematics
Area:
Mathematics
Comments:
0
Bookmark