Research Paper, Researchia:202606.23028

Open Problem: Is AdamW Effective Under Heavy-Tailed Noise?

Dingzhi Yu

Submitted: June 23, 2026
Subjects: Mathematics

Description / Details

AdamW is the de facto optimizer for training large language models (LLMs), yet the theory behind it still lives mostly in finite-variance regimes. This is increasingly unsatisfying, as empirical evidence indicates that stochastic gradient noise in LLM pretraining is typically heavy-tailed. Recent work shows that sign-based optimizers such as Lion and Muon achieve sharp heavy-tailed rates, and that AdaGrad can also converge under heavy-tailed noise. However, no rigorous convergence theory for AdamW has yet been established in this regime. Can AdamW converge under the same heavy-tailed assumptions, or does its second-moment accumulator create a genuine obstruction? We formulate this as an open problem, prove a positive weighted-metric benchmark, and give a corridor lower-bound mechanism showing how denominator memory can hide large gradients.
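The phrase "denominator memory" can be made concrete with a toy calculation. The sketch below is not the paper's corridor construction; it only tracks the standard bias-corrected Adam ratio m_hat / (sqrt(v_hat) + eps) for a scalar gradient stream with a single large spike. The hyperparameters (beta1 = 0.9, beta2 = 0.999, eps = 1e-8), the spike size, and the function name `normalized_updates` are illustrative assumptions, and the decoupled weight-decay term is omitted because it does not involve the denominator.

```python
import math


def normalized_updates(grads, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the bias-corrected Adam ratio m_hat / (sqrt(v_hat) + eps) per step.

    Hypothetical helper for illustration only: scalar gradients, standard
    Adam moment recursions, no learning rate and no weight decay.
    """
    m = 0.0  # first-moment (momentum) accumulator
    v = 0.0  # second-moment accumulator, i.e. the "denominator memory"
    ratios = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)  # bias correction
        v_hat = v / (1.0 - beta2 ** t)
        ratios.append(m_hat / (math.sqrt(v_hat) + eps))
    return ratios


# Assumed toy stream: 50 unit gradients, one spike of size 1000, 50 more unit gradients.
grads = [1.0] * 50 + [1000.0] + [1.0] * 50
steps = normalized_updates(grads)

print("first step:          ", steps[0])
print("step at the spike:   ", steps[50])
print("50 steps after spike:", steps[-1])
```

In this toy run the normalized step at the spike is smaller than the step for a unit gradient, because the spike enters the denominator at the same moment it enters the numerator. The steps for later unit gradients stay far below their pre-spike size, since with beta2 = 0.999 the accumulator forgets the spike only slowly. Whether this effect amounts to a genuine obstruction to convergence is exactly the open question posed above.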


Source: arXiv:2606.23676v1, http://arxiv.org/abs/2606.23676v1
PDF: https://arxiv.org/pdf/2606.23676v1
