Research Paper (Researchia:202605.19029)

Can Adaptive Gradient Methods Converge under Heavy-Tailed Noise? A Case Study of AdaGrad

Zijian Liu


Submitted: May 19, 2026
Subjects: Mathematics

Description / Details

Many tasks in modern machine learning are observed to involve heavy-tailed gradient noise during the optimization process. To manage this realistic and challenging setting, new mechanisms, such as gradient clipping and gradient normalization, have been introduced to ensure the convergence of first-order algorithms. However, adaptive gradient methods, a famous class of modern optimizers that includes popular $\mathtt{Adam}$ and $\mathtt{AdamW}$, often perform well even without any extra operations mentioned above. It is therefore natural to ask whether adaptive gradient methods can converge under heavy-tailed noise without any algorithmic changes. In this work, we take the first step toward answering this question by investigating a special case, $\mathtt{AdaGrad}$, the origin of adaptive gradient methods. We provide the first provable convergence rate for $\mathtt{AdaGrad}$ in non-convex optimization when the tail index $p$ satisfies $4/3 < p \leq 2$. Notably, this result is achieved without requiring any prior knowledge of $p$ and is hence adaptive to the tail index. In addition, we develop an algorithm-dependent lower bound, suggesting that the existing minimax rate for heavy-tailed optimization is not attainable by $\mathtt{AdaGrad}$. Lastly, we consider $\mathtt{AdaGrad}\text{-}\mathtt{Norm}$, a popular variant of $\mathtt{AdaGrad}$ in theoretical studies, and show an improved rate that holds for any $1 < p \leq 2$ under an extra mild assumption.
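For readers unfamiliar with the two algorithms named in the abstract, the following is a minimal sketch of the textbook AdaGrad and AdaGrad-Norm updates, run on a toy quadratic with heavy-tailed noise. It is not taken from the paper: the step size `lr`, the stabilizer `eps`, the zero-initialized accumulator, the toy objective, and the Student-t noise model in the hypothetical `demo` helper are all assumptions made for illustration, and the paper's exact formulation may differ.

```python
import numpy as np


def adagrad_step(x, g, acc, lr=0.1, eps=1e-8):
    """Coordinate-wise AdaGrad: one squared-gradient accumulator per coordinate."""
    acc = acc + g * g
    x = x - lr * g / (np.sqrt(acc) + eps)
    return x, acc


def adagrad_norm_step(x, g, acc, lr=0.1, eps=1e-8):
    """AdaGrad-Norm: a single scalar accumulator of squared gradient norms."""
    acc = acc + float(np.dot(g, g))
    x = x - lr * g / (np.sqrt(acc) + eps)
    return x, acc


def demo(step_fn, steps=200, df=1.5, seed=0):
    """Hypothetical toy run: minimize 0.5 * ||x||^2 with Student-t gradient noise.

    Assumption for illustration only: Student-t noise with `df` degrees of
    freedom has finite moments only of order below `df`, so df=1.5 gives
    infinite variance.
    """
    rng = np.random.default_rng(seed)
    x = np.ones(3)
    # Vector accumulator for AdaGrad, scalar accumulator for AdaGrad-Norm.
    acc = np.zeros(3) if step_fn is adagrad_step else 0.0
    for _ in range(steps):
        noise = rng.standard_t(df, size=x.shape)
        g = x + noise  # exact gradient of 0.5 * ||x||^2 plus noise
        x, acc = step_fn(x, g, acc)
    return x


if __name__ == "__main__":
    print("AdaGrad      :", demo(adagrad_step))
    print("AdaGrad-Norm :", demo(adagrad_norm_step))
```

In this sketch the current gradient is already included in the accumulator before the division, so each coordinate of an AdaGrad step has magnitude at most `lr`, however large the sampled gradient is; for AdaGrad-Norm the same bound holds for the norm of the whole step. This is only an arithmetic property of the update as written here, not a summary of the paper's analysis.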


Source: arXiv:2605.18694v1 (http://arxiv.org/abs/2605.18694v1)
PDF: https://arxiv.org/pdf/2605.18694v1
