ExplorerMathematicsMathematics
Research PaperResearchia:202607.01022

Random Reshuffling Dominates Stochastic Gradient Descent

Zijian Liu

Abstract

Stochastic Gradient Descent ($\textsf{SGD}$) is one of the most classical optimization algorithms with favorable theoretical guarantees, yet the practical implementation of $\textsf{SGD}$ differs subtly from its well-known form and is often referred to as Shuffling Stochastic Gradient Descent ($\textsf{Shuffling SGD}$). A particularly popular strategy in $\textsf{Shuffling SGD}$ is Random Reshuffling ($\textsf{RR}$), which has achieved great empirical success across numerous experiments. Despite...

Submitted: July 1, 2026Subjects: Mathematics; Mathematics

Description / Details

Stochastic Gradient Descent (SGD\textsf{SGD}) is one of the most classical optimization algorithms with favorable theoretical guarantees, yet the practical implementation of SGD\textsf{SGD} differs subtly from its well-known form and is often referred to as Shuffling Stochastic Gradient Descent (Shuffling SGD\textsf{Shuffling SGD}). A particularly popular strategy in Shuffling SGD\textsf{Shuffling SGD} is Random Reshuffling (RR\textsf{RR}), which has achieved great empirical success across numerous experiments. Despite its strong performance, RR\textsf{RR} has long been considered a heuristic due to a lack of theoretical support. Over the last decade, people have finally established provable convergence rates for RR\textsf{RR}, thus justifying its observed superiority. However, for smooth convex optimization, two clouds over the convergence theory of RR\textsf{RR} remain to this day. More precisely, according to the current theory, Shuffling SGD\textsf{Shuffling SGD} under RR\textsf{RR} converges only when the stepsize is smaller than a threshold proportional to 1/n1/n, where nn is the number of summands in the objective (or the number of data points). Consequently, the optimally tuned theoretical rate of Shuffling SGD\textsf{Shuffling SGD} under RR\textsf{RR} is strictly worse than that of SGD\textsf{SGD} when the number of epochs is smaller than another threshold proportional to nn. These two restrictions heavily limit the applicability of existing theories and leave a critical mismatch with practice. In this work, for the first time, we prove that RR\textsf{RR} dominates SGD\textsf{SGD} in smooth convex optimization under any reasonable stepsize after any finite number of epochs, thereby addressing a longstanding open question.


Source: arXiv:2606.32005v1 - http://arxiv.org/abs/2606.32005v1 PDF: https://arxiv.org/pdf/2606.32005v1 Original Link: http://arxiv.org/abs/2606.32005v1

Please sign in to join the discussion.

No comments yet. Be the first to share your thoughts!

Access Paper
View Source PDF
Submission Info
Date:
Jul 1, 2026
Topic:
Mathematics
Area:
Mathematics
Comments:
0
Bookmark