Research Paper
Researchia:202606.15033

Free Heavy-Tailed Lunch for Muon: A Theoretical Justification of Empirical Success

Florian Hübler


Submitted: June 15, 2026
Subjects: Statistics; Data Science

Description / Details

Non-Euclidean optimisation methods with matrix-valued updates, such as Muon and Scion, have recently shown strong empirical performance for training Transformer models, yet their theoretical advantages over Euclidean methods remain poorly understood. We address this gap in the heavy-tailed non-convex regime, where stochastic gradients have bounded $p$-th central moments, $p \in (1,2]$. We show that certain non-Euclidean methods achieve optimal sample complexity under stronger stationarity measures, while Euclidean methods incur additional dimension-dependent costs. As a consequence, for $m \times n$ matrices, Muon finds an $\varepsilon$-stationary point in nuclear norm within $\mathcal{O}\left(\min\{m, n\} \frac{\Delta_1 L}{\varepsilon^2} \left(\frac{\sigma}{\varepsilon}\right)^{\frac{p}{p-1}}\right)$ samples, absorbing heavy-tailed noise without extra dimension dependence, unlike Euclidean methods. We further prove this sample complexity, including its dimension dependence, is optimal for all first-order methods under nuclear-norm stationarity. Experiments on large language models support our theory. Surprisingly, our results suggest that other Schatten geometries beyond the spectral geometry of Muon can perform competitively in certain settings.
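
To get a feel for how the stated bound scales, the following minimal sketch evaluates $\min\{m, n\} \frac{\Delta_1 L}{\varepsilon^2} \left(\frac{\sigma}{\varepsilon}\right)^{\frac{p}{p-1}}$ numerically. It rests on several assumptions: the constant hidden by $\mathcal{O}(\cdot)$ is set to 1, the function and parameter names are hypothetical, and the abstract does not define $\Delta_1$, $L$, or $\sigma$, so they are treated here simply as positive inputs. The output indicates scaling only and is not a predicted sample count.

```python
def muon_sample_bound(m, n, delta1, smoothness, sigma, eps, p):
    """Evaluate min{m, n} * (delta1 * L / eps^2) * (sigma / eps)^(p / (p - 1)).

    Assumption: the constant hidden by the O(.) notation is taken to be 1,
    so the result shows how the bound scales, not an actual sample count.
    """
    if not 1.0 < p <= 2.0:
        raise ValueError("p must lie in (1, 2]")
    if min(m, n, delta1, smoothness, sigma, eps) <= 0:
        raise ValueError("all other arguments must be positive")
    noise_exponent = p / (p - 1.0)  # exponent on sigma / eps
    return min(m, n) * (delta1 * smoothness / eps**2) * (sigma / eps) ** noise_exponent


def epsilon_exponent(p):
    """Total power of 1/eps in the bound: 2 + p / (p - 1)."""
    if not 1.0 < p <= 2.0:
        raise ValueError("p must lie in (1, 2]")
    return 2.0 + p / (p - 1.0)


if __name__ == "__main__":
    # Heavier tails (smaller p) raise the power of 1/eps in the bound.
    for p in (2.0, 1.5, 1.25):
        print(p, epsilon_exponent(p))
```

In the bounded-variance case $p = 2$ the bound scales as $\varepsilon^{-4}$, and the power of $1/\varepsilon$ grows without limit as $p \to 1$. The matrix dimensions enter only through the factor $\min\{m, n\}$.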


Source: arXiv:2606.14560v1 - http://arxiv.org/abs/2606.14560v1
PDF: https://arxiv.org/pdf/2606.14560v1
