Research Paper · Researchia:202603.19005

Beyond Muon: MUD (MomentUm Decorrelation) for Faster Transformer Training

Ben S. Southworth

Abstract

Orthogonalized-momentum optimizers such as Muon improve transformer training by approximately whitening/orthogonalizing matrix-valued momentum updates via a short polar-decomposition iteration. However, polar-factor approximations typically require multiple large matrix multiplications, and the resulting overhead can be substantial and hardware-dependent. We introduce MUD (MomentUm Decorrelation), a complementary whitening approach that replaces Muon's polar update with a triangular (Cholesky-like) whitening surrogate inspired by classical Gram-Schmidt and Gauss-Seidel ideas. We show that row-orthonormal matrices are fixed points of the MUD map, relate the inner step to symmetric Gauss-Seidel preconditioning of the Gram matrix, and prove quadratic local convergence near the fixed point. In terms of time-to-perplexity, MUD yields consistent 10-50% wall-clock improvements over tuned AdamW and Muon, typically converging slightly slower per step than Muon but with substantially lower optimizer overhead: relative to Muon, MUD improves peak tokens/s by roughly 1.3-2.6x across most settings and up to nearly 3x on GPT-2 large on an A100. We also demonstrate training an ESM-2 150M protein language model, where MUD matches Muon-level validation perplexity in significantly less wall-clock time.

Submitted: March 19, 2026
Subjects: Machine Learning; Data Science
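To make the core idea concrete, here is a minimal sketch of triangular (Cholesky-based) row-whitening of a matrix-valued momentum, the kind of surrogate the abstract contrasts with Muon's polar iteration. This is an illustrative reconstruction, not the paper's exact MUD algorithm: the function name and the `eps` regularizer are assumptions. If the Gram matrix factors as G = M M^T = L L^T, then W = L^{-1} M satisfies W W^T = I, so W has orthonormal rows; this also shows why row-orthonormal matrices are fixed points (then G = I, L = I, and W = M).

```python
import numpy as np

def cholesky_whiten(M, eps=1e-8):
    """Illustrative Cholesky-like whitening of a momentum matrix M.

    Returns W = L^{-1} M where L L^T = M M^T (plus a tiny ridge eps
    for numerical safety), so the rows of W are orthonormal.
    """
    G = M @ M.T + eps * np.eye(M.shape[0])  # Gram matrix, regularized
    L = np.linalg.cholesky(G)               # triangular factor
    # Solve the triangular system L W = M instead of forming L^{-1}
    return np.linalg.solve(L, M)

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 16))
W = cholesky_whiten(M)
print(np.allclose(W @ W.T, np.eye(4), atol=1e-5))  # True: rows orthonormal
print(np.allclose(cholesky_whiten(W), W, atol=1e-4))  # True: fixed point
```

The triangular solve costs one Gram product and one Cholesky factorization, which is the overhead contrast with the several large matrix multiplications a polar-iteration step typically needs.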


Source: arXiv:2603.17970v1 (http://arxiv.org/abs/2603.17970v1)
PDF: https://arxiv.org/pdf/2603.17970v1

