Research Paper · Researchia:202601.30024

YuriiFormer: A Suite of Nesterov-Accelerated Transformers

Aleksandr Zimin

Abstract

We propose a variational framework that interprets transformer layers as iterations of an optimization algorithm acting on token embeddings. In this view, self-attention implements a gradient step of an interaction energy, while MLP layers correspond to gradient updates of a potential energy. Standard GPT-style transformers emerge as vanilla gradient descent on the resulting composite objective, implemented via Lie--Trotter splitting between these two energy functionals. This perspective enables principled architectural design using classical optimization ideas. As a proof of concept, we introduce a Nesterov-style accelerated transformer that preserves the same attention and MLP oracles. The resulting architecture consistently outperforms a nanoGPT baseline on TinyStories and OpenWebText, demonstrating that optimization-theoretic insights can translate into practical gains.
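To make the optimization reading concrete, the following is a minimal sketch (not the authors' code) of a pre-LayerNorm GPT-style block read as two split gradient steps, plus a Nesterov-style look-ahead across layers that reuses the same attention and MLP oracles. The class names, the momentum coefficient beta, the omission of causal masking, and the toy dimensions are illustrative assumptions, not details taken from the paper.

```python
# A minimal sketch of the optimization reading of a transformer, assuming a
# pre-LayerNorm GPT-style layer. All names and hyperparameters are illustrative.
import torch
import torch.nn as nn


class Block(nn.Module):
    """One pre-LN transformer layer: two residual (gradient-like) updates."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def grad_step(self, x: torch.Tensor) -> torch.Tensor:
        # Lie--Trotter split: attention as a gradient step of the interaction
        # energy, MLP as a gradient step of the potential energy, each applied
        # as a residual update. Causal masking is omitted for brevity.
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.ln2(x))
        return x


class NesterovStack(nn.Module):
    """Nesterov-style acceleration across layers, reusing the same oracles:
    each block is evaluated at a momentum look-ahead point, not at the iterate."""

    def __init__(self, n_layers: int, d_model: int, n_heads: int, beta: float = 0.9):
        super().__init__()
        self.blocks = nn.ModuleList(Block(d_model, n_heads) for _ in range(n_layers))
        self.beta = beta  # momentum coefficient (assumed hyperparameter)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_prev = x
        for block in self.blocks:
            y = x + self.beta * (x - x_prev)   # extrapolated look-ahead point
            x_prev, x = x, block.grad_step(y)  # oracle call at the look-ahead
        return x


if __name__ == "__main__":
    tokens = torch.randn(2, 16, 64)            # (batch, sequence, embedding)
    out = NesterovStack(n_layers=4, d_model=64, n_heads=4)(tokens)
    print(out.shape)                           # torch.Size([2, 16, 64])
```

With beta = 0, the stack reduces to the vanilla residual (gradient-descent) composition; the only change in the accelerated variant is that each block acts on the extrapolated point y_k = x_k + beta (x_k - x_{k-1}) rather than on x_k, mirroring Nesterov's accelerated gradient method.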

Submitted: January 30, 2026 · Subjects: Mathematics



Source: arXiv:2601.23236v1 (http://arxiv.org/abs/2601.23236v1)
PDF: https://arxiv.org/pdf/2601.23236v1
