
YuriiFormer: A Suite of Nesterov-Accelerated Transformers

Aleksandr Zimin

Abstract

We propose a variational framework that interprets transformer layers as iterations of an optimization algorithm acting on token embeddings. In this view, self-attention implements a gradient step of an interaction energy, while MLP layers correspond to gradient updates of a potential energy. Standard GPT-style transformers emerge as vanilla gradient descent on the resulting composite objective, implemented via Lie–Trotter splitting between these two energy functionals. This perspective enables principled architectural design using classical optimization ideas. As a proof of concept, we introduce a Nesterov-style accelerated transformer that preserves the same attention and MLP oracles. The resulting architecture consistently outperforms a nanoGPT baseline on TinyStories and OpenWebText, demonstrating that optimization-theoretic insights can translate into practical gains.
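The construction described in the abstract can be sketched concretely. In a standard pre-norm block, the attention residual and the MLP residual are applied one after the other, which is the Lie–Trotter splitting of the two energy gradients; a Nesterov variant extrapolates the embeddings along the previous layer's displacement before calling the same two oracles. The PyTorch sketch below is a minimal illustration under these assumptions: the momentum coefficient `beta`, the placement of the extrapolation, and all module shapes are illustrative choices, not the paper's specification (causal masking is omitted for brevity).

```python
import torch
import torch.nn as nn

class NesterovBlock(nn.Module):
    """Hypothetical Nesterov-accelerated transformer layer.

    A vanilla pre-norm block applies the two gradient oracles sequentially
    (Lie-Trotter splitting of the composite energy):
        x <- x + attn(ln1(x));  x <- x + mlp(ln2(x))
    This sketch inserts a Nesterov-style look-ahead before the oracle calls,
    using the previous layer's embeddings as the momentum anchor. All
    hyperparameters here are assumptions, not the paper's specification.
    """

    def __init__(self, d_model: int, n_heads: int, beta: float = 0.9):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.beta = beta

    def forward(self, x: torch.Tensor, x_prev: torch.Tensor):
        # Nesterov look-ahead: extrapolate along the previous displacement.
        y = x + self.beta * (x - x_prev)
        # Attention oracle: gradient step on the interaction energy.
        h = self.ln1(y)
        a, _ = self.attn(h, h, h, need_weights=False)
        y = y + a
        # MLP oracle: gradient step on the potential energy.
        y = y + self.mlp(self.ln2(y))
        # Return the new state and the state that serves as x_prev next layer.
        return y, x
```

Stacking such blocks means threading the pair `(x, x_prev)` through the network, with `x_prev` initialized to the token embeddings; setting `beta = 0` recovers the plain split-step block, i.e. the vanilla GPT layer in this reading.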


Source: arXiv:2601.23236v1 (http://arxiv.org/abs/2601.23236v1)
PDF: https://arxiv.org/pdf/2601.23236v1

Submitted: January 30, 2026
Subjects: Mathematics
