A Separable Architecture for Continuous Token Representation in Language Models
Abstract
Transformer scaling-law analyses typically treat parameters as interchangeable, an abstraction that accurately predicts loss-compute relationships. Yet in sub-billion-parameter small language models (SLMs), embedding matrices dominate the parameter budget. This work argues that this allocation is as suboptimal as it is counterintuitive. Leviathan is an architecture that replaces the discrete lookup tables of canonical models with a continuous embedding generator. Evaluated on the Pile dataset under isoparametric settings, Leviathan consistently outperforms a standard LLaMA-style architecture. An empirical power-law fit indicates that Leviathan has a markedly superior effective parameter capacity. Across the regime studied, Leviathan behaves as a dense model with [...] to [...] more parameters.
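To make the two quantitative ideas in the abstract concrete, here is a minimal sketch in Python. It is not the paper's method: the function names, the rough parameter-count formula (about 12 · d_model² weights per Transformer block), the example configuration, and the single-term power law `loss = a * N**(-alpha)` are all illustrative assumptions. The first function estimates how much of a small model's budget sits in its embedding tables; the second inverts an assumed power law to express a measured loss as an "effective" dense parameter count.

```python
def embedding_share(vocab_size, d_model, n_layers, tied=True):
    """Rough fraction of parameters held by token embedding tables.

    Assumption: each block holds about 12 * d_model**2 weights
    (4*d^2 for attention, 8*d^2 for the feed-forward part);
    norms and biases are ignored.
    """
    tables = 1 if tied else 2  # untied models keep a separate output table
    embed = vocab_size * d_model * tables
    blocks = 12 * n_layers * d_model ** 2
    return embed / (embed + blocks)


def effective_params(loss, a, alpha):
    """Invert an assumed power law loss = a * N**(-alpha) for N.

    `a` and `alpha` are hypothetical coefficients that would come from
    fitting baseline models; they are not values from the paper.
    """
    return (loss / a) ** (-1.0 / alpha)


if __name__ == "__main__":
    # Hypothetical small configuration, chosen only for illustration.
    share = embedding_share(vocab_size=32_000, d_model=512, n_layers=8)
    print(f"embedding share (tied): {share:.1%}")

    # Hypothetical fit: a model whose loss matches that of a larger
    # baseline is credited with the larger baseline's parameter count.
    a, alpha = 10.0, 0.1
    baseline_loss = a * 1e8 ** (-alpha)
    print(f"effective parameters: {effective_params(baseline_loss, a, alpha):.3g}")
```

Under these assumptions, the ratio between a model's effective parameter count and its actual count is one way to read the claim that an architecture "behaves as" a larger dense model.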
Source: arXiv:2601.22040v1 - http://arxiv.org/abs/2601.22040v1
PDF: https://arxiv.org/pdf/2601.22040v1
Jan 29, 2026
Artificial Intelligence