LAMP: Look-Ahead Mixed-Precision Inference of Large Language Models
Abstract
Mixed-precision computations are a hallmark of the current stage of AI, driving progress in large language models towards efficient, locally deployable solutions. This article addresses the floating-point computation of compositionally rich functions, concentrating on transformer inference. Based on a rounding error analysis of function compositions, we provide an adaptive strategy that selects a small subset of a composition's components to be computed more accurately, while all other computations can be carried out at lower precision. We then explain how this strategy can be applied to different compositions within a transformer and illustrate its overall effect on transformer inference. We study the effectiveness of this algorithm numerically on GPT-2 models and demonstrate that even very low recomputation rates allow for accuracy improvements of up to two orders of magnitude.
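The core idea, selectively recomputing a small fraction of a composition's components at higher precision, can be illustrated with a minimal NumPy sketch. Note that this is an assumption-laden toy: the function name `mixed_precision_sum`, the use of summation as the "composition", and the per-term rounding error as the selection criterion are all stand-ins chosen here for illustration; the paper's actual look-ahead criterion and transformer-specific application are not specified in the abstract.

```python
import numpy as np

def mixed_precision_sum(terms, recompute_rate=0.05):
    """Toy illustration of selective mixed-precision recomputation.

    All component values are rounded to low precision (float16); only the
    'recompute_rate' fraction with the largest rounding error are restored
    to high precision (float64) before the final reduction. The selection
    criterion here is the exact per-term cast error, a stand-in for the
    paper's look-ahead error estimate (hypothetical simplification).
    """
    terms = np.asarray(terms, dtype=np.float64)
    low = terms.astype(np.float16)                  # low-precision pass
    k = max(1, int(recompute_rate * len(terms)))    # tiny recomputation budget
    err = np.abs(terms - low.astype(np.float64))    # per-component rounding error
    idx = np.argsort(err)[-k:]                      # k worst-rounded components
    mixed = low.astype(np.float64)
    mixed[idx] = terms[idx]                         # recompute those accurately
    return float(np.sum(mixed))
```

With `recompute_rate=1.0` every component is recomputed and the result matches the full-precision sum exactly; with a small rate, only the components that dominate the rounding error are upgraded, mirroring the abstract's claim that a very small recomputation budget can recover most of the lost accuracy.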
Source: arXiv:2601.21623v1 - http://arxiv.org/abs/2601.21623v1 PDF: https://arxiv.org/pdf/2601.21623v1