
SageBwd: A Trainable Low-bit Attention

Jintao Zhang

Abstract

Low-bit attention, such as SageAttention, has emerged as an effective approach for accelerating model inference, but its applicability to training remains poorly understood. In prior work, we introduced SageBwd, a trainable INT8 attention that quantizes six of the seven attention matrix multiplications while preserving fine-tuning performance. However, SageBwd exhibited a persistent performance gap relative to full-precision attention (FPA) during pre-training. In this work, we investigate why this gap occurs and demonstrate that SageBwd can match full-precision attention during pre-training. Through experiments and theoretical analysis, we reach several key conclusions: (i) QK-norm is necessary for stable training at large tokens per step, (ii) quantization errors primarily arise from the backward-pass score gradient dS, (iii) reducing tokens per step enables SageBwd to match FPA performance in pre-training, and (iv) K-smoothing remains essential for training stability, while Q-smoothing provides limited benefit during pre-training.
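As a purely illustrative sketch (not the authors' kernels or exact algorithm), the PyTorch snippet below shows one way the two ingredients named in the abstract can be combined: symmetric INT8 quantization of the Q and K operands, and K-smoothing, i.e. subtracting K's per-channel mean before quantization and adding the resulting rank-1 term back to the scores exactly. Function names, shapes, and the per-tensor scaling granularity are assumptions made for this example; production kernels typically quantize per block and accumulate in INT32.

```
# Illustrative sketch only: INT8-quantized Q K^T with K-smoothing.
# Not the authors' implementation; names and granularity are assumed.
import torch

def quantize_int8(x: torch.Tensor):
    """Symmetric per-tensor INT8 quantization: returns int8 values and a scale."""
    scale = x.abs().amax().clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
    return q, scale

def smoothed_qk_scores(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """Compute attention scores S = Q K^T from INT8 operands with K-smoothing.

    K-smoothing subtracts K's mean over the token dimension before quantization,
    which reduces channel-wise outliers; the subtracted mean is restored exactly,
    since Q @ mean_k^T is a rank-1 correction shared across all key positions.
    """
    k_mean = k.mean(dim=-2, keepdim=True)            # (..., 1, head_dim)
    q_int8, q_scale = quantize_int8(q)
    k_int8, k_scale = quantize_int8(k - k_mean)
    # INT8 matmul simulated in float here; real kernels accumulate in INT32.
    s = (q_int8.float() @ k_int8.float().transpose(-1, -2)) * (q_scale * k_scale)
    s = s + q @ k_mean.transpose(-1, -2)             # add back the rank-1 term
    return s

# Usage sketch: compare against full-precision scores on random inputs.
q = torch.randn(2, 8, 128, 64)   # (batch, heads, tokens, head_dim)
k = torch.randn(2, 8, 128, 64)
err = (smoothed_qk_scores(q, k) - q @ k.transpose(-1, -2)).abs().max()
print(f"max abs error vs FP32 scores: {err.item():.4f}")
```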


Source: arXiv:2603.02170v1 (http://arxiv.org/abs/2603.02170v1)
PDF: https://arxiv.org/pdf/2603.02170v1

Submission: 3/4/2026
Subjects: Artificial Intelligence (AI)