Back to Explorer
Research PaperResearchia:202512.25bce784[Machine Learning > Data Science]

Attention (machine learning)

Marcus Jensen (University of Copenhagen)

Abstract

Attention (machine learning)

In machine learning, attention is a method that determines the importance of each component in a sequence relative to the other components in that sequence. In natural language processing, importance is represented by "soft" weights assigned to each word in a sentence. More generally, attention encodes vectors called token embeddings across a fixed-width sequence that can range from tens to millions of tokens in size. Unlike "hard" weights, which are computed during the backwards training pass, "soft" weights exist only in the forward pass and therefore change with every step of the input. Earlier designs implemented the attention mechanism in a serial recurrent neural network (RNN) language translation system, but a more recent design, namely the transformer, removed the slower sequential RNN and relied more heavily on the faster parallel attention scheme. Inspired by ideas about attention in humans, the attention mechanism was developed to address the weaknesses of using information from the hidden layers of recurrent neural networks. Recurrent neural networks favor more recent information contained in words at the end of a sentence, while information earlier in the sentence tends to be attenuated. Attention allows a token equal access to any part of a sentence directly, rather than only through the previous state.

== History ==

Additional surveys of the attention mechanism in deep learning are provided by Niu et al. and Soydaner. The major breakthrough came with self-attention, where each element in the input sequence attends to all others, enabling the model to capture global dependencies. This idea was central to the Transformer architecture, which replaced recurrence with attention mechanisms. As a result, Transformers became the foundation for models like BERT, T5 and generative pre-trained transformers (GPT).

== Overview ==

The modern era of machine attention was revitalized by grafting an attention mechanism (Fig 1. orange) to an Encoder-Decoder.

Figure 2 shows the internal step-by-step operation of the attention block (A) in Fig 1.

=== Interpreting attention weights === In translating between languages, alignment is the process of matching words from the source sentence to words of the translated sentence. Networks that perform verbatim translation without regard to word order would show the highest scores along the (dominant) diagonal of the matrix. The off-diagonal dominance shows that the attention mechanism is more nuanced. Consider an example of translating I love you to French. On the first pass through the decoder, 94% of the attention weight is on the first English word I, so the network offers the word je. On the second pass of the decoder, 88% of the attention weight is on the third English word you, so it offers t'. On the last pass, 95% of the attention weight is on the second English word love, so it offers aime. In the I love you example, the second word love is aligned with the third word aime. Stacking soft row vectors together for je, t', and aime yields an alignment matrix:

Sometimes, alignment can be multiple-to-multiple. For example, the English phrase look it up corresponds to cherchez-le. Thus, "soft" attention weights work better than "hard" attention weights (setting one attention weight to 1, and the others to 0), as we would like the model to make a context vector consisting of a weighted sum of the hidden vectors, rather than "the best one", as there may not be a best hidden vector.

== Variants ==

Many variants of attention implement soft weights, such as

fast weight programmers, or fast weight controllers (1992). A "slow" neural network outputs the "fast" weights of another neural network through outer products. The slow network learns by gradient descent. It was later renamed as "linearized self-attention". Bahdanau-style attention, also referred to as additive attention, Luong-style attention, which is known as multiplicative attention, Early attention mechanisms similar to modern self-attention were proposed using recurrent neural networks. However, the highly parallelizable self-attention was introduced in 2017 and successfully used in the Transformer model, positional attention and factorized positional attention. For convolutional neural networks, attention mechanisms can be distinguished by the dimension on which they operate, namely: spatial attention, channel attention, or combinations. These variants recombine the encoder-side inputs to redistribute those effects to each target output. Often, a correlation-style matrix of dot products provides the re-weighting coefficients. In the figures below, W is the matrix of context attention weights, similar to the formula in Overview section above.

== Optimizations ==

=== Flash attention === The size of the attention matrix is proportional to the square of the number of input tokens. Therefore, when the input is long, calculating the attention matrix requires a lot of GPU memory. Flash attention is an implementation that reduces the memory needs and increases efficiency without sacrificing accuracy. It achieves this by partitioning the attention computation into smaller blocks that fit into the GPU's faster on-chip memory, reducing the need to store large intermediate matrices and thus lowering memory usage while increasing computational efficiency.

=== FlexAttention === FlexAttention is an attention kernel developed by Meta that allows users to modify attention scores prior to softmax and dynamically chooses the optimal attention algorithm.

== Applications == Attention is widely used in natural language processing, computer vision, and speech recognition. In NLP, it improves context understanding in tasks like question answering and summarization. In vision, visual attention helps models focus on relevant image regions, enhancing object detection and image captioning.

=== Attention maps as explanations for vision transformers ===

From the original paper on vision transformers (ViT), visualizing attention scores as a heat map (called saliency maps or attention maps) has become an important and routine way to inspect the decision making process of ViT models. One can compute the attention maps with respect to any attention head at any layer, while the deeper layers tend to show more semantically meaningful visualization. Attention rollout is a recursive algorithm to combine attention scores across all layers, by computing the dot product of successive attention maps. Because vision transformers are typically trained in a self-supervised manner, attention maps are generally not class-sensitive. When a classification head attached to the ViT backbone, class-discriminative attention maps (CDAM) combines attention maps and gradients with respect to the class [CLS] token. Some class-sensitive interpretability methods originally developed for convolutional neural networks can be also applied to ViT, such as GradCAM, which back-propagates the gradients to the outputs of the final attention layer. Using attention as basis of explanation for the transformers in language and vision is not without debate. While some pioneering papers analyzed and framed attention scores as explanations, higher attention scores do not always correlate with greater impact on model performances.

== Mathematical representation ==

=== Standard scaled dot-product attention === For matrices: V∈RnΓ—dvV\in \mathbb {R} ^{n\times d_{v}}
, the scaled dot-product, or QKV attention, is defined as: mm
-by- VV
. To understand the permutation invariance and permutation equivariance properties of QKV attention, let QQ
); and invariant to re-ordering of the key-value pairs in K,VK,V
. These properties are inherited when applying linear transforms to the inputs and outputs of QKV attention blocks. For example, a simple self-attention function defined as: XX

=== Masked attention === When QKV attention is used as a building block for an autoregressive decoder, and when at training time all input and output matrices have 1≀i<j≀n1\leq i<j\leq n
, row jj

=== Multi-head attention ===

Multi-head attention WiQ,WiK,WiVW_{i}^{Q},W_{i}^{K},W_{i}^{V}
, and WOW^{O} The permutation properties of (standard, unmasked) QKV attention apply here also. For permutation matrices, A,BA,B
: XX
.

=== Bahdanau (additive) attention === WKW_{K}

=== Luong attention (general) ===

      Attention
    
    (
    Q
    ,
    K
    ,
    V
    )
    =
    
      softmax
    
    (
   ...

(Article truncated for display)

Source

This content is sourced from Wikipedia, the free encyclopedia. Read full article on Wikipedia

Category

Machine Learning - Data Science

Submission:12/25/2025
Comments:0 comments
Subjects:Data Science; Machine Learning
Was this helpful?

Discussion (0)

Please sign in to join the discussion.

No comments yet. Be the first to share your thoughts!

Attention (machine learning) | Researchia