Attention Mechanisms

The attention mechanism is a fundamental component of modern neural networks that allows a model to focus on the most relevant parts of its input when producing an output. Unlike recurrent architectures, which process sequences step by step, attention creates direct connections between any two positions in a sequence, so dependencies can be learned regardless of their distance. [1]


Query, Key, and Value

Each token is represented by three vectors:

  • Query (Q): What the token is looking for in other tokens
  • Key (K): What the token offers to be matched against
  • Value (V): The information the token contributes to the output once it is attended to

The attention computation is defined as:

$$ \text{Attention}(Q, K, V) = \text{softmax}\left( \frac{QK^T}{\sqrt{d_k}} \right) V $$
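
As a concrete illustration, here is a minimal NumPy sketch of this formula. The `softmax` helper, the function names, and the random toy inputs are illustrative, not taken from any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention for a single (unbatched) sequence.

    Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> returns (n_q, d_v)
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_q, n_k) similarity scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted sum of value vectors

# Toy example: 4 tokens with d_k = d_v = 8.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
out = attention(X, X, X)                 # Q = K = V: attention over one sequence
print(out.shape)                         # (4, 8)
```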

Types of Attention Mechanisms

Additive vs. Multiplicative Attention

Additive

Computes the attention score using a small feed-forward network [2]:

$$ e_{ij} = v^T \tanh(W_q q_i + W_k k_j) $$

Multiplicative

Computes the score as a dot product between query and key [3]:

$$ e_{ij} = q_i^T k_j $$
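
A small NumPy sketch contrasting the two scoring functions. The parameters W_q, W_k, and v are random stand-ins for learned weights, and the shapes are illustrative.

```python
import numpy as np

d = 8
rng = np.random.default_rng(1)
q = rng.normal(size=d)        # one query vector
K = rng.normal(size=(5, d))   # five key vectors

# Multiplicative (Luong-style) score: a plain dot product per key.
mult_scores = K @ q                              # shape (5,)

# Additive (Bahdanau-style) score: a one-hidden-layer network,
# e_ij = v^T tanh(W_q q + W_k k_j), with random illustrative parameters.
W_q = rng.normal(size=(d, d))
W_k = rng.normal(size=(d, d))
v = rng.normal(size=d)
add_scores = np.tanh(q @ W_q.T + K @ W_k.T) @ v  # shape (5,)

print(mult_scores.shape, add_scores.shape)
```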

Scaled Dot-Product Attention

$$ \text{Attention}(Q, K, V) = \text{softmax}\left( \frac{QK^T}{\sqrt{d_k}} \right) V $$

The scaling by $\sqrt{d_k}$ keeps the dot products from growing with the dimension $d_k$; without it, large scores push the softmax into a saturated region with very small gradients.
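
A quick numerical check of this claim, assuming unit-variance random queries and keys: the standard deviation of the raw dot products grows like $\sqrt{d_k}$, while the scaled scores stay near 1.

```python
import numpy as np

# Dot products of random unit-variance vectors have variance ~ d_k,
# so their magnitude grows with dimension; dividing by sqrt(d_k)
# keeps the softmax input in a well-behaved range.
rng = np.random.default_rng(2)
for d_k in (16, 256, 4096):
    q = rng.normal(size=(10000, d_k))
    k = rng.normal(size=(10000, d_k))
    dots = (q * k).sum(axis=1)
    print(f"d_k={d_k:5d}  std(q.k)={dots.std():8.1f}  "
          f"std(q.k/sqrt(d_k))={(dots / np.sqrt(d_k)).std():.2f}")
```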

Self-Attention

Self-attention allows each token to attend to all other tokens in the same sequence, including itself. It is used in both encoder and decoder blocks.
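A sketch of self-attention that reuses the `attention` function from the first example: queries, keys, and values are all projections of the same sequence X. The projection matrices here are random stand-ins for learned weights.

```python
import numpy as np

# Self-attention: Q, K, and V all come from the same input sequence X.
rng = np.random.default_rng(3)
n, d_model = 6, 16
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = attention(X @ W_Q, X @ W_K, X @ W_V)   # every token attends to every token
print(out.shape)                              # (6, 16)
```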

Cross-Attention

Cross-attention connects the decoder to the encoder outputs: queries come from the decoder, while keys and values come from the encoder. This lets the decoder focus on the most relevant source positions when generating each token.
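A sketch of cross-attention, again reusing the `attention` function from the first example; the encoder outputs and decoder states below are random placeholders.

```python
import numpy as np

# Cross-attention: queries from the decoder state, keys and values
# from the encoder outputs.
rng = np.random.default_rng(4)
d_model = 16
enc_out = rng.normal(size=(10, d_model))    # 10 encoded source tokens
dec_state = rng.normal(size=(3, d_model))   # 3 target tokens generated so far
out = attention(dec_state, enc_out, enc_out)
print(out.shape)                             # (3, 16): one context vector per target token
```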

Multi-Head Attention

Multi-head attention runs several attention operations in parallel, each in its own lower-dimensional subspace, so different heads can learn distinct relationships.

$$ \text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1, …, \text{head}_h) W^O $$

Each head is computed as:

$$ \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V) $$
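
A minimal multi-head sketch (no batching, biases, or dropout), reusing `attention` from the first example. Slicing one large random projection matrix into per-head column blocks stands in for the per-head matrices $W_i^Q, W_i^K, W_i^V$ in the formula.

```python
import numpy as np

def multi_head_attention(Q, K, V, W_Q, W_K, W_V, W_O, n_heads):
    d_model = Q.shape[-1]
    d_head = d_model // n_heads
    heads = []
    for i in range(n_heads):
        sl = slice(i * d_head, (i + 1) * d_head)
        # Each head uses its own column slice of the projection matrices.
        heads.append(attention(Q @ W_Q[:, sl], K @ W_K[:, sl], V @ W_V[:, sl]))
    # Concat(head_1, ..., head_h) W^O
    return np.concatenate(heads, axis=-1) @ W_O

rng = np.random.default_rng(5)
n, d_model, n_heads = 6, 16, 4
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) for _ in range(4))
print(multi_head_attention(X, X, X, W_Q, W_K, W_V, W_O, n_heads).shape)  # (6, 16)
```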

Masked Attention

Masked (causal) attention prevents a token from attending to future positions, so each prediction depends only on earlier tokens. This preserves the autoregressive property when the decoder is trained on full target sequences in parallel.
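A sketch of causal masking, reusing the `softmax` helper from the first example: scores above the diagonal are set to $-\infty$ before the softmax, so their attention weights become exactly zero.

```python
import numpy as np

def causal_attention(Q, K, V):
    n, d_k = Q.shape[0], Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True strictly above the diagonal
    scores = np.where(mask, -np.inf, scores)          # future positions get -inf
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(6)
X = rng.normal(size=(5, 8))
out = causal_attention(X, X, X)
print(out.shape)   # (5, 8); token i only used tokens 0..i
```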

Sparse, Local, and Windowed Attention

These variants handle long sequences efficiently by reducing the quadratic $O(n^2)$ cost of full attention. They include:

  • Sparse Attention: each token attends only to a selected subset of positions
  • Local Attention: attention is restricted to a fixed window around each token
  • Sliding Window Attention: the window moves with the query position, sometimes combined with a few global tokens (see the mask sketch below)
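
A sketch of a sliding-window mask, assuming a symmetric window of two positions on each side; entries marked `True` are the only query-key pairs that would be scored.

```python
import numpy as np

# Sliding-window attention mask: token i may attend only to positions
# within `window` steps of i, reducing cost from O(n^2) toward O(n * window).
n, window = 8, 2
idx = np.arange(n)
mask = np.abs(idx[:, None] - idx[None, :]) <= window   # True where attention is allowed
print(mask.astype(int))
```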

Hierarchical Attention

Hierarchical attention operates at multiple levels—such as word, sentence, and paragraph—to model document structures.

Cross-Modal Attention

Cross-modal attention links different modalities (e.g., text, image, audio) by using cross-attention between encoders of each modality.

Relative and Rotary Positional Attention

Newer Transformer variants integrate relative position information directly into the attention score computation instead of adding absolute position encodings to the input embeddings.

A simplified relative-position example, where $R$ holds position-dependent embeddings added to the keys:

$$ \text{Attention}(Q, K, V) = \text{softmax}\left( \frac{Q(K + R)^T}{\sqrt{d_k}} \right) V $$
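
A sketch in the spirit of this formulation (closer to Shaw-style relative embeddings than to rotary embeddings), reusing the `softmax` helper from the first example. The embedding table `rel_emb` and the clipping range `max_rel` are illustrative assumptions.

```python
import numpy as np

def relative_attention(Q, K, V, rel_emb, max_rel=4):
    n, d_k = Q.shape
    # Clipped offset i - j for every query-key pair, used to look up r_{i-j}.
    offsets = np.clip(np.arange(n)[:, None] - np.arange(n)[None, :], -max_rel, max_rel)
    R = rel_emb[offsets + max_rel]                               # (n, n, d_k)
    scores = np.einsum('id,ijd->ij', Q, K[None, :, :] + R) / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(7)
n, d_k, max_rel = 6, 8, 4
rel_emb = rng.normal(size=(2 * max_rel + 1, d_k))  # one embedding per clipped offset
X = rng.normal(size=(n, d_k))
print(relative_attention(X, X, X, rel_emb, max_rel).shape)  # (6, 8)
```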

Summary Table

| Type | Used In | Purpose |
| --- | --- | --- |
| Additive | RNNs | Learn flexible alignment |
| Multiplicative | RNNs / Transformers | Faster scoring |
| Scaled Dot-Product | Transformers | Stable computation |
| Self-Attention | Encoder / Decoder | Relate tokens in the same sequence |
| Cross-Attention | Decoder | Link input and output |
| Masked Attention | Decoder (training) | Prevent future access |
| Multi-Head Attention | Transformer | Learn multiple relationships |
| Sparse / Local / Windowed | Long-sequence models | Reduce computational cost |
| Hierarchical | Documents | Multi-level understanding |
| Cross-Modal | Vision, audio, text | Fuse modalities |
| Relative / Rotary Positional | Transformer variants | Capture positional context |

Related Topics

  • Transformers - Architecture built on attention mechanisms
  • BERT - Encoder-only model using self-attention
  • Prompt Engineering - Working with attention-based LLMs

References


  1. Vaswani, A., et al. (2017). Attention Is All You Need. NeurIPS. https://arxiv.org/abs/1706.03762

  2. Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural Machine Translation by Jointly Learning to Align and Translate. arXiv:1409.0473. https://arxiv.org/abs/1409.0473

  3. Luong, M.-T., Pham, H., & Manning, C. D. (2015). Effective Approaches to Attention-based Neural Machine Translation. EMNLP. https://arxiv.org/abs/1508.04025