Attention Mechanisms
The attention mechanism is a fundamental component of modern neural networks that allows models to focus on the most relevant parts of the input when producing an output. Unlike recurrent architectures that process sequences step-by-step, attention enables direct connections between any positions in a sequence, learning dependencies regardless of their distance.[^1]
Query, Key, and Value
Each token's embedding is projected into three vectors:
- Query (Q): What the token is looking for in other tokens
- Key (K): What the token offers for others to match against
- Value (V): The information the token passes on when it is attended to
The attention computation is defined as:
$$ \text{Attention}(Q, K, V) = \text{softmax}\left( \frac{QK^T}{\sqrt{d_k}} \right) V $$
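A minimal NumPy sketch of this formula; the shapes, the stable-softmax helper, and the random inputs are illustrative, not tied to any particular library:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_q, n_k) similarity scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # (n_q, d_v) weighted sum of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 query positions, d_k = 8
K = rng.normal(size=(6, 8))   # 6 key positions
V = rng.normal(size=(6, 8))   # one value vector per key
print(attention(Q, K, V).shape)   # (4, 8)
```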
Types of Attention Mechanisms
Additive vs. Multiplicative Attention
Additive
Computes the attention score with a small feed-forward network[^2]:
$$ e_{ij} = v^T \tanh(W_q q_i + W_k k_j) $$
Multiplicative
Computes the score as a dot product between query and key[^3]:
$$ e_{ij} = q_i^T k_j $$
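To make the contrast concrete, here is a small NumPy sketch scoring one query/key pair both ways; the hidden size `d_a` and the random weights `W_q`, `W_k`, `v` of the additive scorer are illustrative placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8       # query/key dimension
d_a = 16    # hidden size of the additive scoring network (illustrative)

q = rng.normal(size=d)
k = rng.normal(size=d)

# Additive (Bahdanau-style): a one-hidden-layer network scores the pair.
W_q = rng.normal(size=(d_a, d))
W_k = rng.normal(size=(d_a, d))
v = rng.normal(size=d_a)
e_additive = v @ np.tanh(W_q @ q + W_k @ k)

# Multiplicative (Luong-style): a plain dot product, no extra parameters.
e_multiplicative = q @ k

print(e_additive, e_multiplicative)
```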
Scaled Dot-Product Attention
$$ \text{Attention}(Q, K, V) = \text{softmax}\left( \frac{QK^T}{\sqrt{d_k}} \right) V $$
The scaling by $\sqrt{d_k}$ keeps the dot products from growing in magnitude with the key dimension; without it, large scores push the softmax into a saturated regime with extremely small gradients, which destabilizes training.
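A quick numerical check of why the scaling matters: for random vectors with unit-variance components, the spread of the raw dot products grows with $d_k$, while the scaled scores stay roughly constant. A small illustrative NumPy experiment:

```python
import numpy as np

rng = np.random.default_rng(0)
for d_k in (16, 64, 256, 1024):
    q = rng.normal(size=(10000, d_k))
    k = rng.normal(size=(10000, d_k))
    dots = (q * k).sum(axis=1)
    # The unscaled std grows roughly like sqrt(d_k); the scaled std stays near 1,
    # keeping the softmax out of its saturated, near-one-hot regime.
    print(d_k, dots.std(), (dots / np.sqrt(d_k)).std())
```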
Self-Attention
Self-attention allows each token to attend to every token in the same sequence, including itself. It is used in both encoder and decoder blocks.
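A minimal sketch of self-attention, where queries, keys, and values are all projections of the same sequence $X$; the projection matrices here are random placeholders rather than trained weights:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # stable softmax
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d_model, d_k = 5, 16, 8
X = rng.normal(size=(n, d_model))        # one sequence of n token embeddings

W_Q = rng.normal(size=(d_model, d_k))    # placeholder projections, not trained weights
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

# Q, K, V all come from the same sequence X, so every token can attend
# to every token, including itself.
Q, K, V = X @ W_Q, X @ W_K, X @ W_V
weights = softmax(Q @ K.T / np.sqrt(d_k))   # (n, n) attention matrix
out = weights @ V
print(out.shape)                            # (5, 8)
```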
Cross-Attention
Cross-attention connects encoder outputs to the decoder: queries come from the decoder's states, while keys and values come from the encoder's outputs, so the decoder can focus on specific input positions when generating each token.
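A minimal sketch of cross-attention with random placeholder projections and illustrative shapes:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # stable softmax
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
d_model, d_k = 16, 8
enc = rng.normal(size=(7, d_model))   # 7 encoder outputs (e.g. source tokens)
dec = rng.normal(size=(3, d_model))   # 3 decoder states (target tokens so far)

W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q = dec @ W_Q                  # queries come from the decoder
K, V = enc @ W_K, enc @ W_V    # keys and values come from the encoder
weights = softmax(Q @ K.T / np.sqrt(d_k))   # (3, 7): each decoder step over all encoder positions
out = weights @ V
print(out.shape)                            # (3, 8)
```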
Multi-Head Attention
Multi-head attention runs several attention operations ("heads") in parallel, each with its own learned projections, so that different heads can capture distinct relationships.
$$ \text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1, …, \text{head}_h) W^O $$
Each head is computed as:
$$ \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V) $$
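A sketch of this computation with $h = 4$ heads; all projection matrices are random placeholders, and the per-head dimension is chosen as $d_{\text{model}}/h$, as is conventional:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

rng = np.random.default_rng(0)
n, d_model, h = 5, 16, 4
d_head = d_model // h                       # each head works in a smaller subspace

X = rng.normal(size=(n, d_model))
# One (W_i^Q, W_i^K, W_i^V) triple per head, plus the output projection W^O.
W_Q = rng.normal(size=(h, d_model, d_head))
W_K = rng.normal(size=(h, d_model, d_head))
W_V = rng.normal(size=(h, d_model, d_head))
W_O = rng.normal(size=(h * d_head, d_model))

heads = [attention(X @ W_Q[i], X @ W_K[i], X @ W_V[i]) for i in range(h)]
out = np.concatenate(heads, axis=-1) @ W_O  # Concat(head_1, ..., head_h) W^O
print(out.shape)                            # (5, 16)
```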
Masked Attention
Masked (causal) attention prevents a token from attending to future positions, so each prediction depends only on earlier tokens. During training, this lets the decoder process whole sequences in parallel without leaking information about the tokens it is supposed to predict.
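A minimal sketch of a causal mask: positions above the diagonal are set to $-\infty$ before the softmax, so their attention weights become exactly zero (shapes and inputs are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d_k = 4, 8
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_k))

scores = Q @ K.T / np.sqrt(d_k)
# Causal mask: position i may only attend to positions j <= i.
mask = np.triu(np.ones((n, n), dtype=bool), k=1)   # True strictly above the diagonal
scores = np.where(mask, -np.inf, scores)           # -inf becomes weight 0 after softmax

weights = softmax(scores)
print(np.round(weights, 2))   # upper triangle is exactly zero
out = weights @ V
```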
Sparse, Local, and Windowed Attention
These variants handle long sequences efficiently by reducing the $O(n^2)$ cost of full attention, in which every token attends to every other token (a minimal window-mask sketch follows the list below).
Variants include:
- Sparse Attention: Each token attends only to a selected subset of positions
- Local Attention: Attention is restricted to a fixed window around each position
- Sliding Window Attention: A local window that moves with the token, often combined with a few globally attended positions
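A small sketch of the boolean mask behind local/sliding-window attention; the window size `w` is an arbitrary illustrative value, and in a real model this mask would be applied to the attention scores before the softmax:

```python
import numpy as np

n, w = 8, 2   # sequence length and one-sided window size (illustrative values)

# allowed[i, j] is True when token i may attend to token j, i.e. |i - j| <= w.
idx = np.arange(n)
allowed = np.abs(idx[:, None] - idx[None, :]) <= w

print(allowed.astype(int))
# Each token attends to at most 2*w + 1 positions, so the per-token cost is
# O(w) instead of O(n), dropping the overall cost from O(n^2) to O(n*w).
```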
Hierarchical Attention
Hierarchical attention operates at multiple levels—such as word, sentence, and paragraph—to model document structures.
Cross-Modal Attention
Cross-modal attention links different modalities (e.g., text, image, audio) by using cross-attention between encoders of each modality.
Relative and Rotary Positional Attention
Newer transformer variants integrate positional information directly into the attention score computation rather than adding absolute position embeddings to the inputs: relative-position schemes add an offset-dependent term to the scores, while rotary embeddings (RoPE) rotate the query and key vectors by position-dependent angles so that their dot product depends only on the relative offset.
Relative position example, where $R$ contains embeddings of the offset between query and key positions:
$$ \text{Attention}(Q, K, V) = \text{softmax}\left( \frac{Q(K + R)^T}{\sqrt{d_k}} \right) V $$
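A minimal sketch of this score computation, with $R$ built from embeddings of the clipped offset $j - i$ in the spirit of relative-position representations; the clipping distance and random embeddings are illustrative assumptions, and exact parameterizations vary between models:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d_k, max_dist = 6, 8, 4          # max_dist clips long-range offsets (illustrative)

Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_k))

# One embedding per clipped relative offset in [-max_dist, max_dist].
rel_emb = rng.normal(size=(2 * max_dist + 1, d_k))
offsets = np.clip(np.arange(n)[None, :] - np.arange(n)[:, None], -max_dist, max_dist)
R = rel_emb[offsets + max_dist]      # (n, n, d_k): r_{j-i} for every query/key pair

# score_ij = q_i . (k_j + r_{j-i}), i.e. Q(K + R)^T with a pair-dependent R.
scores = (Q @ K.T + np.einsum('id,ijd->ij', Q, R)) / np.sqrt(d_k)
out = softmax(scores) @ V
print(out.shape)                     # (6, 8)
```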
Summary Table
| Type | Used In | Purpose |
|---|---|---|
| Additive | RNNs | Learn flexible alignment |
| Multiplicative | RNNs / Transformers | Faster scoring |
| Scaled Dot-Product | Transformers | Stable computation |
| Self-Attention | Encoder / Decoder | Relate tokens in same sequence |
| Cross-Attention | Decoder | Link input and output |
| Masked Attention | Decoder (training) | Prevent future access |
| Multi-Head Attention | Transformer | Learn multiple relationships |
| Sparse/Local/Windowed | Long Sequence Models | Reduce computational cost |
| Hierarchical | Documents | Multi-level understanding |
| Cross-Modal | Vision, Audio, Text | Fuse modalities |
| Relative/Rotary Positional | Transformer Variants | Capture positional context |
Related Topics
- Transformers - Architecture built on attention mechanisms
- BERT - Encoder-only model using self-attention
- Prompt Engineering - Working with attention-based LLMs
References
[^1]: Vaswani, A., et al. (2017). Attention Is All You Need. NeurIPS. https://arxiv.org/abs/1706.03762
[^2]: Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural Machine Translation by Jointly Learning to Align and Translate. arXiv:1409.0473. https://arxiv.org/abs/1409.0473
[^3]: Luong, M. T., Pham, H., & Manning, C. D. (2015). Effective Approaches to Attention-based Neural Machine Translation. EMNLP. https://arxiv.org/abs/1508.04025