Attention Mechanisms

The attention mechanism is a fundamental component of modern neural networks that allows a model to focus on the most relevant parts of its input when producing an output. Unlike recurrent architectures, which process sequences step by step, attention creates direct connections between any two positions in a sequence, so dependencies can be learned regardless of their distance. [1]


Query, Key, and Value

Each token is represented by three vectors:

  • Query (Q): What the token is looking for in other tokens
  • Key (K): What the token offers to be matched against
  • Value (V): The information the token contributes to the output once it is attended to

The attention computation is defined as:

$$ \text{Attention}(Q, K, V) = \text{softmax}\left( \frac{QK^T}{\sqrt{d_k}} \right) V $$
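
As a concrete illustration, here is a minimal NumPy sketch of this formula. The `softmax` helper, the function names, and the random toy inputs are illustrative, not taken from any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention for a single (unbatched) sequence.

    Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> returns (n_q, d_v)
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_q, n_k) similarity scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted sum of value vectors

# Toy example: 4 tokens with d_k = d_v = 8.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
out = attention(X, X, X)                 # Q = K = V: attention over one sequence
print(out.shape)                         # (4, 8)
```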

Types of Attention Mechanisms

Additive vs. Multiplicative Attention

Additive

Computes the attention score using a small feed-forward network [2]:

$$ e_{ij} = v^T \tanh(W_q q_i + W_k k_j) $$

Multiplicative

Computes the score as a dot product between query and key [3]:

$$ e_{ij} = q_i^T k_j $$
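
A small NumPy sketch contrasting the two scoring functions. The parameters W_q, W_k, and v are random stand-ins for learned weights, and the shapes are illustrative.

```python
import numpy as np

d = 8
rng = np.random.default_rng(1)
q = rng.normal(size=d)        # one query vector
K = rng.normal(size=(5, d))   # five key vectors

# Multiplicative (Luong-style) score: a plain dot product per key.
mult_scores = K @ q                              # shape (5,)

# Additive (Bahdanau-style) score: a one-hidden-layer network,
# e_ij = v^T tanh(W_q q + W_k k_j), with random illustrative parameters.
W_q = rng.normal(size=(d, d))
W_k = rng.normal(size=(d, d))
v = rng.normal(size=d)
add_scores = np.tanh(q @ W_q.T + K @ W_k.T) @ v  # shape (5,)

print(mult_scores.shape, add_scores.shape)
```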

Scaled Dot-Product Attention

$$ \text{Attention}(Q, K, V) = \text{softmax}\left( \frac{QK^T}{\sqrt{d_k}} \right) V $$

The scaling by $\sqrt{d_k}$ keeps the dot products from growing with the dimension $d_k$; without it, large scores push the softmax into a saturated region with very small gradients.
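
A quick numerical check of this claim, assuming unit-variance random queries and keys: the standard deviation of the raw dot products grows like $\sqrt{d_k}$, while the scaled scores stay near 1.

```python
import numpy as np

# Dot products of random unit-variance vectors have variance ~ d_k,
# so their magnitude grows with dimension; dividing by sqrt(d_k)
# keeps the softmax input in a well-behaved range.
rng = np.random.default_rng(2)
for d_k in (16, 256, 4096):
    q = rng.normal(size=(10000, d_k))
    k = rng.normal(size=(10000, d_k))
    dots = (q * k).sum(axis=1)
    print(f"d_k={d_k:5d}  std(q.k)={dots.std():8.1f}  "
          f"std(q.k/sqrt(d_k))={(dots / np.sqrt(d_k)).std():.2f}")
```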

Self-Attention

Self-attention allows each token to attend to all other tokens in the same sequence, including itself. It is used in both encoder and decoder blocks.
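A sketch of self-attention that reuses the `attention` function from the first example: queries, keys, and values are all projections of the same sequence X. The projection matrices here are random stand-ins for learned weights.

```python
import numpy as np

# Self-attention: Q, K, and V all come from the same input sequence X.
rng = np.random.default_rng(3)
n, d_model = 6, 16
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = attention(X @ W_Q, X @ W_K, X @ W_V)   # every token attends to every token
print(out.shape)                              # (6, 16)
```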

Cross-Attention

Cross-attention connects the decoder to the encoder outputs: queries come from the decoder, while keys and values come from the encoder. This lets the decoder focus on the most relevant source positions when generating each token.
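A sketch of cross-attention, again reusing the `attention` function from the first example; the encoder outputs and decoder states below are random placeholders.

```python
import numpy as np

# Cross-attention: queries from the decoder state, keys and values
# from the encoder outputs.
rng = np.random.default_rng(4)
d_model = 16
enc_out = rng.normal(size=(10, d_model))    # 10 encoded source tokens
dec_state = rng.normal(size=(3, d_model))   # 3 target tokens generated so far
out = attention(dec_state, enc_out, enc_out)
print(out.shape)                             # (3, 16): one context vector per target token
```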

Multi-Head Attention

Multi-head attention runs several attention operations in parallel, each in its own lower-dimensional subspace, so different heads can learn distinct relationships.

$$ \text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1, …, \text{head}_h) W^O $$

Each head is computed as:

$$ \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V) $$
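
A minimal multi-head sketch (no batching, biases, or dropout), reusing `attention` from the first example. Slicing one large random projection matrix into per-head column blocks stands in for the per-head matrices $W_i^Q, W_i^K, W_i^V$ in the formula.

```python
import numpy as np

def multi_head_attention(Q, K, V, W_Q, W_K, W_V, W_O, n_heads):
    d_model = Q.shape[-1]
    d_head = d_model // n_heads
    heads = []
    for i in range(n_heads):
        sl = slice(i * d_head, (i + 1) * d_head)
        # Each head uses its own column slice of the projection matrices.
        heads.append(attention(Q @ W_Q[:, sl], K @ W_K[:, sl], V @ W_V[:, sl]))
    # Concat(head_1, ..., head_h) W^O
    return np.concatenate(heads, axis=-1) @ W_O

rng = np.random.default_rng(5)
n, d_model, n_heads = 6, 16, 4
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) for _ in range(4))
print(multi_head_attention(X, X, X, W_Q, W_K, W_V, W_O, n_heads).shape)  # (6, 16)
```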

Masked Attention

Masked (causal) attention prevents a token from attending to future positions, so each prediction depends only on earlier tokens. This preserves the autoregressive property when the decoder is trained on full target sequences in parallel.
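A sketch of causal masking, reusing the `softmax` helper from the first example: scores above the diagonal are set to $-\infty$ before the softmax, so their attention weights become exactly zero.

```python
import numpy as np

def causal_attention(Q, K, V):
    n, d_k = Q.shape[0], Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True strictly above the diagonal
    scores = np.where(mask, -np.inf, scores)          # future positions get -inf
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(6)
X = rng.normal(size=(5, 8))
out = causal_attention(X, X, X)
print(out.shape)   # (5, 8); token i only used tokens 0..i
```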

Sparse, Local, and Windowed Attention

These variants handle long sequences efficiently by reducing the quadratic $O(n^2)$ cost of full attention. They include:

  • Sparse Attention: each token attends only to a selected subset of positions
  • Local Attention: attention is restricted to a fixed window around each token
  • Sliding Window Attention: the window moves with the query position, sometimes combined with a few global tokens (see the mask sketch below)
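
A sketch of a sliding-window mask, assuming a symmetric window of two positions on each side; entries marked `True` are the only query-key pairs that would be scored.

```python
import numpy as np

# Sliding-window attention mask: token i may attend only to positions
# within `window` steps of i, reducing cost from O(n^2) toward O(n * window).
n, window = 8, 2
idx = np.arange(n)
mask = np.abs(idx[:, None] - idx[None, :]) <= window   # True where attention is allowed
print(mask.astype(int))
```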

Hierarchical Attention

Hierarchical attention operates at multiple levels—such as word, sentence, and paragraph—to model document structures.

Cross-Modal Attention

Cross-modal attention links different modalities (e.g., text, image, audio) by using cross-attention between encoders of each modality.

Relative and Rotary Positional Attention

Newer Transformer variants integrate relative position information directly into the attention score computation instead of adding absolute position encodings to the input embeddings.

A simplified relative-position example, where $R$ holds position-dependent embeddings added to the keys:

$$ \text{Attention}(Q, K, V) = \text{softmax}\left( \frac{Q(K + R)^T}{\sqrt{d_k}} \right) V $$
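
A sketch in the spirit of this formulation (closer to Shaw-style relative embeddings than to rotary embeddings), reusing the `softmax` helper from the first example. The embedding table `rel_emb` and the clipping range `max_rel` are illustrative assumptions.

```python
import numpy as np

def relative_attention(Q, K, V, rel_emb, max_rel=4):
    n, d_k = Q.shape
    # Clipped offset i - j for every query-key pair, used to look up r_{i-j}.
    offsets = np.clip(np.arange(n)[:, None] - np.arange(n)[None, :], -max_rel, max_rel)
    R = rel_emb[offsets + max_rel]                               # (n, n, d_k)
    scores = np.einsum('id,ijd->ij', Q, K[None, :, :] + R) / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(7)
n, d_k, max_rel = 6, 8, 4
rel_emb = rng.normal(size=(2 * max_rel + 1, d_k))  # one embedding per clipped offset
X = rng.normal(size=(n, d_k))
print(relative_attention(X, X, X, rel_emb, max_rel).shape)  # (6, 8)
```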

Summary Table

| Type | Used In | Purpose |
| --- | --- | --- |
| Additive | RNNs | Learn flexible alignment |
| Multiplicative | RNNs / Transformers | Faster scoring |
| Scaled Dot-Product | Transformers | Stable computation |
| Self-Attention | Encoder / Decoder | Relate tokens in the same sequence |
| Cross-Attention | Decoder | Link input and output |
| Masked Attention | Decoder (training) | Prevent future access |
| Multi-Head Attention | Transformer | Learn multiple relationships |
| Sparse / Local / Windowed | Long-sequence models | Reduce computational cost |
| Hierarchical | Documents | Multi-level understanding |
| Cross-Modal | Vision, audio, text | Fuse modalities |
| Relative / Rotary Positional | Transformer variants | Capture positional context |

Related Topics

  • Transformers - Architecture built on attention mechanisms
  • BERT - Encoder-only model using self-attention
  • Prompt Engineering - Working with attention-based LLMs

References


  1. Vaswani, A., et al. (2017). Attention Is All You Need. NeurIPS. https://arxiv.org/abs/1706.03762

  2. Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural Machine Translation by Jointly Learning to Align and Translate. arXiv:1409.0473. https://arxiv.org/abs/1409.0473

  3. Luong, M.-T., Pham, H., & Manning, C. D. (2015). Effective Approaches to Attention-based Neural Machine Translation. EMNLP. https://arxiv.org/abs/1508.04025