Transformers
Transformers are neural networks for sequence-to-sequence learning that dispense with recurrence and convolutions, relying instead on attention mechanisms to model dependencies across an entire sequence in parallel [1][2]. They originated in machine translation but now underpin most foundation models used for language, vision, and multimodal tasks. The canonical layout is an encoder-decoder stack with self-attention and multi-head attention at its core [1][3].
Model Families Overview
| Architecture | Examples | Primary Use Cases |
|---|---|---|
| Encoder-only | BERT, RoBERTa, DistilBERT | Classification, NER, extractive QA |
| Decoder-only | GPT, LLaMA, Claude | Text generation, code, chat |
| Encoder-decoder | T5, BART, mT5 | Translation, summarization, seq2seq |
Architecture Overview
Transformers address sequence transduction: mapping an input sequence to an output sequence. Unlike RNNs, they compute in parallel, enabling efficient training and better handling of long-range interactions via attention [1].
Encoder–decoder pattern
- The encoder reads the input and produces a sequence of vector representations (contextual embeddings).
- The decoder consumes those representations and iteratively generates the output sequence, one token at a time, conditioning on previously generated tokens (a minimal end-to-end sketch follows this list).
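For orientation, here is a minimal sketch of the encoder-decoder pattern using PyTorch's built-in nn.Transformer module. The shapes and hyperparameters are illustrative, and random tensors stand in for the embedded, position-encoded tokens described in the sections below.

```python
import torch
import torch.nn as nn

d_model, nhead = 512, 8                      # illustrative sizes (also used in the original paper)
model = nn.Transformer(d_model=d_model, nhead=nhead,
                       num_encoder_layers=6, num_decoder_layers=6)

# Random tensors stand in for embedded, position-encoded tokens.
src = torch.randn(10, 2, d_model)            # source sequence: (src_len, batch, d_model)
tgt = torch.randn(7, 2, d_model)             # shifted-right target: (tgt_len, batch, d_model)

# Causal mask so each target position attends only to earlier target positions.
tgt_mask = model.generate_square_subsequent_mask(7)

out = model(src, tgt, tgt_mask=tgt_mask)     # (tgt_len, batch, d_model)
```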
Encoder
The encoder transforms tokens into contextualized representations for downstream tasks [1][2].
Input embeddings
Tokens are mapped to dense vectors via a learned embedding matrix. These embeddings carry semantic information that the model refines layer by layer.
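A minimal sketch of the lookup step in PyTorch, assuming an already-tokenized input and illustrative vocabulary and model sizes:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 30000, 512            # illustrative sizes
embed = nn.Embedding(vocab_size, d_model)   # learned lookup table, one row per token

token_ids = torch.tensor([[17, 523, 88, 4]])   # a hypothetical tokenized sentence
x = embed(token_ids) * d_model ** 0.5          # (1, 4, d_model); the original paper
                                               # scales embeddings by sqrt(d_model)
```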
Positional encoding
Because there is no recurrence, order is injected by adding positional encoding vectors to token embeddings, typically using sinusoidal functions that let the model infer relative and absolute positions [1].
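A sketch of the sinusoidal variant from the original paper; learned positional embeddings are a common alternative in later models.

```python
import math
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Return a (max_len, d_model) matrix of sinusoidal position encodings."""
    position = torch.arange(max_len).unsqueeze(1)                     # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)    # even dimensions use sine
    pe[:, 1::2] = torch.cos(position * div_term)    # odd dimensions use cosine
    return pe

# Added elementwise to token embeddings of shape (seq_len, d_model).
pe = sinusoidal_positional_encoding(max_len=128, d_model=512)
```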
Stack of encoder layers
The encoder is a stack of identical layers (six in the original paper). Each layer has two sublayers: multi-head self-attention and a position-wise feed-forward network [1][3].
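PyTorch exposes this layer structure directly; a sketch with six layers and illustrative sizes:

```python
import torch
import torch.nn as nn

# One encoder layer = self-attention + position-wise feed-forward,
# each wrapped with a residual connection and layer normalization.
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048)
encoder = nn.TransformerEncoder(layer, num_layers=6)    # six identical layers

x = torch.randn(10, 2, 512)     # (seq_len, batch, d_model) with the default layout
contextual = encoder(x)         # same shape, now contextualized
```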
Multi-head self-attention
Given queries Q, keys K, and values V (learned linear projections of the same inputs), attention computes a score matrix via scaled dot products, applies softmax to obtain attention weights, and then combines the values V accordingly. Splitting the projections into h heads lets the model attend to different types of relationships in parallel (multi-head attention) [1][3][4]. When Q, K, and V come from the same sequence this is self-attention: each token can incorporate information from all other tokens in the sequence in a single operation.
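The core computation fits in a few lines. This is a sketch with an explicit heads dimension; the learned input projections and the final output projection of a full multi-head layer are omitted.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, heads, seq_len, d_k) -- already projected and split into heads
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))    # pairwise similarity
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))   # block disallowed positions
    weights = torch.softmax(scores, dim=-1)                     # attention weights, rows sum to 1
    return weights @ v                                          # weighted sum of values

# Illustrative shapes: batch 2, 8 heads, 10 tokens, d_k = 64 (i.e. 512 / 8)
q = k = v = torch.randn(2, 8, 10, 64)
out = scaled_dot_product_attention(q, k, v)    # (2, 8, 10, 64), one output per head
```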
Normalization and residual connections
Each sublayer is wrapped with a residual connection followed by layer normalization to stabilize and speed up training, and to support deeper stacks. The pattern is repeated after the feed-forward sublayer as well [1].
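A sketch of the post-norm wrapping used in the original paper (many later models move the normalization before the sublayer, i.e. pre-norm); the linear layer here is only a placeholder for a real sublayer.

```python
import torch
import torch.nn as nn

d_model = 512
norm = nn.LayerNorm(d_model)

def residual_block(x: torch.Tensor, sublayer) -> torch.Tensor:
    # Post-norm arrangement from the original paper: LayerNorm(x + Sublayer(x)).
    return norm(x + sublayer(x))

x = torch.randn(2, 10, d_model)
placeholder_sublayer = nn.Linear(d_model, d_model)   # stands in for attention or the FFN
y = residual_block(x, placeholder_sublayer)          # same shape as x
```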
Position-wise feed-forward network
Each position passes through an identical two-layer MLP with a nonlinearity (ReLU in the original work; GELU is common in later models). The outputs are then combined with residuals and normalization.
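A sketch with the sizes from the original paper (d_model = 512, inner dimension 2048):

```python
import torch
import torch.nn as nn

d_model, d_ff = 512, 2048     # original-paper sizes; later models vary

ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),   # expand
    nn.ReLU(),                  # GELU is common in later models
    nn.Linear(d_ff, d_model),   # project back to the model dimension
)

x = torch.randn(2, 10, d_model)   # (batch, seq_len, d_model)
y = ffn(x)                        # the same MLP is applied independently at every position
```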
Encoder output
The final encoder layer yields a sequence of vectors, each a context-rich representation of the corresponding input token. These vectors serve as keys and values for the decoder’s cross-attention.
Decoder
Output embeddings and positional encoding
Decoder inputs (previous outputs shifted right during training) pass through an embedding layer and the same style of positional encoding as the encoder [1].
Masked self-attention
The first attention sublayer masks future positions so predictions for position t can only depend on positions < t. This enforces auto-regressive generation while retaining fully parallel training across positions [1].
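The mask is simply a lower-triangular matrix; a sketch that can be passed as the `mask` argument of the attention function shown earlier:

```python
import torch

seq_len = 5
# Row t is the query position: it may attend to itself and earlier positions only.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
# tensor([[ True, False, False, False, False],
#         [ True,  True, False, False, False],
#         [ True,  True,  True, False, False],
#         [ True,  True,  True,  True, False],
#         [ True,  True,  True,  True,  True]])
```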
Encoder–decoder attention (cross-attention)
The decoder then attends over the encoder outputs using queries from the decoder and keys/values from the encoder. This aligns generated tokens with relevant source content [1][2].
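A sketch using PyTorch's nn.MultiheadAttention, with queries taken from the decoder and keys/values from the encoder output (shapes are illustrative):

```python
import torch
import torch.nn as nn

d_model, nhead = 512, 8
cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)

enc_out = torch.randn(2, 10, d_model)   # encoder outputs: provide keys and values
dec_h   = torch.randn(2, 7, d_model)    # decoder hidden states: provide queries

# Each decoder position gathers information from the whole source sequence.
out, weights = cross_attn(query=dec_h, key=enc_out, value=enc_out)
# out: (2, 7, d_model); weights: (2, 7, 10), averaged over heads by default
```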
Feed-forward, residuals, and normalization
As in the encoder, a position-wise feed-forward network is applied per token, with residual connections and layer normalization.
Output projection and softmax
A final linear layer projects decoder hidden states to vocabulary logits, followed by softmax to obtain next-token probabilities.
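A sketch of the projection and a greedy next-token pick; sizes are illustrative, and in the original paper this projection shares its weights with the input embedding matrix.

```python
import torch
import torch.nn as nn

d_model, vocab_size = 512, 30000          # illustrative sizes
to_logits = nn.Linear(d_model, vocab_size)

dec_out = torch.randn(2, 7, d_model)      # decoder hidden states: (batch, seq, d_model)
logits = to_logits(dec_out)               # (batch, seq, vocab_size)
probs = torch.softmax(logits, dim=-1)     # per-position next-token distribution

next_token = probs[:, -1, :].argmax(dim=-1)   # greedy choice at the latest position
```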
Model families and usage
- Encoder-only (BERT): bidirectional pretraining for understanding tasks such as classification, extractive QA, and token-level tagging [5][6].
- Decoder-only (GPT): auto-regressive generation for text synthesis, code, and instruction following [7].
- Encoder-decoder (T5): text-to-text pretraining unifying diverse tasks under a single interface [8].
- Conversational models (e.g., LaMDA) specialize the decoder-only pattern for dialogue safety and coherence [9]. A short usage sketch for these families follows this list.
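A usage sketch with the Hugging Face transformers library, assuming it is installed; the checkpoints named here are common public models chosen only for illustration.

```python
from transformers import pipeline

# Encoder-only model for a classification-style task.
classify = pipeline("sentiment-analysis",
                    model="distilbert-base-uncased-finetuned-sst-2-english")
print(classify("Transformers process entire sequences in parallel."))

# Decoder-only model for open-ended generation.
generate = pipeline("text-generation", model="gpt2")
print(generate("The transformer architecture", max_new_tokens=20))

# Encoder-decoder model for summarization.
summarize = pipeline("summarization", model="t5-small")
print(summarize("Transformers replace recurrence with attention, which lets them "
                "process all tokens of a sequence in parallel during training.",
                max_length=20))
```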
Why It Matters
- Parallelism: Training and inference exploit matrix multiplies over entire sequences.
- Long-range context: Attention routes information between positions regardless of how far apart they are in the sequence.
- Modularity: Components (self-attention, multi-head attention, positional encoding, layer normalization) can be scaled, swapped, or extended for domains beyond text.
- Transfer learning: Pre-trained transformers can be fine-tuned for specific tasks with minimal data.
Related Topics
- Attention Mechanisms - Deep dive into attention types and computations
- Prompt Engineering - Techniques for interacting with transformer-based LLMs
- BERT - Encoder-only architecture for understanding tasks
- Fine-Tuning - Adapting pre-trained transformers to specific tasks