Transformers
Transformers are neural networks for sequence-to-sequence learning that dispense with recurrence and convolutions, relying instead on attention mechanisms to model dependencies across an entire sequence in parallel [1][2]. They originated in machine translation but now underpin most foundation models used for language, vision, and multimodal tasks. The canonical layout is an encoder-decoder stack with self-attention and multi-head attention at its core [1][3].
Model Families Overview
| Architecture | Examples | Primary Use Cases |
|---|---|---|
| Encoder-only | BERT, RoBERTa, DistilBERT | Classification, NER, extractive QA |
| Decoder-only | GPT, LLaMA, Claude | Text generation, code, chat |
| Encoder-decoder | T5, BART, mT5 | Translation, summarization, seq2seq |
Architecture Overview
Transformers address sequence transduction: mapping an input sequence to an output sequence. Unlike RNNs, they compute in parallel, enabling efficient training and better handling of long-range interactions via attention [1].
Encoder–decoder pattern
- The encoder reads the input and produces a sequence of vector representations (contextual embeddings).
- The decoder consumes those representations and iteratively generates the output sequence, one token at a time, conditioning on previously generated tokens (a minimal end-to-end sketch follows this list).
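For orientation, here is a minimal sketch of the encoder-decoder pattern using PyTorch's built-in nn.Transformer module. The shapes and hyperparameters are illustrative, and random tensors stand in for the embedded, position-encoded tokens described in the sections below.

```python
import torch
import torch.nn as nn

d_model, nhead = 512, 8                      # illustrative sizes (also used in the original paper)
model = nn.Transformer(d_model=d_model, nhead=nhead,
                       num_encoder_layers=6, num_decoder_layers=6)

# Random tensors stand in for embedded, position-encoded tokens.
src = torch.randn(10, 2, d_model)            # source sequence: (src_len, batch, d_model)
tgt = torch.randn(7, 2, d_model)             # shifted-right target: (tgt_len, batch, d_model)

# Causal mask so each target position attends only to earlier target positions.
tgt_mask = model.generate_square_subsequent_mask(7)

out = model(src, tgt, tgt_mask=tgt_mask)     # (tgt_len, batch, d_model)
```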
Encoder
The encoder transforms tokens into contextualized representations for downstream tasks [1][2].
Input embeddings
Tokens are mapped to dense vectors via a learned embedding matrix. These embeddings carry semantic information that the model refines layer by layer.
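A minimal sketch of the lookup step in PyTorch, assuming an already-tokenized input and illustrative vocabulary and model sizes:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 30000, 512            # illustrative sizes
embed = nn.Embedding(vocab_size, d_model)   # learned lookup table, one row per token

token_ids = torch.tensor([[17, 523, 88, 4]])   # a hypothetical tokenized sentence
x = embed(token_ids) * d_model ** 0.5          # (1, 4, d_model); the original paper
                                               # scales embeddings by sqrt(d_model)
```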
Positional encoding
Because there is no recurrence, order is injected by adding positional encoding vectors to token embeddings, typically using sinusoidal functions that let the model infer relative and absolute positions [1].
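A sketch of the sinusoidal variant from the original paper; learned positional embeddings are a common alternative in later models.

```python
import math
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Return a (max_len, d_model) matrix of sinusoidal position encodings."""
    position = torch.arange(max_len).unsqueeze(1)                     # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)    # even dimensions use sine
    pe[:, 1::2] = torch.cos(position * div_term)    # odd dimensions use cosine
    return pe

# Added elementwise to token embeddings of shape (seq_len, d_model).
pe = sinusoidal_positional_encoding(max_len=128, d_model=512)
```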
Stack of encoder layers
The encoder is a stack of identical layers (six in the original paper). Each layer has two sublayers: multi-head self-attention and a position-wise feed-forward network [1][3].
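PyTorch exposes this layer structure directly; a sketch with six layers and illustrative sizes:

```python
import torch
import torch.nn as nn

# One encoder layer = self-attention + position-wise feed-forward,
# each wrapped with a residual connection and layer normalization.
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048)
encoder = nn.TransformerEncoder(layer, num_layers=6)    # six identical layers

x = torch.randn(10, 2, 512)     # (seq_len, batch, d_model) with the default layout
contextual = encoder(x)         # same shape, now contextualized
```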
Multi-head self-attention
Given queries Q, keys K, and values V (learned linear projections of the same inputs), attention computes a score matrix via scaled dot products, applies softmax to obtain attention weights, and then combines the values V accordingly. Splitting the projections into h heads lets the model attend to different types of relationships in parallel (multi-head attention) [1][3][4]. When Q, K, and V come from the same sequence this is self-attention: each token can incorporate information from all other tokens in the sequence in a single operation.
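The core computation fits in a few lines. This is a sketch with an explicit heads dimension; the learned input projections and the final output projection of a full multi-head layer are omitted.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, heads, seq_len, d_k) -- already projected and split into heads
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))    # pairwise similarity
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))   # block disallowed positions
    weights = torch.softmax(scores, dim=-1)                     # attention weights, rows sum to 1
    return weights @ v                                          # weighted sum of values

# Illustrative shapes: batch 2, 8 heads, 10 tokens, d_k = 64 (i.e. 512 / 8)
q = k = v = torch.randn(2, 8, 10, 64)
out = scaled_dot_product_attention(q, k, v)    # (2, 8, 10, 64), one output per head
```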
Normalization and residual connections
Each sublayer is wrapped with a residual connection followed by layer normalization to stabilize and speed up training, and to support deeper stacks. The pattern is repeated after the feed-forward sublayer as well [1].
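A sketch of the post-norm wrapping used in the original paper (many later models move the normalization before the sublayer, i.e. pre-norm); the linear layer here is only a placeholder for a real sublayer.

```python
import torch
import torch.nn as nn

d_model = 512
norm = nn.LayerNorm(d_model)

def residual_block(x: torch.Tensor, sublayer) -> torch.Tensor:
    # Post-norm arrangement from the original paper: LayerNorm(x + Sublayer(x)).
    return norm(x + sublayer(x))

x = torch.randn(2, 10, d_model)
placeholder_sublayer = nn.Linear(d_model, d_model)   # stands in for attention or the FFN
y = residual_block(x, placeholder_sublayer)          # same shape as x
```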
Position-wise feed-forward network
Each position passes through an identical two-layer MLP with a nonlinearity (ReLU in the original work; GELU is common in later models). The outputs are then combined with residuals and normalization.
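A sketch with the sizes from the original paper (d_model = 512, inner dimension 2048):

```python
import torch
import torch.nn as nn

d_model, d_ff = 512, 2048     # original-paper sizes; later models vary

ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),   # expand
    nn.ReLU(),                  # GELU is common in later models
    nn.Linear(d_ff, d_model),   # project back to the model dimension
)

x = torch.randn(2, 10, d_model)   # (batch, seq_len, d_model)
y = ffn(x)                        # the same MLP is applied independently at every position
```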
Encoder output
The final encoder layer yields a sequence of vectors, each a context-rich representation of the corresponding input token. These vectors serve as keys and values for the decoder’s cross-attention.
Decoder
Output embeddings and positional encoding
Decoder inputs (previous outputs shifted right during training) pass through an embedding layer and the same style of positional encoding as the encoder [1].
Masked self-attention
The first attention sublayer masks future positions so predictions for position t can only depend on positions < t. This enforces auto-regressive generation while retaining fully parallel training across positions [1].
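The mask is simply a lower-triangular matrix; a sketch that can be passed as the `mask` argument of the attention function shown earlier:

```python
import torch

seq_len = 5
# Row t is the query position: it may attend to itself and earlier positions only.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
# tensor([[ True, False, False, False, False],
#         [ True,  True, False, False, False],
#         [ True,  True,  True, False, False],
#         [ True,  True,  True,  True, False],
#         [ True,  True,  True,  True,  True]])
```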
Encoder–decoder attention (cross-attention)
The decoder then attends over the encoder outputs using queries from the decoder and keys/values from the encoder. This aligns generated tokens with relevant source content [1][2].
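A sketch using PyTorch's nn.MultiheadAttention, with queries taken from the decoder and keys/values from the encoder output (shapes are illustrative):

```python
import torch
import torch.nn as nn

d_model, nhead = 512, 8
cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)

enc_out = torch.randn(2, 10, d_model)   # encoder outputs: provide keys and values
dec_h   = torch.randn(2, 7, d_model)    # decoder hidden states: provide queries

# Each decoder position gathers information from the whole source sequence.
out, weights = cross_attn(query=dec_h, key=enc_out, value=enc_out)
# out: (2, 7, d_model); weights: (2, 7, 10), averaged over heads by default
```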
Feed-forward, residuals, and normalization
As in the encoder, a position-wise feed-forward network is applied per token, with residual connections and layer normalization.
Output projection and softmax
A final linear layer projects decoder hidden states to vocabulary logits, followed by softmax to obtain next-token probabilities.
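A sketch of the projection and a greedy next-token pick; sizes are illustrative, and in the original paper this projection shares its weights with the input embedding matrix.

```python
import torch
import torch.nn as nn

d_model, vocab_size = 512, 30000          # illustrative sizes
to_logits = nn.Linear(d_model, vocab_size)

dec_out = torch.randn(2, 7, d_model)      # decoder hidden states: (batch, seq, d_model)
logits = to_logits(dec_out)               # (batch, seq, vocab_size)
probs = torch.softmax(logits, dim=-1)     # per-position next-token distribution

next_token = probs[:, -1, :].argmax(dim=-1)   # greedy choice at the latest position
```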
Model families and usage
- Encoder-only (BERT): bidirectional pretraining for understanding tasks such as classification, extractive QA, and token-level tagging [5][6].
- Decoder-only (GPT): auto-regressive generation for text synthesis, code, and instruction following [7].
- Encoder-decoder (T5): text-to-text pretraining unifying diverse tasks under a single interface [8].
- Conversational models (e.g., LaMDA) specialize the decoder-only pattern for dialogue safety and coherence [9]. A short usage sketch for these families follows this list.
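A usage sketch with the Hugging Face transformers library, assuming it is installed; the checkpoints named here are common public models chosen only for illustration.

```python
from transformers import pipeline

# Encoder-only model for a classification-style task.
classify = pipeline("sentiment-analysis",
                    model="distilbert-base-uncased-finetuned-sst-2-english")
print(classify("Transformers process entire sequences in parallel."))

# Decoder-only model for open-ended generation.
generate = pipeline("text-generation", model="gpt2")
print(generate("The transformer architecture", max_new_tokens=20))

# Encoder-decoder model for summarization.
summarize = pipeline("summarization", model="t5-small")
print(summarize("Transformers replace recurrence with attention, which lets them "
                "process all tokens of a sequence in parallel during training.",
                max_length=20))
```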
Why It Matters
- Parallelism: Training and inference exploit matrix multiplies over entire sequences.
- Long-range context: Attention routes information between positions regardless of how far apart they are in the sequence.
- Modularity: Components (self-attention, multi-head attention, positional encoding, layer normalization) can be scaled, swapped, or extended for domains beyond text.
- Transfer learning: Pre-trained transformers can be fine-tuned for specific tasks with minimal data.
Related Topics
- Attention Mechanisms - Deep dive into attention types and computations
- Prompt Engineering - Techniques for interacting with transformer-based LLMs
- BERT - Encoder-only architecture for understanding tasks
- Fine-Tuning - Adapting pre-trained transformers to specific tasks