BERT: Bidirectional Encoder Representations from Transformers
BERT, introduced by Google AI in 2018, represents a paradigm shift in natural language processing. Unlike previous models that processed text left-to-right or right-to-left, BERT simultaneously considers context from both directions, enabling a deeper understanding of language semantics.[^1]
Core Innovation: Bidirectional Context
Traditional language models like GPT use unidirectional context (left-to-right), which limits their understanding. For example, in the sentence “I went to the bank to deposit money,” a left-to-right model processing “bank” cannot see “deposit money” to disambiguate meaning.
BERT solves this by using a masked language modeling approach that allows it to look at the entire sentence simultaneously:
Unidirectional: [I] [went] [to] [the] [bank] → predicts next word
Bidirectional: [I] [went] [to] [the] [MASK] [to] [deposit] [money] → predicts "bank"
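This bidirectional fill-in-the-blank behavior is easy to demonstrate. The sketch below uses the Hugging Face `fill-mask` pipeline (a tooling choice, not part of the original paper) to let a pre-trained BERT predict the masked word from both sides of the context:

```python
from transformers import pipeline

# Masked-word prediction with a pre-trained BERT checkpoint
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT sees "deposit money" to the right of [MASK], which steers the
# prediction toward the financial sense of "bank"
for pred in unmasker("I went to the [MASK] to deposit money.")[:3]:
    print(pred["token_str"], round(pred["score"], 3))
```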
Architecture Overview
BERT is built entirely on the Transformer encoder architecture. The original paper introduced two model sizes:
| Model | Layers | Hidden Size | Attention Heads | Parameters |
|---|---|---|---|---|
| BERT-Base | 12 | 768 | 12 | 110M |
| BERT-Large | 24 | 1024 | 16 | 340M |
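These sizes map directly onto configuration hyperparameters. As a minimal sketch (using the Hugging Face `BertConfig` class, an implementation detail rather than something from the paper), BERT-Base can be instantiated like this:

```python
from transformers import BertConfig, BertModel

# BERT-Base hyperparameters from the table above
config = BertConfig(
    num_hidden_layers=12,    # Layers
    hidden_size=768,         # Hidden Size
    num_attention_heads=12,  # Attention Heads
    vocab_size=30522,        # WordPiece vocabulary of bert-base-uncased
)

# Randomly initialized model with the BERT-Base shape (~110M parameters)
model = BertModel(config)
print(sum(p.numel() for p in model.parameters()))
```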
Input Representation
BERT’s input is the sum of three embeddings:
- Token Embeddings: WordPiece tokenization with a 30,000 token vocabulary
- Segment Embeddings: Distinguishes between sentence A and sentence B for sentence-pair tasks
- Position Embeddings: Learned positional encodings (max 512 tokens)
Special tokens:
- [CLS]: Classification token, placed at the beginning
- [SEP]: Separator token between sentences
- [MASK]: Mask token for MLM pre-training
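To make the input layout concrete, the sketch below (using the Hugging Face tokenizer, an implementation detail rather than part of the paper) shows how a sentence pair is wrapped in special tokens and assigned segment IDs:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Encode a sentence pair as [CLS] A [SEP] B [SEP]
encoded = tokenizer("How old are you?", "I am six years old.")

print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# ['[CLS]', 'how', 'old', 'are', 'you', '?', '[SEP]', 'i', 'am', 'six', 'years', 'old', '.', '[SEP]']

print(encoded["token_type_ids"])  # segment IDs: 0 for sentence A, 1 for sentence B
# [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
```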
Pre-Training Objectives
BERT is pre-trained on two unsupervised tasks using large text corpora (BooksCorpus and English Wikipedia).
1. Masked Language Modeling (MLM)
Randomly mask 15% of input tokens and predict them:
- 80% of the time: Replace with [MASK]
- 10% of the time: Replace with a random token
- 10% of the time: Keep the original token
This mix mitigates the mismatch between pre-training and fine-tuning (where [MASK] tokens never appear) and forces the model to maintain a contextual representation of every input token.
Training objective:
$$ \mathcal{L}_{\text{MLM}} = -\sum_{i \in \mathcal{M}} \log P(x_i \mid x_{\backslash \mathcal{M}}) $$
Where $\mathcal{M}$ is the set of masked positions.
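The 80/10/10 corruption scheme can be sketched in a few lines. This is a simplified illustration (it ignores special tokens, padding, and whole-word masking), not the actual pre-training code:

```python
import random

def mask_tokens(token_ids, vocab_size, mask_id, mask_prob=0.15):
    """Apply BERT-style MLM corruption; returns (corrupted_ids, labels)."""
    corrupted, labels = [], []
    for tid in token_ids:
        if random.random() < mask_prob:      # select ~15% of positions
            labels.append(tid)               # the model must predict the original token here
            r = random.random()
            if r < 0.8:
                corrupted.append(mask_id)    # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted.append(random.randrange(vocab_size))  # 10%: random token
            else:
                corrupted.append(tid)        # 10%: keep the original token
        else:
            corrupted.append(tid)
            labels.append(-100)              # conventional "ignore" index for the loss
    return corrupted, labels
```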
2. Next Sentence Prediction (NSP)
A binary classification task to predict whether sentence B follows sentence A:
- 50% of training pairs: B is the actual next sentence (label: IsNext)
- 50% of training pairs: B is a random sentence (label: NotNext)
This helps BERT understand sentence-level relationships for downstream tasks like question answering.[^2]
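The NSP head can be exercised directly through the Hugging Face `BertForNextSentencePrediction` class; the sketch below (library usage, not the original training code) scores whether one sentence plausibly follows another:

```python
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

sentence_a = "I went to the bank."
sentence_b = "I deposited some money."  # candidate next sentence

inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape [1, 2]: index 0 = IsNext, index 1 = NotNext

print(torch.softmax(logits, dim=-1))
```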
Fine-Tuning for Downstream Tasks
One of BERT’s key strengths is its ability to be fine-tuned for specific tasks with minimal architectural changes:
Task Types and Approaches
| Task Type | Example | Output Source |
|---|---|---|
| Single Sentence Classification | Sentiment Analysis | [CLS] token representation |
| Sentence Pair Classification | Natural Language Inference | [CLS] token representation |
| Question Answering | SQuAD | Start/end token positions |
| Token Classification | Named Entity Recognition | Per-token representations |
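In the Hugging Face library (one common implementation, not the only one), each row of this table corresponds to a separate task-specific head class built on the same encoder:

```python
from transformers import (
    BertForSequenceClassification,  # single-sentence / sentence-pair classification via [CLS]
    BertForQuestionAnswering,       # SQuAD-style start/end span prediction
    BertForTokenClassification,     # per-token labels, e.g. named entity recognition
)

# Each class loads the same pre-trained encoder and adds a randomly
# initialized output layer that is trained during fine-tuning.
classifier = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
qa_model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")
ner_model = BertForTokenClassification.from_pretrained("bert-base-uncased", num_labels=9)
```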
Fine-Tuning Process
- Initialize with pre-trained BERT weights
- Add a task-specific output layer (typically a single linear layer)
- Fine-tune all parameters end-to-end on labeled data
- Typical training: 2-4 epochs, learning rate 2e-5 to 5e-5 (a minimal loop is sketched below)
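A minimal sketch of such a loop with PyTorch, using a toy two-example batch in place of a real dataset (the texts and labels here are illustrative assumptions):

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Toy labeled batch standing in for a real DataLoader
batch = tokenizer(["great movie", "terrible plot"], padding=True, return_tensors="pt")
batch["labels"] = torch.tensor([1, 0])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # within the 2e-5 to 5e-5 range

model.train()
for epoch in range(3):        # typically 2-4 epochs over the full dataset
    optimizer.zero_grad()
    outputs = model(**batch)  # the head computes cross-entropy when labels are provided
    outputs.loss.backward()
    optimizer.step()
```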
BERT Variants and Extensions
The success of BERT spawned numerous variants addressing its limitations:
| Variant | Innovation | Key Improvement |
|---|---|---|
| RoBERTa | Robust training | Removed NSP, larger batches, more data |
| ALBERT | Parameter efficiency | Factorized embeddings, cross-layer sharing |
| DistilBERT | Model compression | 40% smaller, 60% faster, retains 97% performance |
| SpanBERT | Span masking | Better for extractive tasks |
| XLNet | Permutation language modeling | Captures bidirectional context without masks |
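Because most of these variants keep BERT's interface in Hugging Face Transformers, swapping one in is usually a one-line change. A sketch assuming the `distilbert-base-uncased` checkpoint:

```python
from transformers import AutoTokenizer, AutoModel

# Swap "bert-base-uncased" for a variant checkpoint; the surrounding code is unchanged
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

inputs = tokenizer("BERT variants share a common interface.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # DistilBERT: 6 layers, hidden size 768
```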
Multilingual BERT (mBERT)
BERT trained on 104 languages simultaneously, enabling cross-lingual transfer learning where a model fine-tuned on English can perform well on other languages.[^3]
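A short sketch of loading the multilingual checkpoint (`bert-base-multilingual-cased`) and encoding text in two languages with one shared tokenizer and model:

```python
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertModel.from_pretrained("bert-base-multilingual-cased")

# One vocabulary and one set of weights cover all 104 training languages
for text in ["The weather is nice today.", "Das Wetter ist heute schön."]:
    outputs = model(**tokenizer(text, return_tensors="pt"))
    print(text, outputs.last_hidden_state.shape)
```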
Practical Implementation
Using BERT with Hugging Face Transformers
```python
from transformers import BertTokenizer, BertModel

# Load pre-trained model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Tokenize input
text = "BERT revolutionized natural language processing."
inputs = tokenizer(text, return_tensors="pt")

# Get embeddings
outputs = model(**inputs)
last_hidden_states = outputs.last_hidden_state  # Shape: [batch, seq_len, hidden_size]
pooled_output = outputs.pooler_output  # Shape: [batch, hidden_size]
```
Computational Considerations
- Memory: BERT’s self-attention has $O(n^2)$ memory complexity
- Maximum sequence length: 512 tokens (limited by the learned position embeddings)
- Inference time: Significantly slower than lighter models such as LSTMs or bag-of-words classifiers
- GPU requirements: Fine-tuning BERT-Base typically requires at least 12 GB of VRAM
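The 512-token limit is usually enforced at tokenization time; below is a minimal sketch of truncating long inputs with the Hugging Face tokenizer (chunking or long-context variants are the alternatives for full documents):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

long_text = "BERT " * 1000  # stand-in for a long document

# Truncate to the 512-token maximum (including [CLS] and [SEP])
inputs = tokenizer(long_text, truncation=True, max_length=512, return_tensors="pt")
print(inputs["input_ids"].shape)  # torch.Size([1, 512])
```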
Comparison with Other Models
| Feature | BERT | GPT | ELMo |
|---|---|---|---|
| Architecture | Transformer Encoder | Transformer Decoder | BiLSTM |
| Context | Bidirectional | Unidirectional (left-to-right) | Bidirectional (concatenated) |
| Pre-training | MLM + NSP | Language Modeling | Language Modeling |
| Fine-tuning | End-to-end | Few-shot/Zero-shot | Feature extraction |
| Best For | Understanding tasks | Generation tasks | Feature extraction |
Limitations
- Fixed context window: Maximum 512 tokens limits processing of long documents
- Computational cost: Large memory and compute requirements
- Pre-training/fine-tuning gap: [MASK] tokens never appear during fine-tuning
- NSP controversy: Later research (RoBERTa) showed NSP may not be necessary
- Static knowledge: Once trained, the model does not adapt to new domains or vocabulary without further fine-tuning
Related Topics
- Transformers - The foundational architecture
- Attention Mechanisms - Core component of BERT
- Word2Vec - Earlier word embedding approach
- Vector Embeddings - General embedding concepts
- Semantic Search - Common BERT application
References
[^1]: Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL-HLT 2019. https://arxiv.org/abs/1810.04805
[^2]: Hugging Face. (2024). BERT Documentation. Hugging Face Transformers. https://huggingface.co/docs/transformers/model_doc/bert
[^3]: Pires, T., Schlinger, E., & Garrette, D. (2019). How Multilingual is Multilingual BERT? ACL 2019. https://arxiv.org/abs/1906.01502