BERT: Bidirectional Encoder Representations from Transformers
BERT, introduced by Google AI in 2018, represents a paradigm shift in natural language processing. Unlike previous models that processed text left-to-right or right-to-left, BERT simultaneously considers context from both directions, enabling a deeper understanding of language semantics.[^1]
Core Innovation: Bidirectional Context
Traditional language models like GPT use unidirectional context (left-to-right), which limits their understanding. For example, in the sentence “I went to the bank to deposit money,” a left-to-right model processing “bank” cannot see “deposit money” to disambiguate meaning.
BERT solves this by using a masked language modeling approach that allows it to look at the entire sentence simultaneously:
Unidirectional: [I] [went] [to] [the] [bank] → predicts next word
Bidirectional: [I] [went] [to] [the] [MASK] [to] [deposit] [money] → predicts "bank"
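This bidirectional fill-in-the-blank behavior is easy to demonstrate. The sketch below uses the Hugging Face `fill-mask` pipeline (a tooling choice, not part of the original paper) to let a pre-trained BERT predict the masked word from both sides of the context:

```python
from transformers import pipeline

# Masked-word prediction with a pre-trained BERT checkpoint
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT sees "deposit money" to the right of [MASK], which steers the
# prediction toward the financial sense of "bank"
for pred in unmasker("I went to the [MASK] to deposit money.")[:3]:
    print(pred["token_str"], round(pred["score"], 3))
```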
Architecture Overview
BERT is built entirely on the Transformer encoder architecture. The original paper introduced two model sizes:
| Model | Layers | Hidden Size | Attention Heads | Parameters |
|---|---|---|---|---|
| BERT-Base | 12 | 768 | 12 | 110M |
| BERT-Large | 24 | 1024 | 16 | 340M |
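These sizes map directly onto configuration hyperparameters. As a minimal sketch (using the Hugging Face `BertConfig` class, an implementation detail rather than something from the paper), BERT-Base can be instantiated like this:

```python
from transformers import BertConfig, BertModel

# BERT-Base hyperparameters from the table above
config = BertConfig(
    num_hidden_layers=12,    # Layers
    hidden_size=768,         # Hidden Size
    num_attention_heads=12,  # Attention Heads
    vocab_size=30522,        # WordPiece vocabulary of bert-base-uncased
)

# Randomly initialized model with the BERT-Base shape (~110M parameters)
model = BertModel(config)
print(sum(p.numel() for p in model.parameters()))
```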
Input Representation
BERT’s input is the sum of three embeddings:
- Token Embeddings: WordPiece tokenization with a 30,000 token vocabulary
- Segment Embeddings: Distinguishes between sentence A and sentence B for sentence-pair tasks
- Position Embeddings: Learned positional encodings (max 512 tokens)
Special tokens:
- [CLS]: Classification token, placed at the beginning
- [SEP]: Separator token between sentences
- [MASK]: Mask token for MLM pre-training
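To make the input layout concrete, the sketch below (using the Hugging Face tokenizer, an implementation detail rather than part of the paper) shows how a sentence pair is wrapped in special tokens and assigned segment IDs:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Encode a sentence pair as [CLS] A [SEP] B [SEP]
encoded = tokenizer("How old are you?", "I am six years old.")

print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# ['[CLS]', 'how', 'old', 'are', 'you', '?', '[SEP]', 'i', 'am', 'six', 'years', 'old', '.', '[SEP]']

print(encoded["token_type_ids"])  # segment IDs: 0 for sentence A, 1 for sentence B
# [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
```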
Pre-Training Objectives
BERT is pre-trained on two unsupervised tasks using large text corpora (BooksCorpus and English Wikipedia).
1. Masked Language Modeling (MLM)
Randomly mask 15% of input tokens and predict them:
- 80% of the time: Replace with [MASK]
- 10% of the time: Replace with a random token
- 10% of the time: Keep the original token
This mix mitigates the mismatch between pre-training and fine-tuning (where [MASK] tokens never appear) and forces the model to maintain a contextual representation of every input token.
Training objective:
$$ \mathcal{L}_{\text{MLM}} = -\sum_{i \in \mathcal{M}} \log P(x_i \mid x_{\backslash \mathcal{M}}) $$
Where $\mathcal{M}$ is the set of masked positions.
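The 80/10/10 corruption scheme can be sketched in a few lines. This is a simplified illustration (it ignores special tokens, padding, and whole-word masking), not the actual pre-training code:

```python
import random

def mask_tokens(token_ids, vocab_size, mask_id, mask_prob=0.15):
    """Apply BERT-style MLM corruption; returns (corrupted_ids, labels)."""
    corrupted, labels = [], []
    for tid in token_ids:
        if random.random() < mask_prob:      # select ~15% of positions
            labels.append(tid)               # the model must predict the original token here
            r = random.random()
            if r < 0.8:
                corrupted.append(mask_id)    # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted.append(random.randrange(vocab_size))  # 10%: random token
            else:
                corrupted.append(tid)        # 10%: keep the original token
        else:
            corrupted.append(tid)
            labels.append(-100)              # conventional "ignore" index for the loss
    return corrupted, labels
```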
2. Next Sentence Prediction (NSP)
A binary classification task to predict whether sentence B follows sentence A:
- 50% of training pairs: B is the actual next sentence (label: IsNext)
- 50% of training pairs: B is a random sentence (label: NotNext)
This helps BERT understand sentence-level relationships for downstream tasks like question answering.[^2]
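The NSP head can be exercised directly through the Hugging Face `BertForNextSentencePrediction` class; the sketch below (library usage, not the original training code) scores whether one sentence plausibly follows another:

```python
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

sentence_a = "I went to the bank."
sentence_b = "I deposited some money."  # candidate next sentence

inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape [1, 2]: index 0 = IsNext, index 1 = NotNext

print(torch.softmax(logits, dim=-1))
```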
Fine-Tuning for Downstream Tasks
One of BERT’s key strengths is its ability to be fine-tuned for specific tasks with minimal architectural changes:
Task Types and Approaches
| Task Type | Example | Output Source |
|---|---|---|
| Single Sentence Classification | Sentiment Analysis | [CLS] token representation |
| Sentence Pair Classification | Natural Language Inference | [CLS] token representation |
| Question Answering | SQuAD | Start/end token positions |
| Token Classification | Named Entity Recognition | Per-token representations |
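In the Hugging Face library (one common implementation, not the only one), each row of this table corresponds to a separate task-specific head class built on the same encoder:

```python
from transformers import (
    BertForSequenceClassification,  # single-sentence / sentence-pair classification via [CLS]
    BertForQuestionAnswering,       # SQuAD-style start/end span prediction
    BertForTokenClassification,     # per-token labels, e.g. named entity recognition
)

# Each class loads the same pre-trained encoder and adds a randomly
# initialized output layer that is trained during fine-tuning.
classifier = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
qa_model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")
ner_model = BertForTokenClassification.from_pretrained("bert-base-uncased", num_labels=9)
```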
Fine-Tuning Process
- Initialize with pre-trained BERT weights
- Add a task-specific output layer (typically a single linear layer)
- Fine-tune all parameters end-to-end on labeled data
- Typical training: 2-4 epochs, learning rate 2e-5 to 5e-5 (a minimal loop is sketched below)
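A minimal sketch of such a loop with PyTorch, using a toy two-example batch in place of a real dataset (the texts and labels here are illustrative assumptions):

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Toy labeled batch standing in for a real DataLoader
batch = tokenizer(["great movie", "terrible plot"], padding=True, return_tensors="pt")
batch["labels"] = torch.tensor([1, 0])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # within the 2e-5 to 5e-5 range

model.train()
for epoch in range(3):        # typically 2-4 epochs over the full dataset
    optimizer.zero_grad()
    outputs = model(**batch)  # the head computes cross-entropy when labels are provided
    outputs.loss.backward()
    optimizer.step()
```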
BERT Variants and Extensions
The success of BERT spawned numerous variants addressing its limitations:
| Variant | Innovation | Key Improvement |
|---|---|---|
| RoBERTa | Robust training | Removed NSP, larger batches, more data |
| ALBERT | Parameter efficiency | Factorized embeddings, cross-layer sharing |
| DistilBERT | Model compression | 40% smaller, 60% faster, retains 97% performance |
| SpanBERT | Span masking | Better for extractive tasks |
| XLNet | Permutation language modeling | Captures bidirectional context without masks |
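Because most of these variants keep BERT's interface in Hugging Face Transformers, swapping one in is usually a one-line change. A sketch assuming the `distilbert-base-uncased` checkpoint:

```python
from transformers import AutoTokenizer, AutoModel

# Swap "bert-base-uncased" for a variant checkpoint; the surrounding code is unchanged
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

inputs = tokenizer("BERT variants share a common interface.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # DistilBERT: 6 layers, hidden size 768
```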
Multilingual BERT (mBERT)
BERT trained on 104 languages simultaneously, enabling cross-lingual transfer learning where a model fine-tuned on English can perform well on other languages.[^3]
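A short sketch of loading the multilingual checkpoint (`bert-base-multilingual-cased`) and encoding text in two languages with one shared tokenizer and model:

```python
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertModel.from_pretrained("bert-base-multilingual-cased")

# One vocabulary and one set of weights cover all 104 training languages
for text in ["The weather is nice today.", "Das Wetter ist heute schön."]:
    outputs = model(**tokenizer(text, return_tensors="pt"))
    print(text, outputs.last_hidden_state.shape)
```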
Practical Implementation
Using BERT with Hugging Face Transformers
```python
from transformers import BertTokenizer, BertModel

# Load pre-trained model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Tokenize input
text = "BERT revolutionized natural language processing."
inputs = tokenizer(text, return_tensors="pt")

# Get embeddings
outputs = model(**inputs)
last_hidden_states = outputs.last_hidden_state  # Shape: [batch, seq_len, hidden_size]
pooled_output = outputs.pooler_output  # Shape: [batch, hidden_size]
```
Computational Considerations
- Memory: BERT’s self-attention has $O(n^2)$ memory complexity
- Maximum sequence length: 512 tokens (limited by the learned position embeddings)
- Inference time: Significantly slower than lighter models such as LSTMs or bag-of-words classifiers
- GPU requirements: Fine-tuning BERT-Base typically requires at least 12 GB of VRAM
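The 512-token limit is usually enforced at tokenization time; below is a minimal sketch of truncating long inputs with the Hugging Face tokenizer (chunking or long-context variants are the alternatives for full documents):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

long_text = "BERT " * 1000  # stand-in for a long document

# Truncate to the 512-token maximum (including [CLS] and [SEP])
inputs = tokenizer(long_text, truncation=True, max_length=512, return_tensors="pt")
print(inputs["input_ids"].shape)  # torch.Size([1, 512])
```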
Comparison with Other Models
| Feature | BERT | GPT | ELMo |
|---|---|---|---|
| Architecture | Transformer Encoder | Transformer Decoder | BiLSTM |
| Context | Bidirectional | Unidirectional (left-to-right) | Bidirectional (concatenated) |
| Pre-training | MLM + NSP | Language Modeling | Language Modeling |
| Fine-tuning | End-to-end | Few-shot/Zero-shot | Feature extraction |
| Best For | Understanding tasks | Generation tasks | Feature extraction |
Limitations
- Fixed context window: Maximum 512 tokens limits processing of long documents
- Computational cost: Large memory and compute requirements
- Pre-training/fine-tuning gap: [MASK] tokens never appear during fine-tuning
- NSP controversy: Later research (RoBERTa) showed NSP may not be necessary
- Static knowledge: Once trained, the model does not adapt to new domains or vocabulary without further fine-tuning
Related Topics
- Transformers - The foundational architecture
- Attention Mechanisms - Core component of BERT
- Word2Vec - Earlier word embedding approach
- Vector Embeddings - General embedding concepts
- Semantic Search - Common BERT application
References
[^1]: Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL-HLT 2019. https://arxiv.org/abs/1810.04805
[^2]: Hugging Face. (2024). BERT Documentation. Hugging Face Transformers. https://huggingface.co/docs/transformers/model_doc/bert
[^3]: Pires, T., Schlinger, E., & Garrette, D. (2019). How Multilingual is Multilingual BERT? ACL 2019. https://arxiv.org/abs/1906.01502