BERT: Bidirectional Encoder Representations from Transformers

BERT, introduced by Google AI in 2018, represents a paradigm shift in natural language processing. Unlike previous models that processed text left-to-right or right-to-left, BERT simultaneously considers context from both directions, enabling a deeper understanding of language semantics.[1]


Core Innovation: Bidirectional Context

Traditional language models like GPT use unidirectional context (left-to-right), which limits their understanding. For example, in the sentence “I went to the bank to deposit money,” a left-to-right model processing “bank” cannot see “deposit money” to disambiguate meaning.

BERT solves this by using a masked language modeling approach that allows it to look at the entire sentence simultaneously:

Unidirectional: [I] [went] [to] [the] [bank] → predicts next word
Bidirectional:  [I] [went] [to] [the] [MASK] [to] [deposit] [money] → predicts "bank"

Architecture Overview

BERT is built entirely on the Transformer encoder architecture. The original paper introduced two model sizes:

Model        Layers   Hidden Size   Attention Heads   Parameters
BERT-Base    12       768           12                110M
BERT-Large   24       1024          16                340M
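
These sizes can be checked directly against the published checkpoints; a minimal sketch reading the configuration of the Base checkpoint with the Hugging Face library used later in this article (the printed values match the table above):

from transformers import BertConfig

# Read the architecture hyperparameters of the BERT-Base checkpoint
config = BertConfig.from_pretrained('bert-base-uncased')
print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)  # 12 768 12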

Input Representation

BERT’s input is the sum of three embeddings:

  1. Token Embeddings: WordPiece tokenization with a 30,000 token vocabulary
  2. Segment Embeddings: Distinguish between sentence A and sentence B for sentence-pair tasks
  3. Position Embeddings: Learned positional encodings (max 512 tokens)

Special tokens:

  • [CLS]: Classification token, placed at the beginning
  • [SEP]: Separator token between sentences
  • [MASK]: Mask token for MLM pre-training
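
As a quick illustration, the Hugging Face tokenizer used in the implementation section below inserts these special tokens and builds the segment IDs automatically; a minimal sketch with two arbitrary example sentences:

from transformers import BertTokenizer

# WordPiece tokenizer with BERT-Base's ~30,000-token vocabulary
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Encoding a sentence pair adds [CLS] and [SEP] and assigns segment IDs
encoded = tokenizer("I went to the bank.", "I deposited my paycheck.")

print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))  # [CLS] ... [SEP] ... [SEP]
print(encoded["token_type_ids"])                              # 0 for sentence A, 1 for sentence B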

Pre-Training Objectives

BERT is pre-trained on two unsupervised tasks using large text corpora (BooksCorpus and English Wikipedia).

1. Masked Language Modeling (MLM)

BERT randomly selects 15% of the input tokens for prediction and corrupts each selected token as follows:

  • 80% of the time: Replace with [MASK]
  • 10% of the time: Replace with a random token
  • 10% of the time: Keep the original token

This prevents the model from simply learning to ignore [MASK] tokens.
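
A minimal sketch of this 80/10/10 corruption rule in plain Python (the function name and arguments are illustrative, not BERT's actual pre-training code; labels of -100 mark positions that are not predicted):

import random

def mask_tokens(token_ids, mask_token_id, vocab_size, mask_prob=0.15):
    """BERT-style MLM corruption: returns (corrupted_ids, labels).

    labels holds the original id at each selected position and -100 elsewhere,
    so a cross-entropy loss with ignore_index=-100 scores only masked positions.
    """
    corrupted = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() >= mask_prob:
            continue                                     # ~85%: not selected, not predicted
        labels[i] = tok                                  # selected for prediction
        r = random.random()
        if r < 0.8:
            corrupted[i] = mask_token_id                 # 80% of selected: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = random.randrange(vocab_size)  # 10% of selected: random token
        # remaining 10% of selected: keep the original token
    return corrupted, labels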

Training objective:

$$ \mathcal{L}_{\text{MLM}} = -\sum_{i \in \mathcal{M}} \log P(x_i | x_{\backslash \mathcal{M}}) $$

Where $\mathcal{M}$ is the set of masked positions.
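
In PyTorch this sum is typically implemented as a token-level cross-entropy that ignores non-masked positions, e.g. via the -100 labels produced by the sketch above; an illustrative snippet with random logits and hypothetical token ids:

import torch
import torch.nn.functional as F

# Toy shapes: logits from the MLM head, labels with original ids at masked positions
batch, seq_len, vocab_size = 2, 8, 30000
logits = torch.randn(batch, seq_len, vocab_size)
labels = torch.full((batch, seq_len), -100)   # -100 = not a masked position
labels[0, 3] = 1037                           # hypothetical masked token ids
labels[1, 5] = 2015

# Cross-entropy restricted to masked positions (matches the MLM sum above)
loss = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1), ignore_index=-100)
print(loss.item())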

2. Next Sentence Prediction (NSP)

A binary classification task to predict whether sentence B follows sentence A:

  • 50% of training pairs: B is the actual next sentence (label: IsNext)
  • 50% of training pairs: B is a random sentence (label: NotNext)

This helps BERT understand sentence-level relationships for downstream tasks like question answering.[2]
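
The released checkpoints still contain the NSP head, so the objective can be probed directly via the Hugging Face BertForNextSentencePrediction class (a minimal sketch; the two sentences are arbitrary):

import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')

inputs = tokenizer("I went to the bank.", "I deposited my paycheck.", return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits   # shape [1, 2]

print(logits.softmax(dim=-1))         # index 0 = IsNext, index 1 = NotNext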


Fine-Tuning for Downstream Tasks

One of BERT’s key strengths is its ability to be fine-tuned for specific tasks with minimal architectural changes:

Task Types and Approaches

Task Type                        Example                      Output Source
Single Sentence Classification   Sentiment Analysis           [CLS] token representation
Sentence Pair Classification     Natural Language Inference   [CLS] token representation
Question Answering               SQuAD                        Start/end token positions
Token Classification             Named Entity Recognition     Per-token representations
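
For the SQuAD-style setup, the model predicts a start and an end position over the passage tokens; a minimal sketch with Hugging Face's BertForQuestionAnswering (the base checkpoint has no trained QA head, so the extracted span only illustrates the mechanics):

import torch
from transformers import BertTokenizer, BertForQuestionAnswering

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForQuestionAnswering.from_pretrained('bert-base-uncased')

question = "Who introduced BERT?"
context = "BERT was introduced by researchers at Google AI in 2018."
inputs = tokenizer(question, context, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# The highest-scoring start/end positions define the predicted answer span
start = outputs.start_logits.argmax()
end = outputs.end_logits.argmax()
print(tokenizer.decode(inputs["input_ids"][0, start:end + 1]))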

Fine-Tuning Process

  1. Initialize with pre-trained BERT weights
  2. Add a task-specific output layer (typically a single linear layer)
  3. Fine-tune all parameters end-to-end on labeled data
  4. Typical training: 2-4 epochs, learning rate 2e-5 to 5e-5
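
A minimal sketch of this recipe for single-sentence classification with Hugging Face Transformers and PyTorch (data loading and the epoch loop are omitted; the toy batch and labels are illustrative):

import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Steps 1-2: pre-trained encoder plus a fresh linear classification head (2 labels here)
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Step 4: learning rate in the 2e-5 to 5e-5 range
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Step 3: one training step on a toy batch; real fine-tuning iterates over the labeled dataset
batch = tokenizer(["great movie", "terrible movie"], return_tensors="pt", padding=True)
labels = torch.tensor([1, 0])

model.train()
loss = model(**batch, labels=labels).loss   # cross-entropy over the [CLS] representation
loss.backward()
optimizer.step()
optimizer.zero_grad()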

BERT Variants and Extensions

The success of BERT spawned numerous variants addressing its limitations:

Variant      Innovation                      Key Improvement
RoBERTa      Robust training                 Removed NSP, larger batches, more data
ALBERT       Parameter efficiency            Factorized embeddings, cross-layer sharing
DistilBERT   Model compression               40% smaller, 60% faster, retains 97% of performance
SpanBERT     Span masking                    Better for extractive tasks
XLNet        Permutation language modeling   Captures bidirectional context without masks

Multilingual BERT (mBERT)

A version of BERT trained on 104 languages simultaneously, enabling cross-lingual transfer learning: a model fine-tuned on English data can often perform well on the same task in other languages.[3]
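
Loading the multilingual checkpoint works exactly like the English one; only the model name changes (a minimal sketch using the bert-base-multilingual-cased checkpoint from the Hugging Face Hub):

from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
model = BertModel.from_pretrained('bert-base-multilingual-cased')

# A single shared WordPiece vocabulary covers all 104 pre-training languages
inputs = tokenizer("BERT verarbeitet auch deutsche Sätze.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)   # [1, seq_len, 768]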


Practical Implementation

Using BERT with Hugging Face Transformers

from transformers import BertTokenizer, BertModel

# Load pre-trained model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Tokenize input
text = "BERT revolutionized natural language processing."
inputs = tokenizer(text, return_tensors="pt")

# Get embeddings
outputs = model(**inputs)
last_hidden_states = outputs.last_hidden_state  # Shape: [batch, seq_len, hidden_size]
pooled_output = outputs.pooler_output           # Shape: [batch, hidden_size]

Computational Considerations

  • Memory: BERT’s self-attention has $O(n^2)$ memory complexity in the sequence length $n$
  • Maximum sequence length: 512 tokens (limited by the learned position embeddings)
  • Inference time: Significantly slower than earlier, lighter models such as static word-embedding pipelines
  • GPU requirements: Fine-tuning BERT-Base typically requires a GPU with at least 12GB of VRAM
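
Because of the 512-token limit, long documents have to be truncated (or split into overlapping chunks) before encoding; a minimal sketch of truncation with the tokenizer used above (the repeated string is just a stand-in for a long document):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

long_text = "BERT revolutionized NLP. " * 200   # stand-in for a document longer than 512 tokens

# Truncate to the model's maximum length; the position embeddings only cover 512 positions
inputs = tokenizer(long_text, truncation=True, max_length=512, return_tensors="pt")
print(inputs["input_ids"].shape)   # torch.Size([1, 512])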

Comparison with Other Models

Feature        BERT                  GPT                              ELMo
Architecture   Transformer Encoder   Transformer Decoder              BiLSTM
Context        Bidirectional         Unidirectional (left-to-right)   Bidirectional (concatenated)
Pre-training   MLM + NSP             Language Modeling                Language Modeling
Fine-tuning    End-to-end            Few-shot/Zero-shot               Feature extraction
Best For       Understanding tasks   Generation tasks                 Feature extraction

Limitations

  1. Fixed context window: Maximum 512 tokens limits processing of long documents
  2. Computational cost: Large memory and compute requirements
  3. Pre-training/fine-tuning gap: [MASK] tokens never appear during fine-tuning
  4. NSP controversy: Later research (RoBERTa) showed NSP may not be necessary
  5. Static after training: Once fine-tuned, the model does not adapt to new domains or contexts without further training

Related Topics

  • Transformers - The foundational architecture
  • Attention Mechanisms - Core component of BERT
  • Word2Vec - Earlier word embedding approach
  • Vector Embeddings - General embedding concepts
  • Semantic Search - Common BERT application

References


  1. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL-HLT 2019. https://arxiv.org/abs/1810.04805

  2. Hugging Face. (2024). BERT Documentation. Hugging Face Transformers. https://huggingface.co/docs/transformers/model_doc/bert

  3. Pires, T., Schlinger, E., & Garrette, D. (2019). How Multilingual is Multilingual BERT? ACL 2019. https://arxiv.org/abs/1906.01502