GloVe
GloVe (Global Vectors for Word Representation) is an unsupervised learning algorithm for obtaining vector representations of words. Developed by Pennington, Socher, and Manning at Stanford in 2014, GloVe combines the advantages of global matrix factorization methods (like LSA) with local context window methods (like Word2Vec).[^1]
Key Insight
Unlike Word2Vec, which learns from local context windows, GloVe explicitly leverages global word co-occurrence statistics from the entire corpus. The key insight is that word-word co-occurrence probabilities encode meaningful semantic relationships.
The Co-occurrence Matrix
GloVe starts by building a word-word co-occurrence matrix $X$, where:
- $X_{ij}$ = number of times word $j$ appears in the context of word $i$
- Context is typically defined by a symmetric sliding window; in the original paper, a context word at distance $d$ from the center word contributes $1/d$ to the count, so nearer neighbors count more (see the sketch below)
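A minimal sketch of how such a matrix can be accumulated. The function name and toy corpus are illustrative; the reference implementation streams a real corpus and stores $X$ sparsely:

```python
from collections import defaultdict

def build_cooccurrence(corpus, window_size=5):
    """Accumulate word-word co-occurrence counts with a sliding window.

    Following the GloVe paper, a context word at distance d from the
    center word contributes 1/d to the count.
    """
    X = defaultdict(float)
    for tokens in corpus:  # corpus: iterable of tokenized sentences
        for i, center in enumerate(tokens):
            for j in range(max(0, i - window_size), i):  # left context
                d = i - j
                X[(center, tokens[j])] += 1.0 / d  # symmetric window:
                X[(tokens[j], center)] += 1.0 / d  # count both directions
    return X

# Toy corpus for illustration only
X = build_cooccurrence([["ice", "is", "a", "solid"],
                        ["steam", "is", "a", "gas"]], window_size=2)
print(X[("is", "a")])  # 2.0: the words are adjacent in both sentences
```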
Ratio of Co-occurrence Probabilities
The central observation behind GloVe is that ratios of co-occurrence probabilities carry semantic meaning, where $P(k \mid i) = X_{ik} / X_i$ and $X_i = \sum_j X_{ij}$:
| Probe word | P(k\|ice) | P(k\|steam) | P(k\|ice) / P(k\|steam) |
|---|---|---|---|
| k = solid | high | low | large |
| k = gas | low | high | small |
| k = water | high | high | ≈ 1 |
| k = fashion | low | low | ≈ 1 |
Only the ratio discriminates: it is large for words related to ice but not steam (solid), small for words related to steam but not ice (gas), and close to 1 for words related to both (water) or neither (fashion).
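Plugging in the actual probabilities reported in Table 1 of the GloVe paper reproduces this pattern (a small NumPy check; the probe words and values come straight from the paper):

```python
import numpy as np

# Co-occurrence probabilities for "ice" and "steam" from Table 1 of the
# GloVe paper (computed on a 6-billion-token corpus).
probes = ["solid", "gas", "water", "fashion"]
P_ice = np.array([1.9e-4, 6.6e-5, 3.0e-3, 1.7e-5])    # P(k | ice)
P_steam = np.array([2.2e-5, 7.8e-4, 2.2e-3, 1.8e-5])  # P(k | steam)

for k, ratio in zip(probes, P_ice / P_steam):
    print(f"P({k}|ice) / P({k}|steam) = {ratio:.2f}")
# solid: 8.64 (large), gas: 0.08 (small), water: 1.36, fashion: 0.94
```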
The GloVe Objective
GloVe trains word vectors such that their dot product equals the logarithm of the words’ probability of co-occurrence:
$$ w_i^T \tilde{w}_j + b_i + \tilde{b}_j = \log(X_{ij}) $$
where:
- $w_i$ and $\tilde{w}_j$ are the word and context-word vectors, respectively
- $b_i$ and $\tilde{b}_j$ are bias terms
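This log-bilinear form comes from the paper's derivation: requiring that relationships between word vectors capture ratios of co-occurrence probabilities leads to

$$ w_i^T \tilde{w}_k = \log P_{ik} = \log X_{ik} - \log X_i $$

Since $\log X_i$ does not depend on $k$, it can be absorbed into the bias $b_i$; the bias $\tilde{b}_k$ is added to restore symmetry between the word and context roles.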
Weighted Least Squares Objective
The full objective function is:
$$ J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2 $$
where $V$ is the vocabulary size and $f(x)$ is a weighting function that:
- Gives lower weight to rare co-occurrences (noisy)
- Caps the weight for very frequent co-occurrences
- Satisfies $f(0) = 0$, so pairs that never co-occur drop out of the sum and $\log X_{ij}$ is never evaluated at zero
$$ f(x) = \begin{cases} (x/x_{max})^\alpha & \text{if } x < x_{max} \\ 1 & \text{otherwise} \end{cases} $$
Typically $\alpha = 0.75$ and $x_{max} = 100$.
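A compact NumPy sketch of $f$ and the objective (dense $X$ for clarity; real implementations keep $X$ sparse and minimize $J$ with AdaGrad, as the reference implementation does):

```python
import numpy as np

def f(x, x_max=100.0, alpha=0.75):
    """GloVe weighting: down-weights rare pairs, caps frequent ones at 1."""
    x = np.asarray(x, dtype=float)
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(W, W_tilde, b, b_tilde, X):
    """Weighted least-squares objective J, summed over nonzero X_ij.

    W, W_tilde: (V, d) word / context vectors; b, b_tilde: (V,) biases.
    Since f(0) = 0, pairs that never co-occur contribute nothing, and
    skipping them also avoids evaluating log(0).
    """
    i, j = np.nonzero(X)
    err = (W[i] * W_tilde[j]).sum(axis=1) + b[i] + b_tilde[j] - np.log(X[i, j])
    return float((f(X[i, j]) * err ** 2).sum())
```

After training, the paper uses the sum $W + \tilde{W}$ as the final embeddings, which it reports gives a small additional performance boost.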
GloVe vs Word2Vec
| Aspect | GloVe | Word2Vec |
|---|---|---|
| Training signal | Global co-occurrence matrix | Local context windows |
| Objective | Weighted least squares | Cross-entropy (or negative sampling) |
| Statistics used | Corpus-wide | Per-sample |
| Training | Matrix factorization style | Neural network style |
| Efficiency | Scales with nonzero entries of $X$ | Scales with corpus size |
When to Use Each
| Scenario | Recommendation |
|---|---|
| Large corpus, efficiency matters | Word2Vec with negative sampling |
| Capturing global semantics | GloVe |
| Domain-specific vocabulary | Either, with domain corpus |
| Modern applications | Consider contextual embeddings (BERT) |
Pre-trained GloVe Vectors
Stanford provides pre-trained GloVe vectors:
| Corpus | Vocabulary | Dimensions | Size |
|---|---|---|---|
| Wikipedia 2014 + Gigaword 5 (6B tokens) | 400K | 50, 100, 200, 300 | 822 MB |
| Common Crawl (42B tokens) | 1.9M | 300 | 1.75 GB |
| Common Crawl (840B tokens) | 2.2M | 300 | 2.03 GB |
| Twitter (27B tokens) | 1.2M | 25, 50, 100, 200 | 1.42 GB |
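The downloads are plain text, one word per line followed by its vector components, so loading them needs no special tooling. A minimal loader sketch (the helper names are illustrative, and it assumes glove.6B.100d.txt from the Wikipedia + Gigaword archive has been unzipped into the working directory):

```python
import numpy as np

def load_glove(path):
    """Parse a GloVe .txt file: each line is a word followed by floats."""
    vectors = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            word, *values = line.rstrip().split(" ")
            vectors[word] = np.asarray(values, dtype=np.float32)
    return vectors

def cosine(a, b):
    """Cosine similarity, the usual way to compare GloVe vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

vectors = load_glove("glove.6B.100d.txt")
print(cosine(vectors["ice"], vectors["solid"]))
```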
Limitations
| Limitation | Description |
|---|---|
| Static embeddings | Same vector regardless of context |
| Memory for co-occurrence | Large vocabulary requires significant memory |
| Out-of-vocabulary | Cannot handle unseen words |
| Context-insensitive | “bank” (financial) = “bank” (river) |
Related Topics
- Word2Vec - Neural network approach to embeddings
- Vector Embeddings - Overview of embedding techniques
- BERT - Contextual embeddings from Transformers
- Semantic Search - Applications of embeddings
References
[^1]: Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global Vectors for Word Representation. *EMNLP*. https://nlp.stanford.edu/pubs/glove.pdf