GloVe

GloVe (Global Vectors for Word Representation) is an unsupervised learning algorithm for obtaining vector representations of words. Developed by Pennington, Socher, and Manning at Stanford in 2014, GloVe combines the advantages of global matrix factorization methods (like LSA) with local context window methods (like Word2Vec).[1]


Key Insight

Unlike Word2Vec, which learns from local context windows, GloVe explicitly leverages global word co-occurrence statistics from the entire corpus. The key insight is that word-word co-occurrence probabilities encode meaningful semantic relationships.


The Co-occurrence Matrix

GloVe starts by building a word-word co-occurrence matrix $X$, where:

  • $X_{ij}$ = number of times word $j$ appears in the context of word $i$
  • Context is typically defined by a sliding window
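
As a concrete illustration, here is a minimal Python sketch of building such a co-occurrence table with a symmetric sliding window. The function name, toy corpus, and window size are illustrative; the 1/d distance weighting follows the decreasing weighting described in the GloVe paper.

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window_size=5):
    """Count symmetric word-word co-occurrences within a sliding window.

    Returns a dict mapping (center_word, context_word) -> weighted count.
    Word pairs that are d positions apart contribute 1/d, as in GloVe.
    """
    counts = defaultdict(float)
    for i, center in enumerate(tokens):
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j == i:
                continue
            distance = abs(j - i)          # decay distant context words by 1/d
            counts[(center, tokens[j])] += 1.0 / distance
    return counts

# Toy usage on a hypothetical mini-corpus:
tokens = "ice is solid and steam is gas and water is wet".split()
X = cooccurrence_counts(tokens, window_size=2)
print(X[("ice", "solid")])
```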

Ratio of Co-occurrence Probabilities

The brilliance of GloVe lies in observing that ratios of co-occurrence probabilities carry meaning:

| k | P(k\|ice) | P(k\|steam) | P(k\|ice) / P(k\|steam) |
|---|---|---|---|
| solid | high | low | large |
| gas | low | high | small |
| water | high | high | ≈ 1 |
| fashion | low | low | ≈ 1 |

The ratio distinguishes words characteristic of "ice" (such as solid) from words characteristic of "steam" (such as gas), while words related to both (water) or neither (fashion) give a ratio close to 1.
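
To make the table concrete, the sketch below computes such a ratio directly from a co-occurrence count matrix. The function and argument names are illustrative, and it assumes every count involved is non-zero.

```python
import numpy as np

def cooccurrence_ratio(X, vocab, target_a, target_b, probe):
    """Compute P(probe | target_a) / P(probe | target_b) from a count matrix.

    X is a |V| x |V| array where X[i, j] counts word j in the context of word i;
    vocab maps words to row/column indices.
    """
    a, b, k = vocab[target_a], vocab[target_b], vocab[probe]
    p_k_given_a = X[a, k] / X[a].sum()   # P(k | a)
    p_k_given_b = X[b, k] / X[b].sum()   # P(k | b)
    return p_k_given_a / p_k_given_b

# With a real corpus, cooccurrence_ratio(X, vocab, "ice", "steam", "solid")
# should come out large, while the same call with probe="water" should be near 1.
```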


The GloVe Objective

GloVe trains word vectors such that their dot product equals the logarithm of the words’ probability of co-occurrence:

$$ w_i^T \tilde{w}_j + b_i + \tilde{b}_j = \log(X_{ij}) $$

where:

  • $w_i$ and $\tilde{w}_j$ are the word and context-word vectors
  • $b_i$ and $\tilde{b}_j$ are bias terms

Weighted Least Squares Objective

The full objective function is:

$$ J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2 $$

where $f(x)$ is a weighting function that:

  • Gives lower weight to rare co-occurrences (noisy)
  • Caps the weight for very frequent co-occurrences

$$ f(x) = \begin{cases} (x/x_{max})^\alpha & \text{if } x < x_{max} \\ 1 & \text{otherwise} \end{cases} $$

Typically $\alpha = 0.75$ and $x_{max} = 100$.
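
The following NumPy sketch implements the weighting function and evaluates the full objective, assuming a dense co-occurrence matrix and randomly initialized parameters. All names and sizes are illustrative; a real implementation would use sparse counts and iterative updates (the original release uses AdaGrad).

```python
import numpy as np

def weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(x): down-weights rare pairs, caps very frequent ones."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(X, W, W_tilde, b, b_tilde):
    """Weighted least-squares objective J, summed over observed (non-zero) pairs.

    X:           |V| x |V| co-occurrence counts (dense here for simplicity)
    W, W_tilde:  |V| x d word and context vectors
    b, b_tilde:  |V| bias terms
    """
    i_idx, j_idx = np.nonzero(X)                  # sum only over observed pairs
    x_ij = X[i_idx, j_idx]
    scores = np.sum(W[i_idx] * W_tilde[j_idx], axis=1) + b[i_idx] + b_tilde[j_idx]
    residual = scores - np.log(x_ij)
    return np.sum(weight(x_ij) * residual ** 2)

# Toy check with arbitrary dimensions and random parameters:
V, d = 50, 8
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(V, V)).astype(float)
W, W_tilde = rng.normal(scale=0.1, size=(V, d)), rng.normal(scale=0.1, size=(V, d))
b, b_tilde = np.zeros(V), np.zeros(V)
print(glove_loss(X, W, W_tilde, b, b_tilde))
```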


GloVe vs Word2Vec

| Aspect | GloVe | Word2Vec |
|---|---|---|
| Training signal | Global co-occurrence matrix | Local context windows |
| Objective | Weighted least squares | Cross-entropy (or negative sampling) |
| Statistics used | Corpus-wide | Per-sample |
| Training | Matrix factorization style | Neural network style |
| Efficiency | Efficient on co-occurrence statistics | Efficient on raw text |

When to Use Each

| Scenario | Recommendation |
|---|---|
| Large corpus, efficiency matters | Word2Vec with negative sampling |
| Capturing global semantics | GloVe |
| Domain-specific vocabulary | Either, trained on a domain corpus |
| Modern applications | Consider contextual embeddings (BERT) |

Pre-trained GloVe Vectors

Stanford provides pre-trained GloVe vectors:

| Corpus | Vocabulary | Dimensions | Size |
|---|---|---|---|
| Wikipedia + Gigaword | 400K | 50, 100, 200, 300 | 822 MB |
| Common Crawl (42B tokens) | 1.9M | 300 | 5.6 GB |
| Common Crawl (840B tokens) | 2.2M | 300 | 5.6 GB |
| Twitter (27B tokens) | 1.2M | 25, 50, 100, 200 | 1.4 GB |
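
The released vectors are plain text files, one word per line followed by its float components, so they can be loaded without any special library. Here is a minimal loader sketch; the file name in the usage comment is one of the Stanford downloads, and some of the larger files contain a few malformed lines that may need extra handling.

```python
import numpy as np

def load_glove(path):
    """Parse a GloVe .txt file into a dict of word -> numpy vector."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word, vec = parts[0], np.asarray(parts[1:], dtype=np.float32)
            embeddings[word] = vec
    return embeddings

# vectors = load_glove("glove.6B.100d.txt")
# Cosine similarity between two words:
# a, b = vectors["ice"], vectors["steam"]
# print(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```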

Limitations

| Limitation | Description |
|---|---|
| Static embeddings | Same vector regardless of context |
| Memory for co-occurrence | Large vocabulary requires significant memory |
| Out-of-vocabulary | Cannot handle unseen words |
| Context-insensitive | "bank" (financial) = "bank" (river) |

See Also

  • Word2Vec - Neural network approach to embeddings
  • Vector Embeddings - Overview of embedding techniques
  • BERT - Contextual embeddings from Transformers
  • Semantic Search - Applications of embeddings

References


  1. Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global Vectors for Word Representation. EMNLP. https://nlp.stanford.edu/pubs/glove.pdf