GloVe

GloVe (Global Vectors for Word Representation) is an unsupervised learning algorithm for obtaining vector representations of words. Developed by Pennington, Socher, and Manning at Stanford in 2014, GloVe combines the advantages of global matrix factorization methods (like LSA) with local context window methods (like Word2Vec).[1]


Key Insight

Unlike Word2Vec, which learns from local context windows, GloVe explicitly leverages global word co-occurrence statistics from the entire corpus. The key insight is that word-word co-occurrence probabilities encode meaningful semantic relationships.


The Co-occurrence Matrix

GloVe starts by building a word-word co-occurrence matrix $X$, where:

  • $X_{ij}$ = number of times word $j$ appears in the context of word $i$
  • Context is typically defined by a sliding window
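
As a concrete illustration, here is a minimal Python sketch of building such a co-occurrence table with a symmetric sliding window. The function name, toy corpus, and window size are illustrative; the 1/d distance weighting follows the decreasing weighting described in the GloVe paper.

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window_size=5):
    """Count symmetric word-word co-occurrences within a sliding window.

    Returns a dict mapping (center_word, context_word) -> weighted count.
    Word pairs that are d positions apart contribute 1/d, as in GloVe.
    """
    counts = defaultdict(float)
    for i, center in enumerate(tokens):
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j == i:
                continue
            distance = abs(j - i)          # decay distant context words by 1/d
            counts[(center, tokens[j])] += 1.0 / distance
    return counts

# Toy usage on a hypothetical mini-corpus:
tokens = "ice is solid and steam is gas and water is wet".split()
X = cooccurrence_counts(tokens, window_size=2)
print(X[("ice", "solid")])
```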

Ratio of Co-occurrence Probabilities

The brilliance of GloVe lies in observing that ratios of co-occurrence probabilities carry meaning:

| k | P(k\|ice) | P(k\|steam) | P(k\|ice) / P(k\|steam) |
|---|---|---|---|
| solid | high | low | large |
| gas | low | high | small |
| water | high | high | ≈ 1 |
| fashion | low | low | ≈ 1 |

The ratio distinguishes words characteristic of "ice" (such as solid) from words characteristic of "steam" (such as gas), while words related to both (water) or neither (fashion) give a ratio close to 1.
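
To make the table concrete, the sketch below computes such a ratio directly from a co-occurrence count matrix. The function and argument names are illustrative, and it assumes every count involved is non-zero.

```python
import numpy as np

def cooccurrence_ratio(X, vocab, target_a, target_b, probe):
    """Compute P(probe | target_a) / P(probe | target_b) from a count matrix.

    X is a |V| x |V| array where X[i, j] counts word j in the context of word i;
    vocab maps words to row/column indices.
    """
    a, b, k = vocab[target_a], vocab[target_b], vocab[probe]
    p_k_given_a = X[a, k] / X[a].sum()   # P(k | a)
    p_k_given_b = X[b, k] / X[b].sum()   # P(k | b)
    return p_k_given_a / p_k_given_b

# With a real corpus, cooccurrence_ratio(X, vocab, "ice", "steam", "solid")
# should come out large, while the same call with probe="water" should be near 1.
```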


The GloVe Objective

GloVe trains word vectors such that their dot product equals the logarithm of the words’ probability of co-occurrence:

$$ w_i^T \tilde{w}_j + b_i + \tilde{b}_j = \log(X_{ij}) $$

where:

  • $w_i$ and $\tilde{w}_j$ are the word and context-word vectors
  • $b_i$ and $\tilde{b}_j$ are bias terms

Weighted Least Squares Objective

The full objective function is:

$$ J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2 $$

where $f(x)$ is a weighting function that:

  • Gives lower weight to rare co-occurrences (noisy)
  • Caps the weight for very frequent co-occurrences

$$ f(x) = \begin{cases} (x/x_{max})^\alpha & \text{if } x < x_{max} \\ 1 & \text{otherwise} \end{cases} $$

Typically $\alpha = 0.75$ and $x_{max} = 100$.
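
The following NumPy sketch implements the weighting function and evaluates the full objective, assuming a dense co-occurrence matrix and randomly initialized parameters. All names and sizes are illustrative; a real implementation would use sparse counts and iterative updates (the original release uses AdaGrad).

```python
import numpy as np

def weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(x): down-weights rare pairs, caps very frequent ones."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(X, W, W_tilde, b, b_tilde):
    """Weighted least-squares objective J, summed over observed (non-zero) pairs.

    X:           |V| x |V| co-occurrence counts (dense here for simplicity)
    W, W_tilde:  |V| x d word and context vectors
    b, b_tilde:  |V| bias terms
    """
    i_idx, j_idx = np.nonzero(X)                  # sum only over observed pairs
    x_ij = X[i_idx, j_idx]
    scores = np.sum(W[i_idx] * W_tilde[j_idx], axis=1) + b[i_idx] + b_tilde[j_idx]
    residual = scores - np.log(x_ij)
    return np.sum(weight(x_ij) * residual ** 2)

# Toy check with arbitrary dimensions and random parameters:
V, d = 50, 8
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(V, V)).astype(float)
W, W_tilde = rng.normal(scale=0.1, size=(V, d)), rng.normal(scale=0.1, size=(V, d))
b, b_tilde = np.zeros(V), np.zeros(V)
print(glove_loss(X, W, W_tilde, b, b_tilde))
```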


GloVe vs Word2Vec

| Aspect | GloVe | Word2Vec |
|---|---|---|
| Training signal | Global co-occurrence matrix | Local context windows |
| Objective | Weighted least squares | Cross-entropy (or negative sampling) |
| Statistics used | Corpus-wide | Per-sample |
| Training | Matrix factorization style | Neural network style |
| Efficiency | Efficient on co-occurrence statistics | Efficient on raw text |

When to Use Each

| Scenario | Recommendation |
|---|---|
| Large corpus, efficiency matters | Word2Vec with negative sampling |
| Capturing global semantics | GloVe |
| Domain-specific vocabulary | Either, trained on a domain corpus |
| Modern applications | Consider contextual embeddings (BERT) |

Pre-trained GloVe Vectors

Stanford provides pre-trained GloVe vectors:

| Corpus | Vocabulary | Dimensions | Size |
|---|---|---|---|
| Wikipedia + Gigaword | 400K | 50, 100, 200, 300 | 822 MB |
| Common Crawl (42B tokens) | 1.9M | 300 | 5.6 GB |
| Common Crawl (840B tokens) | 2.2M | 300 | 5.6 GB |
| Twitter (27B tokens) | 1.2M | 25, 50, 100, 200 | 1.4 GB |
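
The released vectors are plain text files, one word per line followed by its float components, so they can be loaded without any special library. Here is a minimal loader sketch; the file name in the usage comment is one of the Stanford downloads, and some of the larger files contain a few malformed lines that may need extra handling.

```python
import numpy as np

def load_glove(path):
    """Parse a GloVe .txt file into a dict of word -> numpy vector."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word, vec = parts[0], np.asarray(parts[1:], dtype=np.float32)
            embeddings[word] = vec
    return embeddings

# vectors = load_glove("glove.6B.100d.txt")
# Cosine similarity between two words:
# a, b = vectors["ice"], vectors["steam"]
# print(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```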

Limitations

| Limitation | Description |
|---|---|
| Static embeddings | Same vector regardless of context |
| Memory for co-occurrence | Large vocabulary requires significant memory |
| Out-of-vocabulary | Cannot handle unseen words |
| Context-insensitive | "bank" (financial) = "bank" (river) |

See Also

  • Word2Vec - Neural network approach to embeddings
  • Vector Embeddings - Overview of embedding techniques
  • BERT - Contextual embeddings from Transformers
  • Semantic Search - Applications of embeddings

References


  1. Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global Vectors for Word Representation. EMNLP. https://nlp.stanford.edu/pubs/glove.pdf