Word2Vec
Word2Vec is a family of neural network models that learn word embeddings by predicting words from their context (or vice versa). Introduced by Mikolov et al. at Google in 2013, Word2Vec demonstrated that simple neural networks trained on large text corpora could learn rich semantic representations.[^1]
Key Insight
The core idea is the distributional hypothesis: words that appear in similar contexts have similar meanings. Word2Vec operationalizes this by training a neural network to predict context from words or words from context.
Architectures
Continuous Bag of Words (CBOW)
CBOW predicts a target word from its surrounding context words.
Input: Context words (e.g., “the”, “cat”, “on”, “the”, “mat”)
Output: Target word (e.g., “sat”)
The model averages the embeddings of context words and predicts the center word:
$$ \hat{y} = \text{softmax}\left( W' \cdot \frac{1}{2c} \sum_{i=-c,\, i \neq 0}^{c} W \cdot x_{t+i} \right) $$
where $c$ is the context window size.
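To make the averaging step concrete, here is a minimal NumPy sketch of the CBOW forward pass; the vocabulary size, dimensions, and weight matrices are toy values chosen for illustration, not taken from any reference implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 8                                 # toy vocabulary size and embedding dimension
W = rng.normal(scale=0.1, size=(V, d))       # input embeddings (one row per word)
W_out = rng.normal(scale=0.1, size=(d, V))   # output ("prime") weights

def cbow_forward(context_ids):
    """Average the context embeddings, then score every word in the vocabulary."""
    h = W[context_ids].mean(axis=0)          # (1/2c) * sum of context vectors
    scores = h @ W_out                       # unnormalized logits over the vocabulary
    probs = np.exp(scores - scores.max())    # softmax with a stability shift
    return probs / probs.sum()

# e.g., predict the center word from four context word ids
y_hat = cbow_forward([1, 3, 5, 7])
print(y_hat.argmax(), y_hat.sum())           # most probable word id; probabilities sum to 1
```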
Characteristics:
- Faster to train
- Better for frequent words
- Smooths over distributional information
Skip-gram
Skip-gram predicts context words from a target word (opposite of CBOW).
Input: Target word (e.g., “sat”)
Output: Context words (e.g., “the”, “cat”, “on”, “the”, “mat”)
$$ P(w_{context} \mid w_{target}) = \frac{\exp(v'_{w_{context}} \cdot v_{w_{target}})}{\sum_{w \in V} \exp(v'_{w} \cdot v_{w_{target}})} $$
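The same softmax can be written out directly. The NumPy sketch below uses hypothetical `v_in` and `v_out` embedding tables standing in for $W$ and $W'$:

```python
import numpy as np

def skipgram_prob(target_id, context_id, v_in, v_out):
    """P(context | target) under the full-softmax skip-gram model."""
    scores = v_out @ v_in[target_id]   # dot product of the target vector with every output vector
    scores -= scores.max()             # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores[context_id] / exp_scores.sum()

rng = np.random.default_rng(0)
V, d = 10, 8                           # toy vocabulary size and dimension
v_in, v_out = rng.normal(size=(V, d)), rng.normal(size=(V, d))
print(skipgram_prob(target_id=4, context_id=2, v_in=v_in, v_out=v_out))
```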
Characteristics:
- Works better for rare words
- Produces higher quality embeddings for infrequent terms
- More computationally expensive
Architecture Comparison
| Aspect | CBOW | Skip-gram |
|---|---|---|
| Prediction direction | Context → Word | Word → Context |
| Training speed | Faster | Slower |
| Rare words | Worse | Better |
| Frequent words | Better | Good |
| Memory usage | Lower | Higher |
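In practice the choice of architecture is usually a single training flag. A minimal sketch using the gensim library (assuming gensim ≥ 4.0 is installed; the two-sentence corpus is just a placeholder):

```python
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "lay", "on", "the", "rug"]]   # placeholder corpus

# sg=0 selects CBOW, sg=1 selects skip-gram
cbow_model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)
sg_model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)
```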
Training Optimizations
Negative Sampling
Computing the full softmax over a large vocabulary is expensive. Negative sampling approximates it by:
- sampling $k$ negative (noise) words for each positive (target, context) pair
- training a binary classifier to distinguish the true pair from the sampled negatives
$$ \log \sigma(v'_{w_O} \cdot v_{w_I}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma(-v'_{w_i} \cdot v_{w_I}) \right] $$
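A small NumPy sketch of this objective for a single (target, context) pair, with illustrative variable names; a real trainer would also backpropagate through these vectors:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(v_target, v_context_out, v_neg_out):
    """Negative of the objective above: one positive pair plus k sampled negatives."""
    pos = np.log(sigmoid(v_context_out @ v_target))        # true context scored high
    neg = np.log(sigmoid(-(v_neg_out @ v_target))).sum()   # k noise words scored low
    return -(pos + neg)

rng = np.random.default_rng(0)
d, k = 8, 5                                                # toy dimension and number of negatives
loss = neg_sampling_loss(rng.normal(size=d), rng.normal(size=d), rng.normal(size=(k, d)))
print(loss)
```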
Subsampling Frequent Words
Very common words like “the”, “a”, and “is” carry little information about their neighbors. Subsampling discards each occurrence of a word $w_i$ with probability:
$$ P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}} $$
where $f(w_i)$ is the relative frequency of $w_i$ in the corpus and $t$ is a chosen threshold (typically $10^{-5}$).
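A quick worked example of the discard probability at a few illustrative relative frequencies (negative values are clipped to zero, so words rarer than $t$ are always kept):

```python
import math

t = 1e-5
for f in (0.05, 1e-3, 1e-5):   # e.g. a stopword, a mid-frequency word, a rare word
    p_discard = max(0.0, 1 - math.sqrt(t / f))
    print(f"f={f:g}: discard with p={p_discard:.3f}")
# very frequent words are dropped most of the time; rare words are always kept
```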
Properties of Word2Vec Embeddings
Analogical Reasoning
Word2Vec embeddings capture semantic relationships through vector arithmetic:
| Relationship | Example |
|---|---|
| Gender | king - man + woman ≈ queen |
| Capital | Paris - France + Italy ≈ Rome |
| Tense | walking - walk + swim ≈ swimming |
| Comparative | bigger - big + small ≈ smaller |
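With a pretrained model these analogies can be probed directly. The sketch below assumes the gensim library and a locally downloaded copy of the widely distributed GoogleNews vectors (both are assumptions, and results vary with the training corpus):

```python
from gensim.models import KeyedVectors

# assumes the pretrained binary file has already been downloaded locally
wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# king - man + woman ≈ ?
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```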
Clustering
Similar words cluster together in the embedding space:
- Countries cluster together
- Verbs cluster by tense
- Adjectives cluster by sentiment
Limitations
| Limitation | Description |
|---|---|
| Static embeddings | One vector per word regardless of context |
| Out-of-vocabulary | Cannot handle words not seen during training |
| Polysemy | “bank” (financial) and “bank” (river) have the same embedding |
| Training data bias | Embeddings reflect biases in training corpus |
Comparison with Other Methods
| Method | Type | Context | Key Difference |
|---|---|---|---|
| Word2Vec | Neural | Local (window) | Predictive model |
| GloVe | Matrix factorization | Global (corpus-wide) | Co-occurrence statistics |
| BERT | Transformer | Bidirectional | Contextual embeddings |
Related Topics
- Vector Embeddings - Overview of embedding techniques
- GloVe - Alternative using global co-occurrence
- BERT - Contextual embeddings from Transformers
- Semantic Search - Applications of embeddings
References
[^1]: Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781. https://arxiv.org/abs/1301.3781