Word2Vec

Word2Vec is a family of neural network models that learn word embeddings by predicting words from their context (or vice versa). Introduced by Mikolov et al. at Google in 2013, Word2Vec demonstrated that simple neural networks trained on large text corpora could learn rich semantic representations [1].


Key Insight

The core idea is the distributional hypothesis: words that appear in similar contexts have similar meanings. Word2Vec operationalizes this by training a neural network to predict context from words or words from context.


Architectures

Continuous Bag of Words (CBOW)

CBOW predicts a target word from its surrounding context words.

Input: Context words (e.g., “the”, “cat”, “on”, “the”, “mat”)
Output: Target word (e.g., “sat”)

The model averages the embeddings of context words and predicts the center word:

$$ \hat{y} = \text{softmax}\left( W' \cdot \frac{1}{2c} \sum_{i=-c,\, i \neq 0}^{c} W \cdot x_{t+i} \right) $$

where $c$ is the context window size, $x_{t+i}$ is the one-hot vector of the context word at offset $i$, $W$ is the input embedding matrix, and $W'$ is the output weight matrix.
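
As a concrete illustration, the following minimal NumPy sketch computes this averaged-context prediction over a toy five-word vocabulary. The matrices are randomly initialized rather than trained, and all names and sizes are purely illustrative.

```python
import numpy as np

# Toy CBOW forward pass: average the context embeddings, then score every word.
vocab = ["the", "cat", "sat", "on", "mat"]
word_to_id = {w: i for i, w in enumerate(vocab)}

V, d = len(vocab), 8                           # vocabulary size, embedding dimension
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(V, d))         # input embedding matrix
W_out = rng.normal(scale=0.1, size=(d, V))     # output weight matrix (W')

def cbow_predict(context_words):
    """Average the context embeddings and return a softmax over the vocabulary."""
    ids = [word_to_id[w] for w in context_words]
    h = W[ids].mean(axis=0)                    # (1/2c) * sum of context embeddings
    scores = h @ W_out                         # unnormalized score for each word
    exp = np.exp(scores - scores.max())        # numerically stable softmax
    return exp / exp.sum()

probs = cbow_predict(["the", "cat", "on", "mat"])
print(vocab[int(np.argmax(probs))])            # the model's guess for the center word
```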

Characteristics:

  • Faster to train
  • Better for frequent words
  • Smooths over distributional information

Skip-gram

Skip-gram predicts context words from a target word (opposite of CBOW).

Input: Target word (e.g., “sat”)
Output: Context words (e.g., “the”, “cat”, “on”, “the”, “mat”)

$$ P(w_{context} \mid w_{target}) = \frac{\exp(v'_{w_{context}} \cdot v_{w_{target}})}{\sum_{w \in V} \exp(v'_{w} \cdot v_{w_{target}})} $$

where $v_w$ and $v'_w$ are the input and output vectors of word $w$, and $V$ is the vocabulary.
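
A minimal sketch of this full-softmax scoring, again with a toy vocabulary and untrained random vectors (all names below are illustrative):

```python
import numpy as np

# Toy skip-gram scoring: probability of a context word given a target word,
# using separate input (v) and output (v') embedding tables.
vocab = ["the", "cat", "sat", "on", "mat"]
word_to_id = {w: i for i, w in enumerate(vocab)}

V, d = len(vocab), 8
rng = np.random.default_rng(0)
v_in = rng.normal(scale=0.1, size=(V, d))      # v_w : input (target) vectors
v_out = rng.normal(scale=0.1, size=(V, d))     # v'_w: output (context) vectors

def p_context_given_target(context, target):
    """Full-softmax probability P(context | target) over the whole vocabulary."""
    scores = v_out @ v_in[word_to_id[target]]  # dot product with every v'_w
    exp = np.exp(scores - scores.max())
    return exp[word_to_id[context]] / exp.sum()

print(p_context_given_target("cat", "sat"))
```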

Characteristics:

  • Produces higher-quality embeddings for rare and infrequent words
  • More computationally expensive

Architecture Comparison

| Aspect | CBOW | Skip-gram |
|---|---|---|
| Prediction direction | Context → Word | Word → Context |
| Training speed | Faster | Slower |
| Rare words | Worse | Better |
| Frequent words | Better | Good |
| Memory usage | Lower | Higher |

Training Optimizations

Negative Sampling

Computing the full softmax over a large vocabulary is expensive. Negative sampling approximates it by:

  1. For each positive (target, context) pair, sample $k$ negative examples
  2. Train a binary classifier to distinguish positive from negative pairs

For a positive pair with input word $w_I$ and context word $w_O$, the per-pair objective to maximize is

$$ \log \sigma(v'_{w_O} \cdot v_{w_I}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma(-v'_{w_i} \cdot v_{w_I}) \right] $$

where $\sigma$ is the sigmoid function and $P_n(w)$ is the noise distribution from which negative samples are drawn.
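
A rough NumPy sketch of this per-pair loss, assuming toy untrained embedding tables and a unigram-to-the-3/4 noise distribution; the sizes and names below are made up for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

V, d, k = 1000, 50, 5                           # toy vocabulary size, dimension, negatives per pair
rng = np.random.default_rng(0)
v_in = rng.normal(scale=0.1, size=(V, d))       # v_{w_I}: input (target) vectors
v_out = rng.normal(scale=0.1, size=(V, d))      # v'_{w}: output (context) vectors

# Noise distribution P_n(w): unigram counts raised to 3/4 (the paper's reported default),
# with fake counts standing in for real corpus statistics.
counts = rng.integers(1, 100, size=V).astype(float)
noise_dist = counts ** 0.75
noise_dist /= noise_dist.sum()

def negative_sampling_loss(target_id, context_id):
    """Negated objective: one positive (target, context) pair plus k sampled negatives."""
    pos = np.log(sigmoid(v_out[context_id] @ v_in[target_id]))          # pull positive pair together
    neg_ids = rng.choice(V, size=k, p=noise_dist)                       # draw k noise words
    neg = np.log(sigmoid(-(v_out[neg_ids] @ v_in[target_id]))).sum()    # push negatives apart
    return -(pos + neg)

print(negative_sampling_loss(target_id=3, context_id=17))
```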

Subsampling Frequent Words

Common words like “the”, “a”, “is” provide less information. Subsampling discards them with probability:

$$ P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}} $$

where $f(w_i)$ is the relative frequency of word $w_i$ in the corpus and $t$ is a chosen threshold (typically $10^{-5}$).
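
A small sketch of this discard rule, with made-up word frequencies purely for illustration:

```python
import numpy as np

t = 1e-5
# Fake relative frequencies: "the" is very common, the others are rare.
freqs = {"the": 0.05, "cat": 0.0002, "embedding": 0.00001}

def discard_prob(word):
    """P(discard) = 1 - sqrt(t / f(w)), clipped at 0 so rare words are always kept."""
    return max(0.0, 1.0 - np.sqrt(t / freqs[word]))

rng = np.random.default_rng(0)
for w in freqs:
    keep = rng.random() >= discard_prob(w)       # sample once per token occurrence
    print(w, round(discard_prob(w), 4), "kept" if keep else "dropped")
```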


Properties of Word2Vec Embeddings

Analogical Reasoning

Word2Vec embeddings capture semantic relationships through vector arithmetic:

| Relationship | Example |
|---|---|
| Gender | king - man + woman ≈ queen |
| Capital | Paris - France + Italy ≈ Rome |
| Tense | walking - walk + swim ≈ swimming |
| Comparative | bigger - big + small ≈ smaller |
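
A sketch of such analogy queries using gensim's KeyedVectors; the vector file name is a placeholder for any pretrained word2vec vectors in the standard binary format.

```python
from gensim.models import KeyedVectors

# Load pretrained vectors (placeholder path; any word2vec-format binary file works).
vectors = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# king - man + woman ≈ queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# Paris - France + Italy ≈ Rome
print(vectors.most_similar(positive=["Paris", "Italy"], negative=["France"], topn=1))
```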

Clustering

Similar words cluster together in the embedding space:

  • Countries cluster together
  • Verbs cluster by tense
  • Adjectives cluster by sentiment
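
As a rough illustration, word vectors can be grouped with an off-the-shelf clustering algorithm. The sketch below assumes the `vectors` object loaded in the previous snippet and uses scikit-learn's k-means; the word list is arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans

# A handful of words expected to fall into country / verb / adjective groups.
words = ["France", "Italy", "Spain", "walking", "swimming", "running", "good", "bad", "terrible"]
X = np.stack([vectors[w] for w in words])      # look up each word's embedding

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
for cluster in range(3):
    print(cluster, [w for w, l in zip(words, labels) if l == cluster])
```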

Limitations

| Limitation | Description |
|---|---|
| Static embeddings | One vector per word regardless of context |
| Out-of-vocabulary | Cannot handle words not seen during training |
| Polysemy | “bank” (financial) and “bank” (river) share the same embedding |
| Training data bias | Embeddings reflect biases in the training corpus |

Comparison with Other Methods

| Method | Type | Context | Key Difference |
|---|---|---|---|
| Word2Vec | Neural | Local (window) | Predictive model |
| GloVe | Matrix factorization | Global (corpus-wide) | Co-occurrence statistics |
| BERT | Transformer | Bidirectional | Contextual embeddings |

See Also

  • Vector Embeddings - Overview of embedding techniques
  • GloVe - Alternative using global co-occurrence
  • BERT - Contextual embeddings from Transformers
  • Semantic Search - Applications of embeddings

References


  1. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781. https://arxiv.org/abs/1301.3781