Word2Vec

Word2Vec is a family of neural network models that learn word embeddings by predicting words from their context (or vice versa). Introduced by Mikolov et al. at Google in 2013, Word2Vec demonstrated that simple neural networks trained on large text corpora could learn rich semantic representations [1].


Key Insight

The core idea is the distributional hypothesis: words that appear in similar contexts have similar meanings. Word2Vec operationalizes this by training a neural network to predict context from words or words from context.


Architectures

Continuous Bag of Words (CBOW)

CBOW predicts a target word from its surrounding context words.

Input: Context words (e.g., “the”, “cat”, “on”, “the”, “mat”)
Output: Target word (e.g., “sat”)

The model averages the embeddings of context words and predicts the center word:

$$ \hat{y} = \text{softmax}\left( W' \cdot \frac{1}{2c} \sum_{i=-c,\, i \neq 0}^{c} W \cdot x_{t+i} \right) $$

where $c$ is the context window size, $x_{t+i}$ is the one-hot vector of the context word at offset $i$, $W$ is the input embedding matrix, and $W'$ is the output weight matrix.
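
As a concrete illustration, the following minimal NumPy sketch computes this averaged-context prediction over a toy five-word vocabulary. The matrices are randomly initialized rather than trained, and all names and sizes are purely illustrative.

```python
import numpy as np

# Toy CBOW forward pass: average the context embeddings, then score every word.
vocab = ["the", "cat", "sat", "on", "mat"]
word_to_id = {w: i for i, w in enumerate(vocab)}

V, d = len(vocab), 8                           # vocabulary size, embedding dimension
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(V, d))         # input embedding matrix
W_out = rng.normal(scale=0.1, size=(d, V))     # output weight matrix (W')

def cbow_predict(context_words):
    """Average the context embeddings and return a softmax over the vocabulary."""
    ids = [word_to_id[w] for w in context_words]
    h = W[ids].mean(axis=0)                    # (1/2c) * sum of context embeddings
    scores = h @ W_out                         # unnormalized score for each word
    exp = np.exp(scores - scores.max())        # numerically stable softmax
    return exp / exp.sum()

probs = cbow_predict(["the", "cat", "on", "mat"])
print(vocab[int(np.argmax(probs))])            # the model's guess for the center word
```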

Characteristics:

  • Faster to train
  • Better for frequent words
  • Smooths over distributional information

Skip-gram

Skip-gram predicts context words from a target word (opposite of CBOW).

Input: Target word (e.g., “sat”)
Output: Context words (e.g., “the”, “cat”, “on”, “the”, “mat”)

$$ P(w_{context} \mid w_{target}) = \frac{\exp(v'_{w_{context}} \cdot v_{w_{target}})}{\sum_{w \in V} \exp(v'_{w} \cdot v_{w_{target}})} $$

where $v_w$ and $v'_w$ are the input and output vectors of word $w$, and $V$ is the vocabulary.
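
A minimal sketch of this full-softmax scoring, again with a toy vocabulary and untrained random vectors (all names below are illustrative):

```python
import numpy as np

# Toy skip-gram scoring: probability of a context word given a target word,
# using separate input (v) and output (v') embedding tables.
vocab = ["the", "cat", "sat", "on", "mat"]
word_to_id = {w: i for i, w in enumerate(vocab)}

V, d = len(vocab), 8
rng = np.random.default_rng(0)
v_in = rng.normal(scale=0.1, size=(V, d))      # v_w : input (target) vectors
v_out = rng.normal(scale=0.1, size=(V, d))     # v'_w: output (context) vectors

def p_context_given_target(context, target):
    """Full-softmax probability P(context | target) over the whole vocabulary."""
    scores = v_out @ v_in[word_to_id[target]]  # dot product with every v'_w
    exp = np.exp(scores - scores.max())
    return exp[word_to_id[context]] / exp.sum()

print(p_context_given_target("cat", "sat"))
```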

Characteristics:

  • Produces higher-quality embeddings for rare and infrequent words
  • More computationally expensive

Architecture Comparison

| Aspect | CBOW | Skip-gram |
|---|---|---|
| Prediction direction | Context → Word | Word → Context |
| Training speed | Faster | Slower |
| Rare words | Worse | Better |
| Frequent words | Better | Good |
| Memory usage | Lower | Higher |

Training Optimizations

Negative Sampling

Computing the full softmax over a large vocabulary is expensive. Negative sampling approximates it by:

  1. For each positive (target, context) pair, sample $k$ negative examples
  2. Train a binary classifier to distinguish positive from negative pairs

For a positive pair with input word $w_I$ and context word $w_O$, the per-pair objective to maximize is

$$ \log \sigma(v'_{w_O} \cdot v_{w_I}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma(-v'_{w_i} \cdot v_{w_I}) \right] $$

where $\sigma$ is the sigmoid function and $P_n(w)$ is the noise distribution from which negative samples are drawn.
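
A rough NumPy sketch of this per-pair loss, assuming toy untrained embedding tables and a unigram-to-the-3/4 noise distribution; the sizes and names below are made up for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

V, d, k = 1000, 50, 5                           # toy vocabulary size, dimension, negatives per pair
rng = np.random.default_rng(0)
v_in = rng.normal(scale=0.1, size=(V, d))       # v_{w_I}: input (target) vectors
v_out = rng.normal(scale=0.1, size=(V, d))      # v'_{w}: output (context) vectors

# Noise distribution P_n(w): unigram counts raised to 3/4 (the paper's reported default),
# with fake counts standing in for real corpus statistics.
counts = rng.integers(1, 100, size=V).astype(float)
noise_dist = counts ** 0.75
noise_dist /= noise_dist.sum()

def negative_sampling_loss(target_id, context_id):
    """Negated objective: one positive (target, context) pair plus k sampled negatives."""
    pos = np.log(sigmoid(v_out[context_id] @ v_in[target_id]))          # pull positive pair together
    neg_ids = rng.choice(V, size=k, p=noise_dist)                       # draw k noise words
    neg = np.log(sigmoid(-(v_out[neg_ids] @ v_in[target_id]))).sum()    # push negatives apart
    return -(pos + neg)

print(negative_sampling_loss(target_id=3, context_id=17))
```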

Subsampling Frequent Words

Common words like “the”, “a”, “is” provide less information. Subsampling discards them with probability:

$$ P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}} $$

where $f(w_i)$ is the relative frequency of word $w_i$ in the corpus and $t$ is a chosen threshold (typically $10^{-5}$).
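
A small sketch of this discard rule, with made-up word frequencies purely for illustration:

```python
import numpy as np

t = 1e-5
# Fake relative frequencies: "the" is very common, the others are rare.
freqs = {"the": 0.05, "cat": 0.0002, "embedding": 0.00001}

def discard_prob(word):
    """P(discard) = 1 - sqrt(t / f(w)), clipped at 0 so rare words are always kept."""
    return max(0.0, 1.0 - np.sqrt(t / freqs[word]))

rng = np.random.default_rng(0)
for w in freqs:
    keep = rng.random() >= discard_prob(w)       # sample once per token occurrence
    print(w, round(discard_prob(w), 4), "kept" if keep else "dropped")
```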


Properties of Word2Vec Embeddings

Analogical Reasoning

Word2Vec embeddings capture semantic relationships through vector arithmetic:

| Relationship | Example |
|---|---|
| Gender | king - man + woman ≈ queen |
| Capital | Paris - France + Italy ≈ Rome |
| Tense | walking - walk + swim ≈ swimming |
| Comparative | bigger - big + small ≈ smaller |
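
A sketch of such analogy queries using gensim's KeyedVectors; the vector file name is a placeholder for any pretrained word2vec vectors in the standard binary format.

```python
from gensim.models import KeyedVectors

# Load pretrained vectors (placeholder path; any word2vec-format binary file works).
vectors = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# king - man + woman ≈ queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# Paris - France + Italy ≈ Rome
print(vectors.most_similar(positive=["Paris", "Italy"], negative=["France"], topn=1))
```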

Clustering

Similar words cluster together in the embedding space:

  • Countries cluster together
  • Verbs cluster by tense
  • Adjectives cluster by sentiment
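
As a rough illustration, word vectors can be grouped with an off-the-shelf clustering algorithm. The sketch below assumes the `vectors` object loaded in the previous snippet and uses scikit-learn's k-means; the word list is arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans

# A handful of words expected to fall into country / verb / adjective groups.
words = ["France", "Italy", "Spain", "walking", "swimming", "running", "good", "bad", "terrible"]
X = np.stack([vectors[w] for w in words])      # look up each word's embedding

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
for cluster in range(3):
    print(cluster, [w for w, l in zip(words, labels) if l == cluster])
```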

Limitations

| Limitation | Description |
|---|---|
| Static embeddings | One vector per word regardless of context |
| Out-of-vocabulary | Cannot handle words not seen during training |
| Polysemy | “bank” (financial) and “bank” (river) share the same embedding |
| Training data bias | Embeddings reflect biases in the training corpus |

Comparison with Other Methods

| Method | Type | Context | Key Difference |
|---|---|---|---|
| Word2Vec | Neural | Local (window) | Predictive model |
| GloVe | Matrix factorization | Global (corpus-wide) | Co-occurrence statistics |
| BERT | Transformer | Bidirectional | Contextual embeddings |

See Also

  • Vector Embeddings - Overview of embedding techniques
  • GloVe - Alternative using global co-occurrence
  • BERT - Contextual embeddings from Transformers
  • Semantic Search - Applications of embeddings

References


  1. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781. https://arxiv.org/abs/1301.3781