Vector Embeddings

A vector embedding is a numerical representation of data (text, images, audio) as a list of numbers (a vector) in a high-dimensional space. Machine learning models use embeddings to understand and process complex data by converting it into a format where mathematical operations can reveal relationships, similarities, and patterns.[1]


Why Embeddings Matter

Traditional approaches represent each word as a one-hot vector: a sparse vector the length of the vocabulary, with a single 1 and zeros everywhere else. This approach has two major problems, summarized below:

| Approach | Dimensionality | Semantic Similarity | Sparsity |
|---|---|---|---|
| One-hot encoding | Vocabulary size (e.g., 50,000) | None captured | Extremely sparse |
| Dense embeddings | 100-1024 | Captured | Dense |

Dense embeddings solve both problems by representing words in a lower-dimensional space where semantically similar words are close together.
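
A toy sketch of the contrast (the vectors below are made up for illustration, not taken from a real model):

```python
import numpy as np

# One-hot: distinct words share no dimensions, so their dot product is always 0.
cat_onehot = np.zeros(50_000); cat_onehot[101] = 1.0
dog_onehot = np.zeros(50_000); dog_onehot[2042] = 1.0
print(cat_onehot @ dog_onehot)  # 0.0 -- "cat" and "dog" look completely unrelated

# Dense (toy 4-dimensional vectors): related words can point in similar directions.
cat_dense = np.array([0.8, 0.1, 0.6, 0.2])
dog_dense = np.array([0.7, 0.2, 0.5, 0.3])
print(cat_dense @ dog_dense / (np.linalg.norm(cat_dense) * np.linalg.norm(dog_dense)))  # close to 1
```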


Key Properties of Embeddings

Semantic Similarity

Words or concepts with similar meanings have similar vector representations. The angle between two vectors (typically measured with cosine similarity) reflects their semantic relatedness:

$$ \text{similarity}(A, B) = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} $$
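
In code, cosine similarity is a one-liner with NumPy (a minimal sketch, not tied to any particular embedding model):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal, -1 = opposite."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])))  # 1.0: parallel vectors
```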

Analogical Reasoning

Good embeddings capture relationships. The famous example:

$$ \text{king} - \text{man} + \text{woman} \approx \text{queen} $$

This works because the vector difference between “king” and “man” approximately captures the concept of royalty; adding that difference to “woman” produces a vector whose nearest neighbor is “queen”.
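
This can be tried with a pre-trained word2vec model; the sketch below assumes gensim and its downloadable 'word2vec-google-news-300' vectors:

```python
import gensim.downloader as api

# Load pre-trained word2vec vectors (assumes network access; a large one-time download).
vectors = api.load("word2vec-google-news-300")

# king - man + woman: gensim adds the "positive" vectors and subtracts the "negative" ones.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# "queen" is expected at or near the top of the returned neighbors.
```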


Types of Embeddings

Static Embeddings

Each word has a single fixed vector representation regardless of context.

| Method | Approach | Training Objective |
|---|---|---|
| Word2Vec | Neural network | Predict a word from its surrounding context (CBOW) or the surrounding context from a word (Skip-gram) |
| GloVe | Matrix factorization | Model global word co-occurrence statistics |
| FastText | Neural network | Like Word2Vec, but includes subword (character n-gram) information |

Limitation: “Bank” has the same embedding whether referring to a financial institution or a river bank.
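
Static embeddings can also be trained on your own corpus; below is a minimal sketch using gensim's Word2Vec (the toy corpus and hyperparameters are purely illustrative):

```python
from gensim.models import Word2Vec

# Toy corpus: each "sentence" is a list of tokens.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

# sg=1 selects the Skip-gram objective; sg=0 would select CBOW.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)
print(model.wv["cat"].shape)  # (100,) -- one fixed vector per vocabulary word
```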

Contextual Embeddings

Each word’s representation depends on its surrounding context.

| Model | Architecture | Key Innovation |
|---|---|---|
| BERT | Transformer encoder | Bidirectional context, masked language modeling |
| GPT | Transformer decoder | Autoregressive, left-to-right context |
| RoBERTa | Transformer encoder | Improved BERT training |

Advantage: “Bank” gets different embeddings in “I deposited money at the bank” vs “I sat by the river bank”.
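
A sketch of extracting contextual token embeddings with the Hugging Face transformers library (the model choice and the decision to index the first matching token are illustrative):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed_word(sentence: str, word: str) -> torch.Tensor:
    """Return the contextual embedding of the first occurrence of `word` in `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (sequence_length, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

a = embed_word("I deposited money at the bank", "bank")
b = embed_word("I sat by the river bank", "bank")
print(torch.cosine_similarity(a, b, dim=0).item())  # below 1.0: same word, different vectors in different contexts
```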


Embedding Dimensions

Common embedding sizes:

| Model | Dimensions | Use Case |
|---|---|---|
| Word2Vec | 100-300 | Traditional NLP |
| GloVe | 50-300 | Traditional NLP |
| BERT-base | 768 | General purpose |
| BERT-large | 1024 | Higher capacity |
| OpenAI ada-002 | 1536 | Production applications |
| OpenAI text-embedding-3-large | 3072 | High-precision retrieval |

Higher dimensions capture more nuance but require more storage and computation.
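
As a rough illustration of the storage side of that trade-off (assuming uncompressed 4-byte float32 values):

```python
# Back-of-the-envelope storage for 1 million float32 embeddings.
dims = 1536                  # e.g., an ada-002-sized vector
num_vectors = 1_000_000
total_bytes = num_vectors * dims * 4
print(f"{total_bytes / 1e9:.1f} GB")  # ~6.1 GB before any index overhead or compression
```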


Applications

| Application | How Embeddings Are Used |
|---|---|
| Semantic search | Query and document embeddings compared for relevance |
| Recommendation systems | User and item embeddings for similarity matching |
| Clustering | Group similar documents or entities |
| Classification | Input features for ML classifiers |
| Anomaly detection | Identify outliers in embedding space |
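
As one concrete example from the table, sentence embeddings can be clustered with an ordinary algorithm such as k-means; the sketch below assumes scikit-learn and the sentence-transformers model introduced in the next section, with a made-up document set:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

docs = ["cats purr when happy", "dogs bark at strangers",
        "stocks fell sharply today", "markets rallied this week"]

embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
print(labels)  # animal sentences and finance sentences tend to land in separate clusters
```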

Creating Embeddings

Pre-trained Models

For most applications, use pre-trained embedding models:

```python
# Using sentence-transformers (pip install sentence-transformers)
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(["Hello world", "Hi there"])
print(embeddings.shape)  # (2, 384): one 384-dimensional vector per input sentence
```
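
Building on that snippet, a minimal semantic-search sketch (the document set and query are made up; `util.cos_sim` is sentence-transformers' built-in cosine similarity):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

docs = ["How to reset a password", "Best hiking trails nearby", "Troubleshooting login issues"]
doc_embeddings = model.encode(docs, convert_to_tensor=True)

query_embedding = model.encode("I can't sign in to my account", convert_to_tensor=True)
scores = util.cos_sim(query_embedding, doc_embeddings)[0]  # cosine similarity to each document
print(docs[int(scores.argmax())])  # expected: one of the login/password documents
```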

Fine-tuning

For domain-specific applications, fine-tune embeddings on your data to improve performance on specialized vocabulary and concepts.
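
A minimal sketch of one common approach, contrastive fine-tuning with sentence-transformers (the paired examples are made up, and the older `model.fit` training API is assumed):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('all-MiniLM-L6-v2')

# Pairs of texts that should end up close together in embedding space.
train_examples = [
    InputExample(texts=["how do I get a refund", "our refund and returns policy"]),
    InputExample(texts=["reset my password", "account recovery guide"]),
]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=2)

# Treats the other in-batch examples as negatives.
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_loader, loss)], epochs=1, warmup_steps=10)
```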


See Also

  • Word2Vec - Original neural network approach to word embeddings
  • GloVe - Global vectors from co-occurrence statistics
  • BERT - Contextual embeddings from Transformers
  • Semantic Search - Using embeddings for retrieval

References


  1. Mikolov, T., et al. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781. https://arxiv.org/abs/1301.3781