Vector Embeddings

A vector embedding is a numerical representation of data (text, images, audio) as a list of numbers (a vector) in a high-dimensional space. Machine learning models use embeddings to understand and process complex data by converting it into a format where mathematical operations can reveal relationships, similarities, and patterns.[1]


Why Embeddings Matter

Traditional approaches represent each word as a one-hot vector: a sparse vector the length of the vocabulary, with a single 1 and zeros everywhere else. This approach has two major problems, summarized below:

| Approach | Dimensionality | Semantic Similarity | Sparsity |
|---|---|---|---|
| One-hot encoding | Vocabulary size (e.g., 50,000) | None captured | Extremely sparse |
| Dense embeddings | 100-1024 | Captured | Dense |

Dense embeddings solve both problems by representing words in a lower-dimensional space where semantically similar words are close together.
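
A toy sketch of the contrast (the vectors below are made up for illustration, not taken from a real model):

```python
import numpy as np

# One-hot: distinct words share no dimensions, so their dot product is always 0.
cat_onehot = np.zeros(50_000); cat_onehot[101] = 1.0
dog_onehot = np.zeros(50_000); dog_onehot[2042] = 1.0
print(cat_onehot @ dog_onehot)  # 0.0 -- "cat" and "dog" look completely unrelated

# Dense (toy 4-dimensional vectors): related words can point in similar directions.
cat_dense = np.array([0.8, 0.1, 0.6, 0.2])
dog_dense = np.array([0.7, 0.2, 0.5, 0.3])
print(cat_dense @ dog_dense / (np.linalg.norm(cat_dense) * np.linalg.norm(dog_dense)))  # close to 1
```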


Key Properties of Embeddings

Semantic Similarity

Words or concepts with similar meanings have similar vector representations. The angle between two vectors (typically measured with cosine similarity) reflects their semantic relatedness:

$$ \text{similarity}(A, B) = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} $$
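
In code, cosine similarity is a one-liner with NumPy (a minimal sketch, not tied to any particular embedding model):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal, -1 = opposite."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])))  # 1.0: parallel vectors
```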

Analogical Reasoning

Good embeddings capture relationships. The famous example:

$$ \text{king} - \text{man} + \text{woman} \approx \text{queen} $$

This works because the vector difference between “king” and “man” approximately captures the concept of royalty; adding that difference to “woman” produces a vector whose nearest neighbor is “queen”.
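
This can be tried with a pre-trained word2vec model; the sketch below assumes gensim and its downloadable 'word2vec-google-news-300' vectors:

```python
import gensim.downloader as api

# Load pre-trained word2vec vectors (assumes network access; a large one-time download).
vectors = api.load("word2vec-google-news-300")

# king - man + woman: gensim adds the "positive" vectors and subtracts the "negative" ones.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# "queen" is expected at or near the top of the returned neighbors.
```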


Types of Embeddings

Static Embeddings

Each word has a single fixed vector representation regardless of context.

| Method | Approach | Training Objective |
|---|---|---|
| Word2Vec | Neural network | Predict a word from its surrounding context (CBOW) or the surrounding context from a word (Skip-gram) |
| GloVe | Matrix factorization | Model global word co-occurrence statistics |
| FastText | Neural network | Like Word2Vec, but includes subword (character n-gram) information |

Limitation: “Bank” has the same embedding whether referring to a financial institution or a river bank.
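
Static embeddings can also be trained on your own corpus; below is a minimal sketch using gensim's Word2Vec (the toy corpus and hyperparameters are purely illustrative):

```python
from gensim.models import Word2Vec

# Toy corpus: each "sentence" is a list of tokens.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

# sg=1 selects the Skip-gram objective; sg=0 would select CBOW.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)
print(model.wv["cat"].shape)  # (100,) -- one fixed vector per vocabulary word
```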

Contextual Embeddings

Each word’s representation depends on its surrounding context.

| Model | Architecture | Key Innovation |
|---|---|---|
| BERT | Transformer encoder | Bidirectional context, masked language modeling |
| GPT | Transformer decoder | Autoregressive, left-to-right context |
| RoBERTa | Transformer encoder | Improved BERT training |

Advantage: “Bank” gets different embeddings in “I deposited money at the bank” vs “I sat by the river bank”.
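
A sketch of extracting contextual token embeddings with the Hugging Face transformers library (the model choice and the decision to index the first matching token are illustrative):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed_word(sentence: str, word: str) -> torch.Tensor:
    """Return the contextual embedding of the first occurrence of `word` in `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (sequence_length, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

a = embed_word("I deposited money at the bank", "bank")
b = embed_word("I sat by the river bank", "bank")
print(torch.cosine_similarity(a, b, dim=0).item())  # below 1.0: same word, different vectors in different contexts
```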


Embedding Dimensions

Common embedding sizes:

| Model | Dimensions | Use Case |
|---|---|---|
| Word2Vec | 100-300 | Traditional NLP |
| GloVe | 50-300 | Traditional NLP |
| BERT-base | 768 | General purpose |
| BERT-large | 1024 | Higher capacity |
| OpenAI ada-002 | 1536 | Production applications |
| OpenAI text-embedding-3-large | 3072 | High-precision retrieval |

Higher dimensions capture more nuance but require more storage and computation.
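
As a rough illustration of the storage side of that trade-off (assuming uncompressed 4-byte float32 values):

```python
# Back-of-the-envelope storage for 1 million float32 embeddings.
dims = 1536                  # e.g., an ada-002-sized vector
num_vectors = 1_000_000
total_bytes = num_vectors * dims * 4
print(f"{total_bytes / 1e9:.1f} GB")  # ~6.1 GB before any index overhead or compression
```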


Applications

| Application | How Embeddings Are Used |
|---|---|
| Semantic search | Query and document embeddings compared for relevance |
| Recommendation systems | User and item embeddings for similarity matching |
| Clustering | Group similar documents or entities |
| Classification | Input features for ML classifiers |
| Anomaly detection | Identify outliers in embedding space |
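
As one concrete example from the table, sentence embeddings can be clustered with an ordinary algorithm such as k-means; the sketch below assumes scikit-learn and the sentence-transformers model introduced in the next section, with a made-up document set:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

docs = ["cats purr when happy", "dogs bark at strangers",
        "stocks fell sharply today", "markets rallied this week"]

embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
print(labels)  # animal sentences and finance sentences tend to land in separate clusters
```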

Creating Embeddings

Pre-trained Models

For most applications, use pre-trained embedding models:

```python
# Using sentence-transformers (pip install sentence-transformers)
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(["Hello world", "Hi there"])
print(embeddings.shape)  # (2, 384): one 384-dimensional vector per input sentence
```
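
Building on that snippet, a minimal semantic-search sketch (the document set and query are made up; `util.cos_sim` is sentence-transformers' built-in cosine similarity):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

docs = ["How to reset a password", "Best hiking trails nearby", "Troubleshooting login issues"]
doc_embeddings = model.encode(docs, convert_to_tensor=True)

query_embedding = model.encode("I can't sign in to my account", convert_to_tensor=True)
scores = util.cos_sim(query_embedding, doc_embeddings)[0]  # cosine similarity to each document
print(docs[int(scores.argmax())])  # expected: one of the login/password documents
```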

Fine-tuning

For domain-specific applications, fine-tune embeddings on your data to improve performance on specialized vocabulary and concepts.
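
A minimal sketch of one common approach, contrastive fine-tuning with sentence-transformers (the paired examples are made up, and the older `model.fit` training API is assumed):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('all-MiniLM-L6-v2')

# Pairs of texts that should end up close together in embedding space.
train_examples = [
    InputExample(texts=["how do I get a refund", "our refund and returns policy"]),
    InputExample(texts=["reset my password", "account recovery guide"]),
]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=2)

# Treats the other in-batch examples as negatives.
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_loader, loss)], epochs=1, warmup_steps=10)
```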


See Also

  • Word2Vec - Original neural network approach to word embeddings
  • GloVe - Global vectors from co-occurrence statistics
  • BERT - Contextual embeddings from Transformers
  • Semantic Search - Using embeddings for retrieval

References


  1. Mikolov, T., et al. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781. https://arxiv.org/abs/1301.3781