Embeddings (vector representations)

mature

How it works

Static embeddings (Word2Vec Skip-gram):

1. For each central word in a context window, predict surrounding words (Skip-gram) or vice versa (CBOW).
2. Training via negative sampling: for each true (word, context) pair, increase P(context|word) and decrease P(negative|word) for random negatives.
3. The resulting embedding vectors are the rows of the input-layer weight matrix.
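A minimal sketch of a single negative-sampling update may make step 2 concrete. The names `sgns_step`, `W_in` and `W_out` are illustrative and not taken from the original Word2Vec code; the vocabulary, learning rate and fixed negatives are toy assumptions.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sgns_step(W_in, W_out, center, context, negatives, lr=0.05):
    """One SGD step of skip-gram with negative sampling.

    Returns the loss measured before the update.
    """
    v = W_in[center]              # input-side vector of the center word
    grad_v = [0.0] * len(v)
    loss = 0.0
    # Label 1.0 for the true context word, 0.0 for each sampled negative.
    for word, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        u = W_out[word]           # output-side vector
        score = sigmoid(sum(a * b for a, b in zip(v, u)))
        loss -= math.log(score if label == 1.0 else 1.0 - score)
        g = score - label         # gradient of the loss w.r.t. the dot product
        for i in range(len(v)):
            grad_v[i] += g * u[i]     # accumulate before u is modified
            u[i] -= lr * g * v[i]
    for i in range(len(v)):
        v[i] -= lr * grad_v[i]
    return loss

# Toy setup (assumption): 6-word vocabulary, 4-dimensional vectors.
random.seed(0)
vocab_size, dim = 6, 4
W_in = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(vocab_size)]
W_out = [[0.0] * dim for _ in range(vocab_size)]

# Repeatedly train on one (center, context) pair with two fixed negatives.
losses = [sgns_step(W_in, W_out, center=0, context=1, negatives=[3, 4])
          for _ in range(200)]
print(f"loss before: {losses[0]:.3f}, after: {losses[-1]:.3f}")
```

In a real trainer the negatives would be resampled for every pair from a smoothed unigram distribution; they are fixed here only to keep the sketch short. After training, the rows of `W_in` are the word embeddings.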

Contextual embeddings (Transformer):

1. Token embedding layer: each token ID is mapped to a d_model-dimensional vector from a [|V| × d_model] matrix.
2. Positional information is injected: sinusoidal or learned positional embeddings are added to the token embeddings, whereas RoPE instead rotates the query and key vectors inside each attention layer.
3. Successive Transformer layers contextually transform these representations; with bidirectional self-attention each token "sees" all others, while causal (decoder-style) models see only the preceding tokens.
4. Output pooling: the CLS token, mean pooling or last-token pooling yields a sentence/document embedding.
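The input and output ends of this pipeline can be sketched in a few lines. This is an illustration under stated assumptions: the Transformer layers in between are omitted entirely, only the sinusoidal variant of positional encoding is shown (not RoPE), and the tiny `embedding_matrix` is made up rather than learned.

```python
import math

def sinusoidal_position(pos, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same angle)."""
    pe = []
    for k in range(d_model):
        # Even and odd dimensions of a pair share one frequency.
        angle = pos / (10000 ** ((k - k % 2) / d_model))
        pe.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return pe

def embed_tokens(token_ids, embedding_matrix):
    """Look up each token's row and add the positional encoding for its position."""
    d_model = len(embedding_matrix[0])
    out = []
    for pos, token_id in enumerate(token_ids):
        token_vec = embedding_matrix[token_id]
        pos_vec = sinusoidal_position(pos, d_model)
        out.append([t + p for t, p in zip(token_vec, pos_vec)])
    return out

def mean_pool(hidden_states):
    """Average per-token vectors into one sentence-level vector."""
    n = len(hidden_states)
    return [sum(column) / n for column in zip(*hidden_states)]

# Toy [|V| x d_model] matrix with |V| = 3 and d_model = 4 (illustrative values).
embedding_matrix = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
]
inputs = embed_tokens([1, 2], embedding_matrix)
# A real model would pass `inputs` through its Transformer layers before pooling.
sentence_vec = mean_pool(inputs)
print(sentence_vec)
```

Mean pooling is used here because it needs no special token; CLS or last-token pooling would simply select one row of the final hidden states instead of averaging them.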

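Semantic search:

1. The query and the documents are encoded into the same embedding space using the same model.
2. Similarity is computed as cosine similarity (equivalent to the dot product when vectors are normalised). At scale, an approximate nearest-neighbour (ANN) index such as HNSW or IVF replaces exhaustive comparison.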
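As a rough illustration of the search step, the sketch below ranks documents by cosine similarity with an exhaustive scan. The vectors and the helper name `cosine_top_k` are made up for the example; a production system would obtain vectors from an embedding model and use an ANN index instead of scanning every document.

```python
import math

def normalize(v):
    """Scale a vector to unit length so that dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_top_k(query, docs, k=2):
    """Return the k (doc_id, score) pairs with the highest cosine similarity."""
    q = normalize(query)
    scored = []
    for doc_id, vec in docs.items():
        d = normalize(vec)
        scored.append((sum(a * b for a, b in zip(q, d)), doc_id))
    scored.sort(reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:k]]

# Hand-written 3-D vectors standing in for real model output (assumption).
docs = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.3, 0.0],
    "car": [0.0, 0.1, 0.9],
}
results = cosine_top_k([1.0, 0.0, 0.0], docs, k=2)
print(results)
```

The exhaustive scan costs one dot product per document, which is why indexes such as HNSW or IVF become necessary as the collection grows.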

Problem solved

ML models cannot operate directly on discrete objects (words, tokens, categories); they require a numerical representation of the input. One-hot encoding creates very sparse, high-dimensional vectors that carry no semantic similarity information: the cosine similarity between any two distinct words is zero, so every pair is equally far apart. Embeddings solve both problems by representing objects as dense vectors in a continuous space where geometric proximity reflects semantic similarity.
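The contrast can be checked numerically. In this sketch the dense vectors are hand-picked for illustration, not learned, and the three-word vocabulary is an assumption.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

vocab = ["king", "queen", "banana"]

# One-hot: every pair of distinct words is orthogonal.
one_hot_sim = cosine(one_hot(0, len(vocab)), one_hot(1, len(vocab)))

# Dense 2-D vectors (illustrative values): related words point in similar directions.
dense = {"king": [0.8, 0.6], "queen": [0.6, 0.8], "banana": [-0.7, 0.1]}
related_sim = cosine(dense["king"], dense["queen"])
unrelated_sim = cosine(dense["king"], dense["banana"])

print(one_hot_sim, related_sim, unrelated_sim)
```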

Key mechanisms

Embedding matrix: a table of vectors of shape [|V|, d], indexed by token ID

Cosine similarity and Euclidean distance as semantic measures

Contrastive training — pull similar objects together, push dissimilar apart

Word2Vec Skip-gram: predict context from a central word

Word2Vec CBOW: predict the central word from context

Negative sampling: an efficient approximation of the softmax over a large vocabulary

GloVe: factorization of a global co-occurrence matrix (log-bilinear)

Contextual embeddings — the vector comes from hidden-layer activations in a given context

Pooling (mean, CLS, last-token) to obtain a sentence/document embedding

Matryoshka embeddings: nested representations whose leading dimensions already form usable lower-dimensional embeddings
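The Matryoshka idea from the list above can be illustrated by how such an embedding is shortened at query time: keep the leading dimensions and renormalise. This sketch assumes the vector came from a model trained with a Matryoshka-style objective; truncating an ordinary embedding this way generally loses much more quality. The function name is illustrative.

```python
import math

def truncate_embedding(vec, dims):
    """Keep the first `dims` values and rescale the result to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

full = [3.0, 4.0, 12.0]          # toy 3-D embedding (illustrative values)
short = truncate_embedding(full, 2)
print(short)
```

Renormalising matters because the truncated vector no longer has unit length, and dot-product search assumes normalised inputs.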

Strengths & limitations

Strengths

✓Continuous, dense representations — efficient memory and vector operations

✓Preserve semantic relations (synonymy, analogies)

✓Universal — apply to text, image, audio, graph, multimodal data

✓Pre-trained embeddings drastically reduce downstream-data requirements

✓Similarity search is fast: comparing two embeddings is a single dot product

✓Scalability: vector databases with ANN indexes can serve billions of embeddings, typically at <100 ms query latency

✓Compositional: vector arithmetic can carry semantic meaning (e.g. king - man + woman ≈ queen)

Limitations

✗Static embeddings cannot handle polysemy (one vector per word, regardless of sense)

✗Sensitive to training data — they reflect social and cultural biases

✗Out-of-vocabulary — Word2Vec/GloVe cannot represent unseen words

✗High dimensionality increases memory cost (e.g. 1536-D × millions of documents)

✗Cosine similarity is not a perfect semantic measure — sensitive to anisotropy

✗Contextual embeddings require costly pre-training of a large model

✗Individual dimensions are not interpretable

✗Embeddings from different models are not interchangeable (different spaces)
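The memory-cost limitation above is easy to estimate with back-of-the-envelope arithmetic. The sketch below counts raw vector storage only; index structures and metadata add overhead that is not modelled here, and the one-million-document corpus is an assumed example.

```python
def raw_vector_bytes(n_vectors, dims, bytes_per_value=4):
    """Storage for the vectors alone: count x dimensions x bytes per component."""
    return n_vectors * dims * bytes_per_value

# 1,000,000 documents at 1536 dimensions (corpus size is an assumption).
float32_bytes = raw_vector_bytes(1_000_000, 1536)                  # 4-byte floats
int8_bytes = raw_vector_bytes(1_000_000, 1536, bytes_per_value=1)  # 8-bit quantised

print(f"float32: {float32_bytes / 1e9:.2f} GB")
print(f"int8:    {int8_bytes / 1e9:.2f} GB")
```

At float32 precision this comes to roughly 6.1 GB per million documents, which is why quantisation and reduced dimensionality are common cost controls.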

Implementation

Implementation pitfalls

Curse of dimensionality in cosine similarity · Medium

At very high dimensions (e.g. 4096-D), cosine similarities between vectors tend to concentrate in a narrow range, so near and far neighbours become harder to tell apart. Mitigations include normalization and, optionally, dimensionality reduction (PCA, UMAP).
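The concentration effect can be observed with random vectors. This sketch uses independent Gaussian vectors, which is an idealisation: real embeddings are not random, but the narrowing spread of similarities with growing dimension illustrates the same tendency. The dimensions 8 and 1024 are arbitrary choices.

```python
import math
import random
import statistics

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def cosine_spread(dim, pairs=200, seed=0):
    """Standard deviation of cosine similarity over random Gaussian vector pairs."""
    rng = random.Random(seed)
    sims = []
    for _ in range(pairs):
        a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        b = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        sims.append(cosine(a, b))
    return statistics.pstdev(sims)

spread_low = cosine_spread(8)       # low-dimensional vectors
spread_high = cosine_spread(1024)   # high-dimensional vectors
print(f"spread at 8-D: {spread_low:.3f}, at 1024-D: {spread_high:.3f}")
```

For such random vectors the spread shrinks roughly in proportion to 1/sqrt(d), so the gap between a "similar" and a "dissimilar" pair gets smaller as d grows.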

Domain shift: embeddings trained on one domain transfer poorly to another · Medium

An embedding model trained on general text (e.g. Wikipedia) often produces weaker representations for specialized text (medicine, law, code). The usual remedy is fine-tuning on in-domain data or choosing a dedicated domain model.

No embedding refresh after data changes · Medium

Embeddings are static after generation: changing a source document does not automatically update its embedding in the vector store. The pipeline needs a mechanism that detects changed or deleted documents, invalidates their entries and re-embeds them.
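One simple way to detect stale entries is to store a content hash next to each vector and re-embed only when the hash changes. The sketch below is an assumption-laden illustration: `store` is a plain dictionary standing in for a vector database, and `fake_embed` is a placeholder for a real embedding model call.

```python
import hashlib

def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def sync_embeddings(documents, store, embed):
    """Re-embed new or changed documents and drop entries whose source is gone.

    Returns the ids that were (re-)embedded.
    """
    refreshed = []
    for doc_id, text in documents.items():
        digest = content_hash(text)
        entry = store.get(doc_id)
        if entry is None or entry["hash"] != digest:
            store[doc_id] = {"hash": digest, "vector": embed(text)}
            refreshed.append(doc_id)
    for doc_id in list(store):
        if doc_id not in documents:
            del store[doc_id]   # source document was deleted
    return refreshed

def fake_embed(text):
    """Placeholder for a real model call (hypothetical)."""
    return [float(len(text))]

store = {}
docs = {"a": "hello", "b": "world"}
first = sync_embeddings(docs, store, fake_embed)    # both documents are new

docs["a"] = "hello again"                           # edit one document
del docs["b"]                                       # delete the other
second = sync_embeddings(docs, store, fake_embed)   # only "a" is re-embedded
print(first, second, sorted(store))
```

Hashing avoids paying for re-embedding unchanged documents; changing the embedding model itself would still require regenerating every vector, since embeddings from different models are not interchangeable.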

Evolution

Original paper · 2013 · Tomas Mikolov

Efficient Estimation of Word Representations in Vector Space (Word2Vec)

Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean

1986

Rumelhart, Hinton and Williams introduce distributed representations as an alternative to one-hot encoding.

2003

Bengio et al. publish "A Neural Probabilistic Language Model" — the first neural language model with learned word embeddings.

2013

Mikolov et al. release Word2Vec (CBOW + Skip-gram) — embeddings become a standard NLP tool.

2013

Mikolov et al. publish the Word2Vec extension with negative sampling and hierarchical softmax — drastic training speedup.

2014

Pennington, Socher, Manning publish GloVe (Global Vectors) — embeddings based on a global co-occurrence matrix.

2016

Bojanowski et al. (Facebook AI) release fastText: word embeddings built from character n-grams (subwords), which lets the model compose vectors for OOV words.

2018

ELMo (Peters et al.) and BERT (Devlin et al.) introduce contextual embeddings — a vector that depends on the surrounding context.

2019

Reimers and Gurevych publish Sentence-BERT (SBERT) — efficient sentence embeddings for retrieval and clustering.

2022

OpenAI releases its first public embeddings API in January, followed by text-embedding-ada-002 in December, marking the start of the commercial embedding-model era.

2024

OpenAI text-embedding-3 exposes adjustable-dimension embeddings, an approach associated with Matryoshka Representation Learning (Kusupati et al., 2022).

Sources

Efficient Estimation of Word Representations in Vector Space (Word2Vec)

Paper

arXiv (Mikolov et al. 2013)

Distributed Representations of Words and Phrases and their Compositionality

Paper

arXiv (Mikolov et al. 2013)

GloVe: Global Vectors for Word Representation

Paper

Stanford NLP (Pennington, Socher, Manning 2014)

Enriching Word Vectors with Subword Information (fastText)

Paper

arXiv (Bojanowski et al. 2016)

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Paper

arXiv (Devlin et al. 2018)

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

Paper

arXiv (Reimers & Gurevych 2019)

MTEB Leaderboard — Massive Text Embedding Benchmark

Documentation

Hugging Face

OpenAI Embeddings Guide

Documentation

OpenAI
