Static embeddings (Word2Vec Skip-gram): 1. For each center word, predict the surrounding words in its context window (Skip-gram); CBOW does the reverse and predicts the center word from its context. 2. Training via negative sampling: for each true (word, context) pair, increase P(context|word) and decrease P(negative|word) for randomly sampled negatives. 3. The resulting embedding vectors are the rows of the input-layer weight matrix.
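The negative-sampling update in step 2 can be sketched in NumPy as follows. This is a minimal illustration, not the reference Word2Vec implementation: the toy sizes, the learning rate and the name sgns_step are assumptions, and the negatives are passed in explicitly instead of being drawn from a smoothed unigram distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10, 8  # toy sizes, chosen only for illustration
W_in = rng.normal(scale=0.1, size=(vocab_size, dim))   # input-layer (word) embeddings
W_out = rng.normal(scale=0.1, size=(vocab_size, dim))  # output-layer (context) embeddings


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sgns_step(center, context, negatives, lr=0.1):
    """One SGD step on the negative-sampling loss for a single (center, context) pair."""
    v = W_in[center].copy()
    ids = np.array([context, *negatives])
    labels = np.zeros(len(ids))
    labels[0] = 1.0                      # 1 for the true context, 0 for each negative
    u = W_out[ids]                       # (k + 1, dim) copy of the output vectors
    probs = sigmoid(u @ v)               # model's P(pair is a real co-occurrence)
    loss = -np.log(probs[0]) - np.sum(np.log(1.0 - probs[1:]))
    grad = probs - labels                # gradient of the loss w.r.t. the logits
    W_in[center] -= lr * (grad @ u)
    # subtract.at accumulates correctly even if an ID repeats among the negatives
    np.subtract.at(W_out, ids, lr * np.outer(grad, v))
    return loss


# Repeating the same update should drive the loss for this pair down.
losses = [sgns_step(center=1, context=2, negatives=[5, 7]) for _ in range(100)]
```

After training, W_in plays the role of the embedding table described in step 3.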
Contextual embeddings (Transformer): 1. Token embedding layer: each token ID is mapped to a d_model-dimensional vector from a [|V| × d_model] matrix. 2. Positional information is injected: sinusoidal or learned positional embeddings are added to each token embedding, whereas RoPE instead rotates the query and key vectors inside attention. 3. Successive Transformer layers contextually transform these representations: each token "sees" all others via self-attention. 4. Output pooling: a CLS token, mean pooling or last-token pooling yields a sentence/document embedding.
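Steps 1, 2 and 4 can be sketched in NumPy as below, using the additive sinusoidal variant. The Transformer layers of step 3 are deliberately omitted, the embedding table E is random rather than learned, and all names and sizes are assumptions made for the example.

```python
import numpy as np


def sinusoidal_positions(seq_len, d_model):
    """Sinusoidal position table of shape (seq_len, d_model); d_model must be even."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe


rng = np.random.default_rng(0)
vocab_size, d_model = 100, 16
E = rng.normal(size=(vocab_size, d_model))  # stand-in for a learned [|V| x d_model] table


def embed(token_ids):
    """Token lookup plus additive positional embedding."""
    token_ids = np.asarray(token_ids)
    return E[token_ids] + sinusoidal_positions(len(token_ids), d_model)


def mean_pool(hidden, mask):
    """Average the token vectors whose mask entry is 1 (padding is ignored)."""
    mask = np.asarray(mask, dtype=float)[:, None]
    return (hidden * mask).sum(axis=0) / mask.sum()


hidden = embed([5, 17, 42, 0])                   # ID 0 plays the role of padding here
sentence_vec = mean_pool(hidden, [1, 1, 1, 0])   # one vector for the whole sequence
```

In a real model the pooling would be applied to the output of the last Transformer layer, not to the raw input embeddings.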
Semantic search: 5. A query and the documents are encoded into the same embedding space using the same model. 6. Similarity is computed as cosine similarity (equivalent to the dot product for normalized vectors); at scale, an approximate nearest neighbor (ANN) index such as HNSW or IVF replaces the exhaustive comparison.
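The similarity step can be illustrated with an exact, brute-force search in NumPy. This is only a sketch: the document and query vectors are hand-written stand-ins for model outputs, and top_k is a hypothetical helper. An HNSW or IVF index would approximate the same ranking without scanning every document.

```python
import numpy as np


def normalize(x):
    """Scale vectors to unit L2 norm along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def top_k(query_vec, doc_vecs, k=3):
    """Return indices and cosine similarities of the k most similar documents."""
    sims = normalize(doc_vecs) @ normalize(query_vec)  # dot product of unit vectors
    order = np.argsort(-sims)[:k]
    return order, sims[order]


# Toy vectors standing in for encoder outputs.
docs = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
query = np.array([1.0, 0.0, 0.0])

idx, scores = top_k(query, docs, k=2)
```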
ML models cannot operate directly on discrete objects (words, tokens, categories); they require a numerical representation of the input. One-hot encoding creates very sparse, high-dimensional vectors that carry no semantic similarity information (the cosine similarity between any two distinct words is zero). Embeddings solve both problems: they represent objects as dense vectors in a continuous space where geometric proximity reflects semantic similarity.
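A small comparison makes the contrast concrete. The dense vectors below are hand-picked toy values, not the output of a trained model, and the three-word vocabulary is an assumption for the example.

```python
import numpy as np


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


vocab = ["cat", "dog", "car"]
one_hot = np.eye(len(vocab))  # each word gets its own orthogonal axis

# Hand-picked dense vectors: "cat" and "dog" point in similar directions.
dense = np.array([
    [0.9, 0.8, 0.1],   # cat
    [0.8, 0.9, 0.2],   # dog
    [0.1, 0.2, 0.9],   # car
])

cat_dog_onehot = cosine(one_hot[0], one_hot[1])  # 0.0: no similarity signal at all
cat_dog_dense = cosine(dense[0], dense[1])       # high: related words are close
cat_car_dense = cosine(dense[0], dense[2])       # lower: unrelated words are farther apart
```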
At very high dimensions (e.g. 4096-D), pairwise cosine similarities tend to concentrate in a narrow range, so vectors become harder to tell apart. Mitigations include normalization and, optionally, dimensionality reduction (PCA, UMAP).
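One possible form of the PCA mitigation, followed by L2 normalization, is sketched below with plain NumPy. The random matrix X only stands in for real embeddings, and the target of 8 components is an arbitrary choice for the example.

```python
import numpy as np


def pca_reduce(X, n_components):
    """Project the rows of X onto the top principal components (computed via SVD)."""
    mean = X.mean(axis=0)
    Xc = X - mean                                  # PCA requires centered data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]                 # (n_components, original_dim)
    return Xc @ components.T, components, mean


def l2_normalize(X):
    """Scale each row to unit length so dot product equals cosine similarity."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))   # random stand-in for a batch of embeddings
Z, components, mean = pca_reduce(X, n_components=8)
Z_unit = l2_normalize(Z)
```

New vectors would need the same mean and components applied before they can be compared with the reduced collection.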
An embedding model trained on general text (e.g. Wikipedia) generates poor representations for specialized text (medicine, law, code). Requires fine-tuning or a dedicated model.
Embeddings are static after generation: changing a source document does not automatically update its embedding in the vector store. This requires an invalidation and re-embedding system.
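One simple way to implement invalidation is to store a content hash next to each vector and re-embed only when the hash changes. The sketch below assumes an in-memory dictionary as the store; EmbeddingCache and fake_embed are hypothetical names, and fake_embed is a placeholder for a real embedding call.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of the document text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class EmbeddingCache:
    """Keeps (hash, vector) per document and re-embeds only on content change."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.store = {}  # doc_id -> (content hash, embedding)

    def upsert(self, doc_id, text):
        """Return True if the document was (re-)embedded, False if it was unchanged."""
        digest = content_hash(text)
        cached = self.store.get(doc_id)
        if cached is not None and cached[0] == digest:
            return False
        self.store[doc_id] = (digest, self.embed_fn(text))
        return True


def fake_embed(text):
    # Placeholder for a real model call; returns a one-dimensional "embedding".
    return [float(len(text))]


cache = EmbeddingCache(fake_embed)
cache.upsert("doc-1", "first version")    # embedded
cache.upsert("doc-1", "first version")    # skipped, content unchanged
cache.upsert("doc-1", "second version!")  # re-embedded, content changed
```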
Standard benchmarks: word analogy (Google Analogy, 19,544 questions; Mikolov et al. 2013), word similarity (WordSim-353, SimLex-999), and MTEB (Massive Text Embedding Benchmark, 58 datasets across 8 task types in its original release). Word2Vec 300-D reaches ~72% top-1 on Google Analogy. SBERT improves the Spearman correlation on the STS Benchmark to ~0.85. On MTEB (2024), leading models (Cohere Embed v3, OpenAI text-embedding-3-large, BGE-M3) average above 65, while classical TF-IDF and mean-pooled Word2Vec trail significantly (~40).
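Top-1 accuracy on a word-analogy set is usually computed with the vector-offset method: for a question "a is to b as c is to ?", the answer is the word closest to b - a + c, excluding the three question words. The sketch below uses hand-built toy vectors and two made-up questions, so it shows the scoring procedure only and says nothing about real model quality.

```python
import numpy as np

# Hand-built toy vectors: axis 1 encodes gender, axis 2 encodes royalty.
vectors = {
    "man":   np.array([1.0, 0.0, 0.0]),
    "woman": np.array([1.0, 1.0, 0.0]),
    "king":  np.array([1.0, 0.0, 1.0]),
    "queen": np.array([1.0, 1.0, 1.0]),
    "apple": np.array([-1.0, 0.2, 0.1]),
}


def solve_analogy(a, b, c, vectors):
    """Return the word whose vector is most cosine-similar to b - a + c."""
    target = vectors[b] - vectors[a] + vectors[c]
    target = target / np.linalg.norm(target)
    best_word, best_sim = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):          # question words are excluded by convention
            continue
        sim = float(vec @ target / np.linalg.norm(vec))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word


def top1_accuracy(questions, vectors):
    """Fraction of (a, b, c, expected) questions answered correctly."""
    hits = sum(solve_analogy(a, b, c, vectors) == d for a, b, c, d in questions)
    return hits / len(questions)


questions = [
    ("man", "king", "woman", "queen"),
    ("king", "queen", "man", "woman"),
]
accuracy = top1_accuracy(questions, vectors)
```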
Generating embeddings for large document collections (batch encoding) is significantly faster on a GPU: open-weight models such as BGE-M3 can run locally on CUDA devices, while API-hosted models such as text-embedding-3 run on the provider's hardware.