1) From a corpus, build a |V|×|V| co-occurrence matrix X, where |V| is the vocabulary size and X_ij is the weighted count of word j appearing in the context window of word i (each occurrence weighted by 1/distance).
2) Each word receives two vectors, a word vector w_i and a context vector w̃_j, plus biases b_i and b̃_j.
3) The model minimizes the weighted loss J = Σ_ij f(X_ij) · (w_i^T w̃_j + b_i + b̃_j − log X_ij)^2, where the weighting function is f(x) = (x/x_max)^α for x < x_max and 1 otherwise (typically x_max = 100, α = 0.75). This down-weights rare pairs and caps the weight of very frequent ones.
4) Training uses stochastic gradient descent with AdaGrad over the nonzero entries of X.
5) The final word embedding is the sum w_i + w̃_i.
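The five steps above can be sketched in NumPy as follows. This is a toy illustration under stated assumptions, not the official implementation: the names `weight` and `train_glove`, the dense matrix, the learning rate, the initialization scale, and the AdaGrad accumulators starting at 1 are all choices made for this example.

```python
import numpy as np

def weight(x, x_max=100.0, alpha=0.75):
    # f(x) = (x / x_max)^alpha below the cutoff, 1 at or above it
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def train_glove(X, dim=8, epochs=50, lr=0.05, seed=0):
    """Fit word and context vectors to a small dense co-occurrence matrix X."""
    rng = np.random.default_rng(seed)
    V = X.shape[0]
    W = (rng.random((V, dim)) - 0.5) / dim    # word vectors w_i
    Wc = (rng.random((V, dim)) - 0.5) / dim   # context vectors w~_j
    b, bc = np.zeros(V), np.zeros(V)          # biases b_i and b~_j
    # AdaGrad accumulators of squared gradients (assumed to start at 1)
    gW, gWc = np.ones_like(W), np.ones_like(Wc)
    gb, gbc = np.ones(V), np.ones(V)
    rows, cols = np.nonzero(X)                # train only on nonzero entries
    history = []
    for _ in range(epochs):
        total = 0.0
        for k in rng.permutation(len(rows)):
            i, j = rows[k], cols[k]
            fx = weight(X[i, j])
            diff = W[i] @ Wc[j] + b[i] + bc[j] - np.log(X[i, j])
            total += fx * diff ** 2
            g = 2.0 * fx * diff               # gradient w.r.t. the prediction
            grad_w, grad_wc = g * Wc[j], g * W[i]
            W[i] -= lr * grad_w / np.sqrt(gW[i])
            Wc[j] -= lr * grad_wc / np.sqrt(gWc[j])
            b[i] -= lr * g / np.sqrt(gb[i])
            bc[j] -= lr * g / np.sqrt(gbc[j])
            gW[i] += grad_w ** 2
            gWc[j] += grad_wc ** 2
            gb[i] += g ** 2
            gbc[j] += g ** 2
        history.append(total)
    return W + Wc, history                    # final embedding: w_i + w~_i

# Made-up symmetric counts for a three-word vocabulary
X_toy = np.array([[0.0, 4.0, 1.0],
                  [4.0, 0.0, 2.0],
                  [1.0, 2.0, 0.0]])
embeddings, history = train_glove(X_toy)
```

On real corpora X is far too large to store densely, so a practical version would iterate over a sparse list of (i, j, X_ij) records instead.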
Prior methods fell into two camps: global matrix factorization (LSA, HAL), which efficiently used corpus statistics but performed poorly on word analogy tasks, and local context-window methods (word2vec), which excelled on analogies but did not exploit full global statistics. GloVe unifies both — it learns representations directly from global co-occurrence counts while performing well on both analogy and similarity benchmarks.
A |V|×|V| matrix where X_ij is the weighted count of word j appearing in the context window of word i. Built once in a single pass over the corpus.
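As an illustration of how such a matrix might be filled, the sketch below counts co-occurrences in a token list with 1/distance weighting. The function name `build_cooccurrence`, the symmetric window, and the dictionary storage are assumptions made for this example, not details of the official tooling.

```python
from collections import defaultdict

def build_cooccurrence(tokens, window=2):
    """Return {(focus, context): weighted count} using 1/distance weights."""
    counts = defaultdict(float)
    for pos, focus in enumerate(tokens):
        # Look only to the left and add both directions, so each pair of
        # positions is visited once and the result is symmetric.
        for dist in range(1, window + 1):
            ctx_pos = pos - dist
            if ctx_pos < 0:
                break
            context = tokens[ctx_pos]
            counts[(focus, context)] += 1.0 / dist
            counts[(context, focus)] += 1.0 / dist
    return dict(counts)

tokens = "the cat sat on the mat".split()
X = build_cooccurrence(tokens, window=2)
```

Only pairs that actually co-occur are stored, which is why the matrix is usually kept in a sparse form rather than as a full |V|×|V| array.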
Two sets of vectors (plus biases b_i, b̃_j) trained jointly. The final word embedding is the sum w_i + w̃_i, which smooths noise and slightly improves results.
f(x) = (x/x_max)^α for x < x_max, otherwise 1. Typical values are x_max = 100 and α = 0.75. The function down-weights very rare pairs (potential noise) and caps the weight of very frequent ones (e.g. pairs involving stopwords) so that they do not dominate the loss.
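A small sketch of the weighting function with the typical values quoted above may make its shape easier to see. The function name `glove_weight` is chosen for this example only.

```python
def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weight for a co-occurrence count x: rises as a power law, then caps at 1."""
    if x < x_max:
        return (x / x_max) ** alpha
    return 1.0

# Weights for a rare pair, a moderately frequent pair, and two frequent pairs
weights = {count: glove_weight(count) for count in (1, 10, 100, 1000)}
```

With these defaults a pair seen once gets a weight of about 0.03, a pair seen ten times about 0.18, and any pair seen 100 times or more gets exactly 1.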
GloVe learns one vector per word from a fixed vocabulary. Words absent from the training corpus (rare terms, misspellings, neologisms) have no representation.
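In practice, code that consumes the vectors has to decide what to do with a missing word. The sketch below shows one common workaround, falling back to a shared "unknown" vector (here the mean of all vectors). The helper `lookup`, the toy table, and the choice of fallback are assumptions for illustration and are not part of GloVe itself.

```python
import numpy as np

def lookup(word, table, unk_vector):
    """Return (vector, found); fall back to unk_vector for unseen words."""
    vec = table.get(word)
    if vec is None:
        return unk_vector, False
    return vec, True

# Toy embedding table with made-up values
table = {
    "river": np.array([0.2, 0.1, -0.4]),
    "bank": np.array([0.4, -0.3, 0.0]),
}
unk = np.mean(list(table.values()), axis=0)  # assumed fallback: mean vector

vec, found = lookup("rivr", table, unk)      # misspelling, not in vocabulary
```

The fallback carries no information about the misspelled word, which is the limitation that subword models try to address.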
Each word has one vector regardless of sentence context. A polysemous word (e.g. "bank" as riverbank vs. financial institution) gets a single vector that blends all of its senses.
For very large corpora, matrix X may not fit in RAM. The official implementation uses disk-based shuffling and sparse storage but requires careful parameter tuning.
Results depend strongly on context window size, lowercasing, stopword removal, and tokenization. Reproducing paper numbers requires matching the original preprocessing.
Mikolov et al. release word2vec — local context-window methods for learning distributed word representations. Direct predecessor and competitor of GloVe.
Pennington, Socher, and Manning publish the paper and release pretrained vectors on Wikipedia+Gigaword, Common Crawl, and Twitter.
Bojanowski et al. introduce fastText, extending word2vec with character n-grams — addressing the out-of-vocabulary (OOV) problem to which both GloVe and word2vec are vulnerable.
ELMo (Peters et al. 2018) and BERT (Devlin et al. 2018) introduce contextual word representations where the embedding depends on the surrounding sentence. Static embeddings like GloVe gradually become secondary in NLP research.
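Time complexity: O(|X|) per epoch, where |X| is the number of nonzero entries of X; the paper estimates |X| = O(|C|^0.8) for a corpus of |C| tokens. Space complexity: O(|V|·d + |X|) for the parameters and the sparse matrix.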
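To make the space bound concrete, the following back-of-the-envelope sketch turns O(|V|·d + |X|) into bytes. The function `estimate_bytes` and its constants are assumptions for this example: 8 bytes per parameter, one AdaGrad accumulator per parameter, and 16 bytes per stored nonzero entry (two 4-byte indices plus an 8-byte value).

```python
def estimate_bytes(vocab_size, dim, nonzeros,
                   bytes_per_param=8, bytes_per_entry=16):
    """Rough memory estimate: (model bytes, co-occurrence matrix bytes)."""
    # Word and context vectors, each with one bias term per word
    params = 2 * vocab_size * (dim + 1)
    # Parameters plus one squared-gradient accumulator per parameter
    model = 2 * params * bytes_per_param
    matrix = nonzeros * bytes_per_entry
    return model, matrix

# Example: 400,000-word vocabulary, 300 dimensions, one billion nonzero entries
model_bytes, matrix_bytes = estimate_bytes(400_000, 300, 1_000_000_000)
```

Under these assumptions the model term is a few gigabytes, while the matrix term grows with the corpus and tends to dominate for large corpora.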
Building matrix X is a one-time pass over the corpus, but for very large corpora (e.g. the 840B-token Common Crawl) it requires substantial memory and I/O. A single training epoch is typically faster than matrix construction, although a full run of many epochs can take longer overall.
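Dimensionality of word vectors. The original paper varied the dimension up to 600 in its analogy experiments; released pretrained vectors come in 50, 100, 200, and 300 dimensions for Wikipedia+Gigaword, 25 to 200 for Twitter, and 300 for Common Crawl.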
Number of words to the left/right of the focus word counted for co-occurrence. Larger windows capture topical semantics; smaller windows capture syntactic features.
Cutoff in the weighting function f(x) — above x_max the weight is 1. Controls the influence of high-frequency pairs.
Exponent in the weighting function f(x) = (x/x_max)^α. The value 0.75 was chosen empirically; the paper reports a modest improvement over the linear version with α = 1.
Number of epochs over the nonzero entries of matrix X (AdaGrad).
GloVe produces dense, static word vectors (one vector per word, context-independent) — in contrast to contextual embeddings (ELMo, BERT).
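Because the vectors are static, using them amounts to a table lookup followed by ordinary vector arithmetic. The sketch below parses the plain-text layout used by the released files (one word per line followed by space-separated floats) and compares words by cosine similarity. The helper names `load_vectors` and `cosine` and the tiny in-memory sample with made-up values are assumptions for this example.

```python
import io
import numpy as np

def load_vectors(handle):
    """Parse lines of the form 'word v1 v2 ... vd' into {word: vector}."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up two-dimensional sample standing in for a real vector file
sample = io.StringIO("king 1.0 0.0\nqueen 0.9 0.1\napple 0.0 1.0\n")
vectors = load_vectors(sample)

sim_close = cosine(vectors["king"], vectors["queen"])
sim_far = cosine(vectors["king"], vectors["apple"])
```

For a real file, the `io.StringIO` object would be replaced by an open file handle with UTF-8 encoding; the same word always maps to the same vector, whatever sentence it came from.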
AdaGrad training can be parallelized over the nonzero entries with asynchronous parameter updates (Hogwild!-style). Inference is a plain vector lookup: it involves no computation and is trivially parallel.
The official Stanford implementation is written in C and is optimized for multi-core CPUs using POSIX threads. GloVe training does not require GPUs, since the operations are mostly sparse vector updates.
Training can be ported to GPU (PyTorch/TensorFlow implementations exist), though gains are smaller than for dense neural models.