Word2Vec comes in two architectures: CBOW (predicts a word from its context) and Skip-gram (predicts the context from a word). Training uses negative sampling or hierarchical softmax to avoid the cost of a full softmax over the vocabulary. After training, the rows of the input-to-hidden weight matrix are used as the word embeddings.
Sparse representations (one-hot, TF-IDF) treat words as independent symbols and capture neither synonymy nor semantic relations. Word2Vec learns dense vectors in which semantically similar words lie close together.
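The contrast between sparse and dense representations can be illustrated with cosine similarity. This is a minimal sketch: the three-word vocabulary and the dense vector values are made up for illustration and are not trained embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vocab = ["cat", "dog", "car"]

# One-hot: each word gets its own axis, so distinct words are always orthogonal.
one_hot = {
    word: [1.0 if i == j else 0.0 for j in range(len(vocab))]
    for i, word in enumerate(vocab)
}

# Hypothetical dense vectors (made-up values, not the output of real training).
dense = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

one_hot_sim = cosine(one_hot["cat"], one_hot["dog"])      # exactly 0.0
dense_sim_related = cosine(dense["cat"], dense["dog"])    # close to 1
dense_sim_unrelated = cosine(dense["cat"], dense["car"])  # much lower
```

With one-hot vectors every pair of distinct words has similarity zero, so no notion of relatedness can be expressed. Dense vectors leave room for graded similarity.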
CBOW (Continuous Bag of Words): architecture that predicts the target word from the averaged context vectors; faster to train and better for frequent words.
Skip-gram: architecture that predicts the context words from the target word; better for rare words and small corpora.
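The difference between the two architectures shows up in how training examples are built from a sentence. The sketch below assumes a symmetric, fixed-size window; the function names `skipgram_pairs` and `cbow_examples` are hypothetical helpers, not part of any library.

```python
def skipgram_pairs(tokens, window):
    """Build (target, context_word) pairs: one pair per context word."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_examples(tokens, window):
    """Build (context_words, target) examples: the whole window predicts one word."""
    examples = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, target))
    return examples

tokens = "the quick brown fox jumps".split()
sg = skipgram_pairs(tokens, window=1)
cb = cbow_examples(tokens, window=1)

# sg[0] is ("the", "quick"); cb[2] is (["quick", "fox"], "brown")
```

Skip-gram emits one training pair per context word, whereas CBOW folds the whole window into a single example per position.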
Negative sampling: a softmax approximation. Instead of normalising over the whole vocabulary, the model learns to distinguish true (word, context) pairs from a few randomly sampled negative ones.
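The objective for a single training pair can be sketched as follows. This is an illustrative sketch, not the reference implementation: the vectors are tiny made-up values, and how the negative words are drawn from the vocabulary is left outside the example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def negative_sampling_loss(target_vec, context_vec, negative_vecs):
    """Loss for one true pair plus k negatives.

    loss = -log sigmoid(context . target) - sum_k log sigmoid(-negative_k . target)

    Only k + 1 dot products are needed, instead of one per vocabulary word.
    """
    loss = -math.log(sigmoid(dot(context_vec, target_vec)))
    for neg in negative_vecs:
        loss -= math.log(sigmoid(-dot(neg, target_vec)))
    return loss

# Made-up 2-dimensional vectors for illustration only.
target = [1.0, 0.0]
true_context = [2.0, 0.0]                # aligned with the target: low loss term
negatives = [[-2.0, 0.0], [0.0, 1.0]]    # opposed and orthogonal to the target

loss = negative_sampling_loss(target, true_context, negatives)

# Swapping the true context for a vector opposed to the target raises the loss.
worse_loss = negative_sampling_loss(target, [-2.0, 0.0], negatives)
```

Minimising this loss pushes the true pair's dot product up and the negatives' dot products down, which is the binary discrimination task described above.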
Word2Vec has no vector for words absent from the training corpus.
"Bank" (river / finance) gets a single averaged vector.
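Both limitations follow from the fact that a trained model is, in effect, a static lookup table. A minimal sketch, assuming a hypothetical toy table with made-up values:

```python
# Hypothetical toy embedding table (made-up values, not trained vectors).
embeddings = {
    "bank": [0.4, 0.5],
    "river": [0.9, 0.1],
    "money": [0.1, 0.9],
}

def lookup(word):
    """Return the stored vector, or None for an out-of-vocabulary word."""
    return embeddings.get(word)

# The same vector is returned whichever sense the sentence uses.
v_river_sense = lookup("bank")   # "sat on the river bank"
v_money_sense = lookup("bank")   # "deposited money at the bank"

# A word never seen in training has no entry at all.
oov = lookup("cryptocurrency")
```

The lookup takes only the word itself as input, so the surrounding sentence cannot influence the result.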
Bengio et al. introduce learned word representations in a neural language model — a precursor to word2vec.
Mikolov et al. publish CBOW and Skip-gram with efficient training — dense embeddings enter the mainstream.
Pennington et al. (Stanford) propose GloVe — embeddings based on global co-occurrence statistics.
Facebook AI releases FastText — character n-gram embeddings that handle out-of-vocabulary words.
Context-dependent embeddings displace static word2vec vectors in tasks requiring disambiguation.
Time complexity: O(C·E + E·log V) per sample (hierarchical softmax), where C is the number of context words, E the embedding dimension, and V the vocabulary size. Space complexity: O(V·E).
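To give the complexity figures a sense of scale, the sketch below plugs in assumed values (10 context words, 300 dimensions, a vocabulary of one million words, 32-bit floats). These numbers are illustrative assumptions, not measurements, and the operation counts ignore constant factors.

```python
import math

C = 10          # context words per sample (assumed)
E = 300         # embedding dimensions (assumed)
V = 1_000_000   # vocabulary size (assumed)

# Per-sample cost with hierarchical softmax: C*E + E*log2(V)
hs_ops = C * E + E * math.log2(V)

# Per-sample cost if the output layer were a full softmax: C*E + E*V
full_ops = C * E + E * V

# Space: V*E parameters per weight matrix, 4 bytes each as float32.
matrix_bytes = V * E * 4
matrix_gb = matrix_bytes / 1e9

speedup = full_ops / hs_ops
```

Under these assumptions the hierarchical softmax needs on the order of 10^4 operations per sample, against roughly 3·10^8 for a full softmax, and one V×E matrix occupies about 1.2 GB. Implementations typically keep both an input and an output matrix, a constant factor that the O-notation absorbs.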
CBOW (faster, better for frequent words) vs Skip-gram (better for rare words).
Number of embedding dimensions — a trade-off between expressiveness and cost.
Number of words around the target treated as context.
The original (C) implementation is highly optimised for multi-threaded CPUs.
GPU training is possible, but the benefit is smaller than for deep models due to the shallow architecture.