Word2Vec

2013 · Active · Published

Key innovation

Showed that a shallow neural network trained on a context-prediction task learns vectors with algebraic properties (king - man + woman ≈ queen).
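The analogy arithmetic can be sketched with a few hand-made vectors. The 2-D values below are illustrative assumptions (axis 0 stands loosely for "royalty", axis 1 for "maleness"), not vectors learned by Word2Vec, and the `analogy` helper is a hypothetical name.

```python
import math

# Hand-made toy vectors: illustrative assumptions, not trained embeddings.
TOY_VECTORS = {
    "king":  [0.9, 0.9],
    "queen": [0.9, 0.1],
    "man":   [0.1, 0.9],
    "woman": [0.1, 0.1],
    "apple": [0.0, 0.5],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b, c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy("king", "man", "woman", TOY_VECTORS))  # queen
```

Excluding the three query words from the candidates is the usual convention when evaluating analogies; without it the nearest neighbour is often one of the inputs.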

Category

Data

Abstraction level

Building block

Operation level

Data

Use cases

Initialising embeddings in NLP models
Semantic search and word similarity
Recommendation systems (item2vec)
Clustering and semantic visualisation

How it works

Two architectures: CBOW (predicts a word from its surrounding context) and Skip-gram (predicts the surrounding context from a word). Training uses negative sampling or hierarchical softmax to avoid the cost of a full softmax over the vocabulary. After training, the input-to-hidden weight matrix is kept: each of its rows is the embedding of one vocabulary word.
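The difference between the two architectures shows up in how training examples are cut from a sentence. A minimal sketch, assuming a fixed symmetric window of 2 tokens and a hypothetical helper `training_examples`; reference implementations may vary the effective window per position, so the fixed window here is a simplification.

```python
def training_examples(tokens, window=2):
    """Yield (target, context_words) for every position in the sentence."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right

tokens = "the quick brown fox jumps".split()

# CBOW: one example per position, whole context -> target.
cbow = [(context, target) for target, context in training_examples(tokens)]

# Skip-gram: one example per (target, single context word) pair.
skipgram = [(target, c)
            for target, context in training_examples(tokens)
            for c in context]

print(cbow[2])       # (['the', 'quick', 'fox', 'jumps'], 'brown')
print(skipgram[:2])  # [('the', 'quick'), ('the', 'brown')]
```

Skip-gram produces several examples per position where CBOW produces one, which is one reason it is slower to train on the same corpus.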

Problem solved

Sparse representations (one-hot, TF-IDF) treat words as independent symbols and capture neither synonymy nor semantic relations. Word2Vec learns dense vectors in which semantically similar words lie close together.
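A small illustration of the contrast, assuming a three-word vocabulary and hand-picked dense vectors (hypothetical values, not trained ones): any two distinct one-hot vectors have similarity zero, while dense vectors can place related words close together.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

vocab = ["cat", "dog", "car"]

def one_hot(word):
    """Sparse representation: 1.0 at the word's index, 0.0 elsewhere."""
    return [1.0 if w == word else 0.0 for w in vocab]

# Hypothetical dense vectors chosen by hand for illustration only.
dense = {"cat": [0.8, 0.1], "dog": [0.7, 0.2], "car": [0.1, 0.9]}

print(cosine(one_hot("cat"), one_hot("dog")))  # 0.0
print(cosine(dense["cat"], dense["dog"]) > cosine(dense["cat"], dense["car"]))  # True
```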

Components

CBOW · Training architecture

Architecture predicting the target word from the average of its context word vectors. Faster to train than Skip-gram and tends to represent frequent words slightly better.

Official
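A sketch of the CBOW forward pass described above, assuming a four-word vocabulary, 2-dimensional vectors, and hand-picked weights (`W_in`, `W_out`, and `cbow_predict` are hypothetical names). It uses a full softmax for readability; actual training would replace that step with negative sampling or hierarchical softmax.

```python
import math

vocab = ["the", "cat", "sat", "down"]

# Hypothetical weights: input (embedding) vectors and output (scoring) vectors.
W_in = {"the": [0.1, 0.3], "cat": [0.5, -0.2], "sat": [-0.3, 0.4], "down": [0.2, 0.2]}
W_out = {"the": [0.2, 0.1], "cat": [-0.1, 0.6], "sat": [0.4, -0.3], "down": [0.0, 0.5]}

def cbow_predict(context):
    """Return a probability for each vocabulary word given the context words."""
    dim = len(W_in[vocab[0]])
    # 1. Average the input vectors of the context words (the hidden layer).
    hidden = [sum(W_in[w][d] for w in context) / len(context) for d in range(dim)]
    # 2. Score every vocabulary word against the hidden vector.
    scores = [sum(h * o for h, o in zip(hidden, W_out[w])) for w in vocab]
    # 3. Normalise with a numerically stable softmax.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return {w: e / total for w, e in zip(vocab, exps)}

probs = cbow_predict(["the", "sat"])
print(max(probs, key=probs.get), round(sum(probs.values()), 6))
```

Because the context vectors are averaged, word order inside the window does not affect the prediction.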

Skip-gram · Training architecture

Architecture predicting context words from the target word — better for rare words and small corpora.

Official

Negative Sampling · Training objective

A softmax approximation: instead of normalising over the whole vocabulary, the model learns to distinguish true pairs from a few random negative ones.

Official
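One way to make the negative-sampling objective concrete is a single gradient step on one (centre, context) pair. A minimal sketch under stated assumptions: the vectors are hand-picked, the two negatives are fixed rather than sampled (the paper draws them from a smoothed unigram distribution), and `sgns_loss` and `sgns_step` are hypothetical names.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sgns_loss(v, u_pos, u_negs):
    """-log s(u_pos.v) - sum over negatives of log s(-u_neg.v), s = sigmoid."""
    loss = -math.log(sigmoid(dot(u_pos, v)))
    for u in u_negs:
        loss -= math.log(sigmoid(-dot(u, v)))
    return loss

def sgns_step(v, u_pos, u_negs, lr=0.1):
    """One SGD step on the loss above; returns updated copies of all vectors."""
    g_pos = sigmoid(dot(u_pos, v)) - 1.0            # d loss / d positive score
    g_negs = [sigmoid(dot(u, v)) for u in u_negs]   # d loss / d negative scores
    grad_v = [g_pos * p + sum(g * u[d] for g, u in zip(g_negs, u_negs))
              for d, p in enumerate(u_pos)]
    new_v = [x - lr * g for x, g in zip(v, grad_v)]
    new_u_pos = [p - lr * g_pos * x for p, x in zip(u_pos, v)]
    new_u_negs = [[n - lr * g * x for n, x in zip(u, v)]
                  for g, u in zip(g_negs, u_negs)]
    return new_v, new_u_pos, new_u_negs

# Hand-picked illustrative vectors: centre word, true context, two negatives.
v = [0.1, -0.2, 0.3]
u_pos = [0.2, 0.1, -0.1]
u_negs = [[0.3, -0.1, 0.2], [-0.2, 0.2, 0.1]]

before = sgns_loss(v, u_pos, u_negs)
after = sgns_loss(*sgns_step(v, u_pos, u_negs))
print(after < before)  # True
```

Each step touches only the centre vector, one positive vector, and a handful of negatives, so its cost does not grow with vocabulary size.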

Implementation

Reference implementations

gensim Word2Vec

Python · RaRe Technologies

Original word2vec (C)

Implementation pitfalls

No handling of out-of-vocabulary (OOV) words · High

Word2Vec has no vector for words absent from the training corpus.

Fix: Use FastText (character n-gram embeddings) or contextual embeddings.
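The idea behind the character n-gram fix can be sketched as follows. This is an illustration of FastText-style n-gram extraction with `<` and `>` as word-boundary markers, not FastText's own code; `char_ngrams` is a hypothetical helper and the 3 to 6 length range is an assumed default.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of the word wrapped in '<' and '>' boundary markers."""
    marked = f"<{word}>"
    return [marked[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(marked) - n + 1)]

print(char_ngrams("where", 3, 3))  # ['<wh', 'whe', 'her', 'ere', 're>']

# A word missing from the training vocabulary still shares n-grams with a
# seen word, so a subword model can compose a vector for it from those pieces.
seen = set(char_ngrams("embedding"))
unseen = set(char_ngrams("embeddings"))
shared = seen & unseen
print(len(shared) > 0)  # True
```

A plain Word2Vec lookup table has no such fallback: a word is either a key in the table or has no vector at all.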

One vector per word, no disambiguation · Medium

"Bank" (river / finance) gets a single averaged vector.

Fix: Use contextual embeddings (BERT, ELMo) where in-context meaning matters.

Evolution

Original paper · 2013 · Tomas Mikolov

Efficient Estimation of Word Representations in Vector Space

Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean

2003

Neural Probabilistic Language Model (Bengio)

Bengio et al. introduce learned word representations in a neural language model — a precursor to word2vec.

2013

Word2Vec release

Inflection point

Mikolov et al. publish CBOW and Skip-gram with efficient training — dense embeddings enter the mainstream.

2014

GloVe as an alternative

Pennington et al. (Stanford) propose GloVe — embeddings based on global co-occurrence statistics.

2016

FastText solves the OOV problem

Facebook AI releases FastText — character n-gram embeddings that handle out-of-vocabulary words.

2018

Contextual embeddings (ELMo, BERT)

Inflection point

Context-dependent embeddings displace static word2vec vectors in tasks requiring disambiguation.

Sources

Distributed Representations of Words and Phrases and their Compositionality

Mikolov et al. (2013) — negative sampling and phrase vectors.