Architecture

fastText

2017 · Active · Published

Key innovation

Representing a word as the sum of its character n-gram vectors (subword units) instead of a single vector per whole word — solving the out-of-vocabulary (OOV) problem and substantially improving embedding quality for morphologically rich languages.

How it works

1) Each word w is expanded to a subword representation: boundary markers are added (e.g. "where" becomes <where>, whose 3-grams are <wh, whe, her, ere, re>), and a bag of character n-grams G_w is formed for lengths minn to maxn (typically 3–6), plus the word itself.

2) Each n-gram g is assigned a vector z_g; the word vector is v_w = Σ_{g∈G_w} z_g.

3) Training proceeds as in word2vec (skip-gram or CBOW with negative sampling), but gradients are propagated to all n-grams composing the word.

4) To bound the parameter count, n-grams are hashed into a fixed number of buckets via the hashing trick (Fowler-Noll-Vo hash, 2M buckets by default).

5) For OOV words at inference time, a vector is computed from the word's n-grams even if the word never appeared in training.

6) In text-classification mode (a separate model), a sentence is represented as the mean of its word and n-gram vectors, and a linear classifier with hierarchical softmax computes class probabilities, which enables training on millions of examples in minutes.
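The classification mode described in the last step can be illustrated with a minimal sketch. It assumes the word and n-gram vectors are already available as plain lists, and it swaps in a full softmax for the hierarchical softmax to keep the example short; the function names and the toy numbers are hypothetical and not part of fastText.

```python
import math


def mean_vector(vectors):
    """Average a list of equal-length vectors component by component."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(feature_vectors, weights):
    """feature_vectors: word and n-gram vectors of one sentence.
    weights: one row of classifier weights per class."""
    hidden = mean_vector(feature_vectors)
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in weights]
    return softmax(logits)


# Toy, hand-picked numbers purely for illustration.
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [[2.0, 0.0], [0.0, -2.0]]
probs = classify(sentence, weights)
print(probs)
```

With K classes the full softmax above costs O(K) per example; the hierarchical variant replaces it with a walk down a binary tree.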

Problem solved

Word2vec and GloVe learn one vector per word from a closed vocabulary — out-of-vocabulary words (OOV: rare terms, misspellings, neologisms, inflected forms in languages like Polish, Finnish, Turkish) have no representation. Moreover, these models ignore the internal morphological structure of words: "running" and "runner" are independent tokens to them. fastText solves both problems via subword vector composition.

Components

Character n-gram bag G_w · Subword representation of the word.

Set of character n-grams of lengths minn..maxn, extracted after wrapping the word in the boundary markers < and >. For minn=3, maxn=6 the word "where" generates <wh, whe, her, ere, re> (length 3), <whe, wher, here, ere> (length 4), …, and the whole marked word <where>.
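The extraction can be sketched in a few lines of Python. This is a simplified illustration that works on Python characters rather than raw bytes; the function name `char_ngrams` is hypothetical and not part of the fastText API.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Return the bag of character n-grams for `word`, plus the marked word itself."""
    marked = f"<{word}>"  # boundary markers distinguish prefixes and suffixes
    grams = []
    for n in range(minn, maxn + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    if marked not in grams:
        grams.append(marked)  # the whole word is kept as its own unit
    return grams


trigrams = [g for g in char_ngrams("where") if len(g) == 3]
print(trigrams)  # ['<wh', 'whe', 'her', 'ere', 're>']
```

Note that the 3-gram "her" taken from "where" is a different unit from the marked word <her>, which is what the boundary markers are for.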

Official

N-gram vectors z_g · Parameters learned during optimization; source of OOV capability.

Trainable vectors for each n-gram (after hashing to a bucket). The word vector is the sum v_w = Σ z_g over all g in G_w.
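A toy illustration of the sum, and of why it yields a vector for an unseen word, is given below. It is a sketch under simplifying assumptions: the n-gram table is a plain dictionary filled with random numbers instead of trained, hashed buckets, and n-grams without an entry are skipped. All names are hypothetical.

```python
import random

DIM = 4  # toy dimensionality; real models use far more dimensions


def char_ngrams(word, minn=3, maxn=6):
    marked = f"<{word}>"
    grams = {marked[i:i + n]
             for n in range(minn, maxn + 1)
             for i in range(len(marked) - n + 1)}
    grams.add(marked)
    return sorted(grams)


# Hypothetical n-gram table "learned" from a training vocabulary of one word.
rng = random.Random(0)
z = {g: [rng.uniform(-1, 1) for _ in range(DIM)] for g in char_ngrams("running")}


def word_vector(word):
    """v_w = sum of z_g over the n-grams of `word` that have a vector."""
    v = [0.0] * DIM
    for g in char_ngrams(word):
        if g in z:
            v = [a + b for a, b in zip(v, z[g])]
    return v


v_oov = word_vector("runner")  # "runner" was never seen in training
shared = set(char_ngrams("runner")) & set(z)
print(sorted(shared))
```

The unseen word "runner" still gets a non-zero vector because it shares n-grams such as "run" and "<run" with "running".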

Hashing trick for n-grams · Memory bound and implicit regularization.

The number of unique n-grams grows quickly with vocabulary size. fastText hashes each n-gram with the Fowler-Noll-Vo function and takes the result modulo the bucket count (2M by default), so colliding n-grams share one vector.
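The bucket assignment can be sketched as follows, assuming the standard 32-bit FNV-1a variant of the Fowler-Noll-Vo hash. The reference implementation may differ in details such as the handling of non-ASCII bytes or index offsets, and the function names here are hypothetical.

```python
FNV_OFFSET_BASIS = 2166136261  # 32-bit FNV offset basis
FNV_PRIME = 16777619           # 32-bit FNV prime


def fnv1a_32(data: bytes) -> int:
    """Standard 32-bit FNV-1a: xor each byte in, then multiply by the prime."""
    h = FNV_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFF  # keep the value in 32 bits
    return h


def bucket_index(ngram: str, bucket: int = 2_000_000) -> int:
    """Map an n-gram to one of `bucket` shared vector slots."""
    return fnv1a_32(ngram.encode("utf-8")) % bucket


for g in ("<wh", "whe", "re>"):
    print(g, bucket_index(g))
```

Because the table has a fixed number of rows, memory depends on `bucket` and the vector dimension, not on how many distinct n-grams the corpus contains.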

Official

Skip-gram / CBOW objective with negative sampling · Gradient training signal.

Loss function inherited from word2vec: predicting context words (skip-gram) or the center word from its context (CBOW), with negative sampling as a cheap approximation to the full softmax.
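For a single (word, context) pair, the skip-gram loss with negative sampling can be written out as below. This is a sketch assuming the logistic loss log(1 + exp(-x)) and a score equal to the sum of dot products between the word's n-gram vectors and the context vector; all function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def score(ngram_vectors, context_vector):
    """s(w, c): sum of dot products between the word's n-gram vectors and the context vector."""
    return sum(dot(z_g, context_vector) for z_g in ngram_vectors)


def logistic_loss(x):
    """log(1 + exp(-x)); small when x is large and positive."""
    return math.log1p(math.exp(-x))


def negative_sampling_loss(ngram_vectors, positive_ctx, negative_ctxs):
    """One observed context is pushed towards a high score,
    each sampled negative towards a low score."""
    loss = logistic_loss(score(ngram_vectors, positive_ctx))
    for neg in negative_ctxs:
        loss += logistic_loss(-score(ngram_vectors, neg))
    return loss


# Toy vectors purely for illustration.
grams = [[0.5, 0.0], [0.5, 1.0]]
print(negative_sampling_loss(grams, [1.0, 1.0], [[-1.0, -1.0]]))
```

Since the score is a sum over n-gram vectors, the gradient of this loss reaches every n-gram of the word, which is how the subword vectors get trained.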

Official

Hierarchical softmax classifier (text classification mode) · Classifier scalability.

In text-classification mode the output layer is a Huffman tree over classes, reducing softmax cost from O(K) to O(log K), where K is the number of classes. Critical when K reaches hundreds of thousands.
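To see where the logarithmic cost comes from, the sketch below builds a Huffman tree over class frequencies and reports the path length from the root to each class, which is the number of binary decisions needed to score that class. It is an illustration of the idea, not fastText's actual code; `huffman_depths` is a hypothetical helper.

```python
import heapq


def huffman_depths(class_counts):
    """Return, for each class, the length of its root-to-leaf path in a Huffman tree."""
    depths = [0] * len(class_counts)
    # Heap items: (total count, tie-breaker, indices of the classes under this node).
    heap = [(count, i, [i]) for i, count in enumerate(class_counts)]
    heapq.heapify(heap)
    next_id = len(class_counts)
    while len(heap) > 1:
        c1, _, leaves1 = heapq.heappop(heap)
        c2, _, leaves2 = heapq.heappop(heap)
        for leaf in leaves1 + leaves2:
            depths[leaf] += 1  # every merge adds one binary decision on the path
        heapq.heappush(heap, (c1 + c2, next_id, leaves1 + leaves2))
        next_id += 1
    return depths


print(huffman_depths([1] * 8))          # [3, 3, 3, 3, 3, 3, 3, 3]
print(huffman_depths([8, 4, 2, 1, 1]))  # [1, 2, 3, 4, 4]
```

With equal class frequencies every path has length log2(K); with skewed frequencies the most common classes sit closest to the root, so the average cost is lower still.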

Official

Implementation

Reference implementations

facebookresearch/fastText (official)

C++ · Facebook AI Research (FAIR)

Official

fastText project page (pretrained vectors, docs)

— · Facebook AI Research (FAIR)

Official

Gensim FastText

Python · RaRe Technologies

Implementation pitfalls

N-gram hash collisions · Medium

The hashing trick maps n-grams to a finite bucket count (default 2M). With very large vocabularies, distinct n-grams share vectors, which may degrade embedding quality for rare subwords.

Fix: Increase the `bucket` parameter (more memory) or train on a focused in-domain corpus with bounded vocabulary.
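The effect of the bucket count on collisions can be checked on a toy n-gram inventory. The sketch assumes the standard 32-bit FNV-1a hash; the inventory (all 3-letter strings over eight letters) and the helper names are hypothetical and stand in for a real corpus.

```python
def fnv1a_32(data: bytes) -> int:
    h = 2166136261
    for byte in data:
        h = ((h ^ byte) * 16777619) & 0xFFFFFFFF
    return h


def collisions(ngrams, bucket):
    """Number of distinct n-grams that share a bucket with some other n-gram's slot."""
    distinct = set(ngrams)
    occupied = {fnv1a_32(g.encode("utf-8")) % bucket for g in distinct}
    return len(distinct) - len(occupied)


# Hypothetical toy inventory: every 3-letter string over a small alphabet.
alphabet = "abcdefgh"
ngrams = [a + b + c for a in alphabet for b in alphabet for c in alphabet]  # 512 n-grams

for bucket in (64, 2_000_000):
    print(bucket, collisions(ngrams, bucket))
```

With 64 buckets at least 448 of the 512 n-grams must share a slot; with 2M buckets collisions become rare for an inventory this small, at the price of a larger embedding table.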

Poor minn/maxn for the target language · Medium

The default of 3–6 characters works well for English. Languages with long compounds or morphemes (German, Finnish) may benefit from longer n-grams; languages written with logographic characters (Chinese, Japanese) require different settings or upstream segmentation.

Fix: Tune minn/maxn per language; for CJK consider word/subword segmentation upstream of fastText.

Static embeddings, no context · High

Like word2vec and GloVe, fastText assigns one vector per word regardless of the sentence it appears in, so polysemy is not disambiguated.

Fix: For context-sensitive tasks prefer contextual models such as BERT or RoBERTa.

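Model size · Low

FAIR pretrained vectors take several GB per language. Built-in quantization (`fasttext quantize`) can shrink a model to the MB range at some quality cost, but it applies to supervised (classification) models, not to unsupervised word-vector models.

Fix: For classifiers, use built-in quantization (`fasttext quantize -output model -qnorm`, which reads model.bin and writes model.ftz). For word vectors, reduce the dimensionality instead, for example with `fasttext.util.reduce_model` from the official Python bindings.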

Evolution

Original paper · 2017 · Piotr Bojanowski

Enriching Word Vectors with Subword Information

Piotr Bojanowski, Edouard Grave, Armand Joulin, Tomas Mikolov

2013

word2vec — direct predecessor

Inflection point

Mikolov et al. publish skip-gram and CBOW. fastText inherits these training objectives but applies them to character n-grams rather than whole words.

Word2Vec (concept)

2016

Preprint "Enriching Word Vectors with Subword Information"

Inflection point

arXiv:1607.04606 (July 2016) — Bojanowski, Grave, Joulin, Mikolov introduce the subword model. Concurrently, arXiv:1607.01759 — the fastText text classifier — is released.

Enriching Word Vectors with Subword Information (paper)

2017

TACL and EACL publications

The embedding paper appears in TACL 2017, the classifier in EACL 2017 ("Bag of Tricks for Efficient Text Classification").

2018

Pretrained vectors for 157 languages

Inflection point

FAIR releases 300-dimensional pretrained vectors trained on Common Crawl + Wikipedia for 157 languages; they remain a standard baseline for many low-resource languages.

2018

ELMo / BERT — contextual embeddings surpass static

ELMo and BERT introduce contextual word representations; fastText remains a strong baseline under low compute and for languages with limited NLP infrastructure.