1) Each word w is expanded to a subword representation: boundary markers are added ("where" becomes "<where>", whose 3-grams are <wh, whe, her, ere, re>), and a bag of character n-grams G_w is formed for lengths minn to maxn (typically 3–6), plus the whole word itself.
2) Each n-gram g is assigned a vector z_g; the word vector is v_w = Σ_{g∈G_w} z_g.
3) Training proceeds as in word2vec (skip-gram or CBOW with negative sampling), but gradients are propagated to all n-grams composing the word.
4) To bound the parameter count, n-grams are hashed into a fixed number of buckets via the hashing trick (Fowler-Noll-Vo hash, typically 2M buckets).
5) For OOV words at inference time, a vector is computed from the word's n-grams even if the word never appeared in training.
6) In text-classification mode (a separate model), a sentence is represented as the mean of its word and n-gram vectors, and a linear classifier with hierarchical softmax computes class probabilities, enabling training on millions of examples in minutes.
Word2vec and GloVe learn one vector per word from a closed vocabulary, so out-of-vocabulary (OOV) words (rare terms, misspellings, neologisms, inflected forms in languages like Polish, Finnish, Turkish) have no representation. Moreover, these models ignore the internal morphological structure of words: "running" and "runner" are unrelated tokens to them. fastText addresses both problems by composing word vectors from subword vectors.
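Set of character n-grams of the word with lengths minn..maxn, computed after adding the boundary markers < and >. For minn=3, maxn=6 the word "where" generates the 3-grams <wh, whe, her, ere, re>, then longer n-grams such as <whe and where, …, plus the special whole-word sequence <where>.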
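The n-gram extraction described above can be sketched in a few lines of Python. This is a minimal illustration, not the reference implementation: the function name `char_ngrams` is hypothetical, and it works on Python characters rather than the UTF-8 handling of the C++ code.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Return the character n-grams of `word` after adding < and > markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(minn, maxn + 1):
        # Slide a window of width n over the wrapped word.
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


trigrams = char_ngrams("where", minn=3, maxn=3)
print(trigrams)  # ['<wh', 'whe', 'her', 'ere', 're>']

all_grams = char_ngrams("where")  # lengths 3 to 6
print(len(all_grams))  # 14
```

The whole-word sequence `<where>` has length 7, so it falls outside the 3 to 6 range and would be added to G_w separately.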
Trainable vectors for each n-gram (after hashing to a bucket). The word vector is the sum v_w = Σ z_g over all g in G_w.
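A toy sketch of the sum v_w = Σ z_g, which also shows why an unseen word still gets a vector. Everything here is illustrative: `char_ngrams`, `word_vector`, and `toy_table` are hypothetical names, the vectors are made up, and a plain dictionary stands in for the hashed bucket table (so unknown n-grams are simply skipped, which real hashing would not do).

```python
def char_ngrams(word, minn=3, maxn=6):
    """Character n-grams of `word` with < and > boundary markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(minn, maxn + 1)
        for i in range(len(wrapped) - n + 1)
    ]


def word_vector(word, ngram_vectors, dim):
    """Compute v_w as the sum of z_g over the n-grams of `word` found in the table."""
    subwords = char_ngrams(word) + [f"<{word}>"]  # n-grams plus the whole word
    total = [0.0] * dim
    for gram in subwords:
        z = ngram_vectors.get(gram)  # toy lookup instead of a hashed bucket
        if z is not None:
            total = [t + x for t, x in zip(total, z)]
    return total


# Toy table: every n-gram of the one "training" word gets the same made-up vector.
toy_table = {g: [1.0, 2.0] for g in char_ngrams("running") + ["<running>"]}

seen = word_vector("running", toy_table, dim=2)
oov = word_vector("runner", toy_table, dim=2)  # never in the table as a word
unrelated = word_vector("xyz", toy_table, dim=2)
```

Here "runner" shares the n-grams `<ru`, `run`, `unn`, `<run`, `runn`, and `<runn` with "running", so its vector is non-zero, while "xyz" shares nothing and stays at zero.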
The number of unique n-grams grows rapidly with corpus size. fastText hashes each n-gram with the Fowler-Noll-Vo function and takes the result modulo the bucket count (typically 2M), so colliding n-grams share a vector.
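The bucket assignment can be sketched as follows, assuming the 32-bit FNV-1a variant of the Fowler-Noll-Vo hash. The function names are hypothetical, and the reference C++ code may treat bytes differently for non-ASCII input, so indices are only illustrative.

```python
FNV_OFFSET_BASIS = 2166136261  # standard 32-bit FNV offset basis
FNV_PRIME = 16777619           # standard 32-bit FNV prime


def fnv1a_32(text):
    """32-bit FNV-1a hash of the UTF-8 bytes of `text`."""
    h = FNV_OFFSET_BASIS
    for byte in text.encode("utf-8"):
        h ^= byte                          # xor first (the "1a" ordering)
        h = (h * FNV_PRIME) & 0xFFFFFFFF   # multiply, keep the low 32 bits
    return h


def bucket_index(ngram, bucket=2_000_000):
    """Map an n-gram to a row of the shared bucket table."""
    return fnv1a_32(ngram) % bucket


for gram in ["<wh", "whe", "her", "ere", "re>"]:
    print(gram, bucket_index(gram))
```

Two different n-grams that land on the same index share one trainable vector, which is the collision cost the hashing trick accepts in exchange for a fixed table size.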
Loss function inherited from word2vec: predicting context words (skip-gram) or the center word from its context (CBOW), with negative sampling as a cheap approximation to the full softmax.
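As a rough illustration of the negative-sampling objective for one (word, context) pair, the sketch below computes the loss -log σ(v_w · c) - Σ log σ(-v_w · n) over the sampled negatives n. The function names are hypothetical, and `word_vec` is assumed to be the already-summed subword vector v_w.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def negative_sampling_loss(word_vec, context_vec, negative_vecs):
    """Loss for one positive pair plus k sampled negative context vectors."""
    loss = -log_sigmoid(dot(word_vec, context_vec))   # pull the true context closer
    for neg in negative_vecs:
        loss -= log_sigmoid(-dot(word_vec, neg))      # push sampled negatives away
    return loss


v_w = [0.5, -0.2]
aligned = negative_sampling_loss(v_w, [2.0, -1.0], [[-2.0, 1.0]])
misaligned = negative_sampling_loss(v_w, [-2.0, 1.0], [[2.0, -1.0]])
```

In fastText the gradient of this loss with respect to v_w is then applied to every n-gram vector z_g that was summed into v_w.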
In text-classification mode the output layer is a Huffman tree over classes, reducing softmax cost from O(K) to O(log K), where K is the number of classes. Critical when K reaches hundreds of thousands.
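To make the O(log K) claim concrete, the sketch below builds Huffman code lengths from class frequencies. A class's code length is the number of binary decisions needed to reach it in the tree. The function name and the class labels are hypothetical, and this only computes path lengths rather than training a classifier.

```python
import heapq
from itertools import count


def huffman_code_lengths(class_counts):
    """Return {label: depth in the Huffman tree} for the given class frequencies."""
    tiebreak = count()  # unique counter so heap entries never compare label lists
    heap = [(freq, next(tiebreak), [label]) for label, freq in class_counts.items()]
    heapq.heapify(heap)
    depth = {label: 0 for label in class_counts}
    while len(heap) > 1:
        f1, _, labels1 = heapq.heappop(heap)  # two least frequent subtrees
        f2, _, labels2 = heapq.heappop(heap)
        merged = labels1 + labels2
        for label in merged:
            depth[label] += 1                 # every leaf below moves one level down
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return depth


lengths = huffman_code_lengths({"sports": 5, "politics": 2, "tech": 1, "art": 1})
print(lengths)  # {'sports': 1, 'politics': 2, 'tech': 3, 'art': 3}
```

Frequent classes sit near the root, so the expected number of sigmoid evaluations per example is lower than for a balanced tree.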
The hashing trick maps n-grams to a finite bucket count (default 2M). With very large vocabularies, distinct n-grams share vectors, which may degrade embedding quality for rare subwords.
Default 3–6 characters is fine for English. Languages with long morphemes (German, Finnish) benefit from longer n-grams; logographic languages (Chinese, Japanese) require different settings or upstream segmentation.
Like word2vec and GloVe, fastText assigns one vector per word. Polysemy is not disambiguated.
FAIR pretrained vectors take several GB per language. Quantization (`fasttext quantize`) can compress them to the megabyte range, at some cost in quality.
Mikolov et al. publish skip-gram and CBOW (2013). fastText inherits these training objectives but applies them to character n-grams as well as whole words.
arXiv:1607.04606 (July 2016) — Bojanowski, Grave, Joulin, Mikolov introduce the subword model. Concurrently, arXiv:1607.01759 — the fastText text classifier — is released.
The embedding paper appears in TACL 2017, the classifier in EACL 2017 ("Bag of Tricks for Efficient Text Classification").
FAIR releases 300-dimensional pretrained vectors trained on Common Crawl and Wikipedia for 157 languages, still a standard baseline for many low-resource languages.
ELMo and BERT introduce contextual word representations; fastText remains a strong baseline under low compute and for languages with limited NLP infrastructure.
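Time complexity: O(T · |G_w| · k · d), where T is the number of training tokens, |G_w| the number of n-grams per word, k the number of negative samples, and d the vector dimension. Space complexity: O((|V| + bucket) · d), where |V| is the vocabulary size and bucket the hash table size.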
Each gradient update propagates the signal to all n-grams of a word (typically 10–20). This makes fastText slower than word2vec on the same corpus, but still roughly an order of magnitude faster than deep neural models.
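A back-of-the-envelope calculation for the two quantities above. The helper names are hypothetical; the 2,000,000-word vocabulary and 4-byte floats are assumptions chosen for illustration, not figures from the text.

```python
def ngrams_per_word(word_length, minn=3, maxn=6):
    """Number of character n-grams for a word, counting the < and > markers."""
    wrapped_length = word_length + 2
    return sum(
        max(0, wrapped_length - n + 1)  # window positions for n-grams of length n
        for n in range(minn, maxn + 1)
    )


def input_matrix_bytes(vocab_size, bucket, dim, bytes_per_float=4):
    """Size of the (|V| + bucket) x d input matrix, assuming 4-byte floats."""
    return (vocab_size + bucket) * dim * bytes_per_float


print(ngrams_per_word(5))  # 14, e.g. "where"
print(ngrams_per_word(7))  # 22, e.g. "running"

# Assumed setup: 2M vocabulary words, 2M buckets, 300 dimensions.
size = input_matrix_bytes(vocab_size=2_000_000, bucket=2_000_000, dim=300)
print(size / 1e9)  # 4.8
```

Under these assumptions the input matrix alone is about 4.8 GB, and each update on a 5 to 7 letter word touches roughly 14 to 22 n-gram vectors.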
Shortest character n-gram included in G_w. Default 3.
Longest character n-gram. Default 6. Setting maxn=0 disables subword units and reduces fastText to word2vec.
Dimensionality of word and n-gram vectors. Default 100.
Hash bucket table size for n-grams. Larger = fewer collisions, more memory. Default 2,000,000.
Number of words on each side considered as context (skip-gram/CBOW). Default 5.
Passes over the corpus. Default 5.
SGD step size. Default 0.05 for skip-gram/CBOW, 0.1 for supervised.
Number of negative samples in negative sampling. Default 5.
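The parameters above map onto flags of the `fasttext` command-line tool (`-minn`, `-maxn`, `-dim`, `-bucket`, `-ws`, `-epoch`, `-lr`, `-neg`). The sketch below only assembles an argument list and does not run anything. The helper `fasttext_command` and the file names are hypothetical, and it covers only the skip-gram and CBOW modes with the unsupervised defaults listed above.

```python
def fasttext_command(input_path, output_prefix, model="skipgram", **overrides):
    """Build an argument list for unsupervised fastText training."""
    if model not in ("skipgram", "cbow"):
        raise ValueError("this sketch only covers skipgram and cbow")
    params = {
        "minn": 3,            # shortest character n-gram
        "maxn": 6,            # longest character n-gram (0 disables subwords)
        "dim": 100,           # vector dimensionality
        "bucket": 2_000_000,  # hash table size for n-grams
        "ws": 5,              # context window size
        "epoch": 5,           # passes over the corpus
        "lr": 0.05,           # learning rate for skipgram/cbow
        "neg": 5,             # negative samples
    }
    params.update(overrides)
    args = ["fasttext", model, "-input", input_path, "-output", output_prefix]
    for name, value in params.items():
        args += [f"-{name}", str(value)]
    return args


# Hypothetical corpus and output names; maxn=0 turns subword units off.
cmd = fasttext_command("corpus.txt", "vectors", maxn=0)
print(" ".join(cmd))
```

The resulting list could be passed to `subprocess.run` on a machine where the `fasttext` binary is installed.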
Static embeddings (one vector per word, context-independent), compositional at the subword level. No attention or recurrence.
Hogwild!-style multithreaded training — asynchronous SGD updates without locks on shared parameters. Scales well to dozens of CPU cores.
The reference C++ implementation is CPU-only, multithreaded (Hogwild!-style), optimized for modern CPUs. fastText reaches hundreds of thousands of words/second per core.
Inference (vector lookup) is trivial and runs anywhere. PyTorch/TensorFlow GPU ports exist but rarely outperform the reference C++ implementation.