TF-IDF

Hyperparameters (configurable axes)

TF variant · Medium

How term frequency is computed: raw, boolean, or sublinear (log-scaled). Sublinear scaling, 1 + log(tf), dampens the effect of repeated occurrences of the same word.

raw: Raw occurrence count.

sublinear (1 + log tf): Default in many IR implementations.

boolean: Term presence / absence only.
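The three variants above can be compared on a single tokenised document. This is a minimal sketch using only the standard library; the function name `term_frequencies` and the `variant` argument are illustrative, not taken from any particular library.

```python
import math
from collections import Counter

def term_frequencies(tokens, variant="raw"):
    """Return {term: weight} for one tokenised document."""
    counts = Counter(tokens)
    if variant == "raw":
        return {t: float(c) for t, c in counts.items()}
    if variant == "sublinear":
        # 1 + ln(tf); every term in `counts` has tf >= 1, so the log is defined
        return {t: 1.0 + math.log(c) for t, c in counts.items()}
    if variant == "boolean":
        return {t: 1.0 for t in counts}
    raise ValueError(f"unknown TF variant: {variant}")

doc = "the cat sat on the mat the end".split()
raw = term_frequencies(doc, "raw")            # "the" has weight 3.0
sub = term_frequencies(doc, "sublinear")      # "the" has weight 1 + ln(3)
boolean = term_frequencies(doc, "boolean")    # "the" has weight 1.0
```

Note how the gap between "the" and a word that occurs once shrinks from 3:1 under raw counts to roughly 2.1:1 under sublinear scaling, and disappears under boolean weighting.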

IDF smoothing · Medium

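Whether to add 1 to both the document count and the document frequency, as if one extra document containing every term had been seen; this avoids divide-by-zero for terms unseen in training (smooth IDF). scikit-learn additionally adds 1 to the resulting IDF so that terms occurring in every document are not zeroed out.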

smooth_idf=True: Default in sklearn TfidfVectorizer.

smooth_idf=False: Classic log(N/df) form; scikit-learn still adds 1 to the result.
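To make the effect of the switch concrete, the sketch below reproduces the two formulas as documented for scikit-learn (natural logarithm, with 1 added to the result in both cases). The function `idf` is a stand-alone illustration, not the library's code.

```python
import math

def idf(n_docs, df, smooth=True):
    """IDF following the formulas documented for scikit-learn."""
    if smooth:
        # Acts as if one extra document containing every term had been seen.
        return math.log((1 + n_docs) / (1 + df)) + 1.0
    # Unsmoothed form: undefined (division by zero) when df == 0.
    return math.log(n_docs / df) + 1.0

everywhere = idf(10, 10)                # term in every document -> 1.0
rare_smooth = idf(10, 1, smooth=True)
rare_plain = idf(10, 1, smooth=False)
unseen = idf(10, 0, smooth=True)        # finite thanks to smoothing
```

With smoothing off, a term with df = 0 raises `ZeroDivisionError` in this sketch, which is exactly the case smoothing is meant to cover.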

Vector normalisation · Medium

Document vector norm after TF·IDF multiplication: l2 (unit length), l1 or none. L2 is the standard for cosine similarity.

l2: Standard choice for cosine similarity.

none: Raw TF·IDF.
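The reason L2 pairs naturally with cosine similarity is that, once both vectors have unit length, the cosine is just their dot product. A minimal sketch with sparse vectors stored as dictionaries (the helper names `l2_normalise` and `dot` are illustrative):

```python
import math

def l2_normalise(vec):
    """Scale a sparse {term: weight} vector to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    if norm == 0.0:
        return dict(vec)  # leave an all-zero vector unchanged
    return {t: w / norm for t, w in vec.items()}

def dot(a, b):
    """Dot product of two sparse vectors; iterates over the smaller one."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b[t] for t, w in a.items() if t in b)

# Same term proportions, but the second document is ten times longer.
short = {"cat": 1.0, "mat": 2.0}
long_ = {"cat": 10.0, "mat": 20.0}
cos = dot(l2_normalise(short), l2_normalise(long_))
```

Here `cos` comes out as 1.0 up to floating-point error, showing that document length no longer affects the comparison.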

N-gram range · High

Whether to treat single words (1,1), unigrams plus bigrams (1,2), or longer n-grams as terms. A wider range gives richer features but a rapidly growing vocabulary.

(1, 1): Unigrams only, the simplest baseline.

(1, 2): Unigrams + bigrams, popular in text classification.
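As an illustration of what the range means in practice, the sketch below expands a token list into word n-grams. The function `word_ngrams` is a hypothetical helper; its `ngram_range` argument mirrors the (min_n, max_n) convention used above.

```python
def word_ngrams(tokens, ngram_range=(1, 1)):
    """Return all word n-grams with min_n <= n <= max_n, in order."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

tokens = "new york city hall".split()
unigrams = word_ngrams(tokens, (1, 1))   # 4 terms
uni_bi = word_ngrams(tokens, (1, 2))     # 4 unigrams + 3 bigrams = 7 terms
```

Even on four tokens the term count nearly doubles when bigrams are added; on a real corpus the growth is far steeper because most bigrams are rare.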

Document frequency filters · High

Discards terms appearing in fewer than min_df or in more than max_df documents. Filters out typos / hapax legomena and stop-list-like words.

min_df=2, max_df=0.95: Conservative production setting.
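A rough sketch of the filtering step follows. It assumes `min_df` is an absolute document count and `max_df` a fraction of the corpus, which matches the example setting above; `filter_vocabulary` is an illustrative name.

```python
from collections import Counter

def filter_vocabulary(tokenised_docs, min_df=2, max_df=0.95):
    """Keep terms whose document frequency lies within [min_df, max_df * N]."""
    n_docs = len(tokenised_docs)
    df = Counter()
    for tokens in tokenised_docs:
        df.update(set(tokens))  # count each term at most once per document
    max_count = max_df * n_docs
    return {t for t, c in df.items() if min_df <= c <= max_count}

docs = [
    "the cat sat".split(),
    "the dog sat".split(),
    "the cat ran".split(),
    "the brid flew".split(),   # "brid" is a deliberate typo
]
vocab = filter_vocabulary(docs, min_df=2, max_df=0.95)
# vocab == {"cat", "sat"}: "the" is too common, the rest occur only once
```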

Computational complexity

Time complexity: O(N · L) build, O(|q| · log N) query. Space complexity: O(N · |V|) dense, O(Σ nnz) sparse.
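The gap between the dense and sparse space bounds can be seen on a toy corpus. The sketch below builds a simple inverted index in one pass over the tokens (the O(N · L) build) and compares the number of stored entries with the size of a dense document-term matrix. The names are illustrative, and a real index would add compression and sorted postings.

```python
from collections import Counter, defaultdict

def build_inverted_index(tokenised_docs):
    """Map each term to {doc_id: raw count}, touching every token once."""
    index = defaultdict(dict)
    for doc_id, tokens in enumerate(tokenised_docs):
        for term, count in Counter(tokens).items():
            index[term][doc_id] = count
    return index

docs = ["a b a".split(), "b c".split(), "d".split()]
index = build_inverted_index(docs)

nnz = sum(len(postings) for postings in index.values())  # 5 stored entries
dense_cells = len(docs) * len(index)                      # 3 docs x 4 terms = 12
```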

Execution paradigm

Primary mode

Sparse

Activation pattern

Subset active

Parallelism

Parallelism level

Fully parallel

Scope

Training, Inference

Hardware requirements

Primary

TF-IDF is sparse algebra over integers / floats — CPUs with SIMD (AVX) and a good cache hierarchy are the natural target; no benefit from massive GPU parallelism.

Good fit

The algorithm runs anywhere — from microcontrollers to clusters. Implementations exist in Python, Java, C++, Rust, JS.

Limited

Operations are sparse and mostly I/O-bound; tensor cores yield no meaningful speedup. GPUs are used mainly when TF-IDF feeds a model trained on GPU.

Components

Term Frequency (TF) · Local (per-document) signal

Frequency of term t in document d. Common variants: raw count, length-normalised count, log(1+tf), boolean, sublinear scaling.

Official

Inverse Document Frequency (IDF) · Global (per-corpus) signal

Global factor penalising terms common across the corpus. Classic form: IDF(t) = log(N / df(t)); smoothed variants: log((N+1)/(df(t)+1))+1 (smooth IDF, scikit-learn) or log((N - df(t) + 0.5)/(df(t) + 0.5)) (probabilistic, BM25).
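One practical difference between the classic and the probabilistic form is their behaviour for very common terms. The sketch below evaluates both formulas as written above (natural logarithm assumed; function names are illustrative):

```python
import math

def idf_classic(n_docs, df):
    """log(N / df): zero for a term in every document, never negative."""
    return math.log(n_docs / df)

def idf_bm25(n_docs, df):
    """log((N - df + 0.5) / (df + 0.5)): negative once df exceeds N / 2."""
    return math.log((n_docs - df + 0.5) / (df + 0.5))

n = 100
rare_classic, rare_bm25 = idf_classic(n, 1), idf_bm25(n, 1)
common_classic, common_bm25 = idf_classic(n, 90), idf_bm25(n, 90)
```

For the term found in 90 of 100 documents, the classic form stays slightly positive while the probabilistic form goes negative, which is why some implementations clamp or shift it.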

Official

Document vector normalisation · Output scaling

After computing TF·IDF, document vectors are typically L2-normalised to unit length, enabling cosine similarity comparison independent of document length.

Official

Evolution

Original paper · 1972 · Karen Spärck Jones

A Statistical Interpretation of Term Specificity and Its Application in Retrieval

Karen Spärck Jones

1957

Hans Peter Luhn — TF as a foundation

IBM researcher Hans Peter Luhn proposes automatic document indexing based on word frequency — the foundation of the TF component.

1972

Karen Spärck Jones defines IDF

Inflection point

"A Statistical Interpretation of Term Specificity and Its Application in Retrieval" introduces the idea that a term's specificity, measured by how few documents contain it (IDF), should weight its matches in retrieval.

1988

Salton & Buckley — canonical TF·IDF variants

Inflection point

"Term-weighting approaches in automatic text retrieval" systematises the family of TF·IDF formulas (SMART notation) still in use today.

1994

BM25 as a probabilistic successor

Inflection point

Robertson et al. publish Okapi BM25 — a saturating TF variant with document length normalisation that displaces classic TF·IDF in full-text search engines.

2013

Word2Vec and the dense embeddings era

Mikolov et al. release Word2Vec — dense semantic representations begin to displace TF·IDF in tasks requiring synonym understanding.

2020

Renaissance in hybrid RAG

In Retrieval-Augmented Generation systems TF·IDF / BM25 return as sparse retrievers paired with dense embeddings (hybrid search).

Implementation

Reference implementations

scikit-learn TfidfVectorizer

Python · scikit-learn

Official

gensim TfidfModel

Python · RaRe Technologies

Official

Apache Lucene Similarity (ClassicSimilarity)

Java · Apache Software Foundation

Official

Elasticsearch / OpenSearch

Java · Elastic / OpenSearch project

Official

Implementation pitfalls

IDF requires the full corpus at index build time · Medium

Stored TF-IDF weights cannot simply be appended to: each new document changes N, and with it the IDF of all terms. Dynamic corpora require periodic index rebuilding or approximate methods.

Fix: Use BM25 with approximate IDF statistics updated in batches, or the hashing trick (HashingVectorizer) for streaming; note that hashing alone is stateless and carries no IDF weighting.
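The staleness problem can be illustrated with a small running-statistics class. The corpus statistics (N and df) are cheap to update; what goes out of date is any weight already computed from them. `RunningIdf` is a hypothetical helper that uses the smoothed IDF form.

```python
import math
from collections import Counter

class RunningIdf:
    """Keeps N and per-term document frequencies up to date as documents arrive."""

    def __init__(self):
        self.n_docs = 0
        self.df = Counter()

    def add(self, tokens):
        self.n_docs += 1
        self.df.update(set(tokens))  # each term counted once per document

    def idf(self, term):
        # Smoothed form, so unseen terms do not divide by zero.
        return math.log((1 + self.n_docs) / (1 + self.df[term])) + 1.0

stats = RunningIdf()
stats.add("cat sat".split())
before = stats.idf("cat")
stats.add("dog ran".split())      # does not contain "cat" at all
after = stats.idf("cat")          # still changes, because N changed
```

A batch-update policy would recompute stored document vectors only every so often, accepting slightly stale IDF values in between.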

No semantic understanding: synonyms treated as different terms · High

TF-IDF treats "car" and "automobile" as independent terms. For tasks requiring semantic matching (question answering, RAG) dense embeddings are a better choice.

Fix: Combine TF-IDF / BM25 (sparse retrieval) with dense embeddings in a hybrid search architecture; consider query expansion via synonyms or stemming.

TF grows linearly: frequent words dominate the score · Medium

Raw TF gives a word occurring 100 times 100× the weight of one occurring once, which rarely reflects relevance. BM25 fixes this via saturation (k1).

Fix: Use sublinear TF (1 + log tf) or switch to BM25.
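The three growth behaviours can be put side by side. This sketch uses the BM25 term-frequency factor with length normalisation switched off (b = 0) so that only the saturation effect of k1 is visible; k1 = 1.2 is a commonly used value chosen here for illustration.

```python
import math

def tf_raw(tf):
    return float(tf)

def tf_sublinear(tf):
    return 1.0 + math.log(tf) if tf > 0 else 0.0

def tf_bm25(tf, k1=1.2):
    """BM25 TF factor without length normalisation; bounded above by k1 + 1."""
    return tf * (k1 + 1.0) / (tf + k1)

for tf in (1, 10, 100):
    print(tf, tf_raw(tf), round(tf_sublinear(tf), 3), round(tf_bm25(tf), 3))
```

Raw TF grows a hundredfold between tf = 1 and tf = 100, sublinear TF grows by less than a factor of six, and the BM25 factor never exceeds k1 + 1 = 2.2.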

Vocabulary explosion with n-grams and no df filters · High

ngram_range=(1,3) without min_df / max_df can produce millions of features, most of which are typos or hapax legomena — the model overfits and the index bloats.

Fix: Always set min_df ≥ 2 and max_df ≤ 0.95; consider HashingVectorizer for very large corpora.
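The idea behind the hashing trick is to cap the feature space at a fixed size regardless of how many distinct terms appear. The sketch below is a simplified illustration using `zlib.crc32` as a stand-in hash; it is not how HashingVectorizer is implemented, and `hashed_counts` and `n_features` are illustrative names.

```python
import zlib

def hashed_counts(tokens, n_features=16):
    """Map tokens into a fixed-length count vector; no vocabulary is stored."""
    vec = [0] * n_features
    for tok in tokens:
        bucket = zlib.crc32(tok.encode("utf-8")) % n_features
        vec[bucket] += 1  # distinct tokens may collide in the same bucket
    return vec

vec = hashed_counts("new york city hall new york".split(), n_features=16)
```

The trade-off is that collisions merge unrelated terms and the mapping cannot be inverted, so document-frequency filtering by term is no longer possible.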

Preprocessing mismatch between indexing and querying · High

The query must go through the same tokenizer / stemmer / lowercasing as documents in the index. Any difference in preprocessing causes query terms to miss their entries in the inverted index.

Fix: Persist the full preprocessing pipeline (e.g. sklearn Pipeline) and serialise it together with the index.
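A simple way to rule out this class of bug is to route both indexing and querying through one shared function. The sketch below is a toy boolean index under that assumption; `preprocess`, `build_index` and `search` are illustrative names, and the tokenizer is deliberately minimal.

```python
import re

def preprocess(text):
    """Single source of truth: lowercase, then split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    index = {}
    for doc_id, text in enumerate(docs):
        for term in preprocess(text):
            index.setdefault(term, set()).add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query term."""
    hits = [index.get(term, set()) for term in preprocess(query)]
    return set.intersection(*hits) if hits else set()

docs = ["The Cat sat.", "A dog ran!"]
index = build_index(docs)
matched = search(index, "CAT")       # {0}: the query is lowercased like the documents
missed = index.get("CAT", set())     # set(): a raw, unprocessed lookup finds nothing
```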
