How term frequency is computed: raw / log-scaled / boolean / sublinear. Sublinear (1 + log(tf)) dampens the effect of repeated occurrences of the same word.
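The schemes above can be compared side by side. The sketch below is illustrative only: the function name `term_frequency` and the `scheme` strings are hypothetical, and the natural logarithm is assumed.

```python
import math

def term_frequency(count: int, scheme: str = "raw") -> float:
    """Weight a raw term count under one of the TF schemes (hypothetical helper)."""
    if count <= 0:
        return 0.0                      # absent terms always weigh nothing
    if scheme == "raw":
        return float(count)
    if scheme == "boolean":
        return 1.0                      # presence only, count ignored
    if scheme == "sublinear":
        return 1.0 + math.log(count)    # natural log assumed
    raise ValueError(f"unknown scheme: {scheme}")

once = term_frequency(1, "sublinear")       # 1.0
often = term_frequency(100, "sublinear")    # about 5.6, not 100
```

Under this assumption a term repeated 100 times weighs roughly 5.6 times as much as a term seen once, instead of 100 times as much.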
Whether to add 1 to both the document count and the document frequency, so that a term with df = 0 (unseen in training) cannot cause a divide-by-zero (smooth IDF). scikit-learn additionally adds 1 to the resulting IDF value.
Document vector norm after TF·IDF multiplication: l2 (unit length), l1 or none. L2 is the standard for cosine similarity.
Whether to treat single words (1,1), bigrams (1,2) or longer n-grams as terms. Wider range = richer features but exploding vocabulary.
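A minimal sketch of how an n-gram range expands the term list, assuming whitespace-tokenised input. The helper `word_ngrams` is hypothetical and only mimics the (min_n, max_n) convention used above.

```python
def word_ngrams(tokens, ngram_range=(1, 1)):
    """Return all word n-grams with min_n <= n <= max_n (hypothetical helper)."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the document.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

tokens = "the quick brown fox".split()
unigrams = word_ngrams(tokens, (1, 1))     # 4 terms
uni_and_bi = word_ngrams(tokens, (1, 2))   # 4 unigrams + 3 bigrams = 7 terms
```

Even on a four-word document the term count nearly doubles when bigrams are included, which is the vocabulary growth the trade-off refers to.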
Discards terms appearing in fewer than min_df or in more than max_df documents. Filters out typos / hapax legomena and stop-list-like words.
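The filtering rule can be sketched as follows, assuming min_df and max_df are absolute document counts. The function `filter_by_df` and the toy corpus are made up for illustration.

```python
from collections import Counter

def filter_by_df(tokenized_docs, min_df=1, max_df=None):
    """Keep terms whose document frequency lies in [min_df, max_df] (hypothetical helper)."""
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))             # count each term once per document
    if max_df is None:
        max_df = len(tokenized_docs)    # no upper bound by default
    return {term for term, n in df.items() if min_df <= n <= max_df}

docs = [
    "the cat sat".split(),
    "the dog sat".split(),
    "the bird flew".split(),
    "the catt ran".split(),             # "catt" is a typo that occurs once
]
kept = filter_by_df(docs, min_df=2, max_df=3)
```

Here "the" is dropped because it appears in all four documents, the typo "catt" and the other one-off words are dropped by min_df, and only "sat" survives.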
Time complexity: O(N · L) build, O(|q| · log N) query. Space complexity: O(N · |V|) dense, O(Σ nnz) sparse, where nnz is the number of non-zero entries per document.
TF-IDF is sparse algebra over integers / floats — CPUs with SIMD (AVX) and a good cache hierarchy are the natural target; no benefit from massive GPU parallelism.
The algorithm runs anywhere — from microcontrollers to clusters. Implementations exist in Python, Java, C++, Rust, JS.
Operations are sparse and mostly I/O-bound; tensor cores yield no meaningful speedup. GPUs are used mainly when TF-IDF feeds a model trained on GPU.
TF(t,d) = count of term t in document d / total words in d. IDF(t) = log(N / df(t)), where N = number of documents, df(t) = number of documents containing t. TF-IDF(t,d) = TF(t,d) × IDF(t). Resulting document vectors are sparse and can be used in retrieval and classification.
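The formulas above translate almost directly into code. The following is a minimal sketch, assuming the natural logarithm and pre-tokenised documents. The function name `tf_idf` and the three-document corpus are hypothetical.

```python
import math
from collections import Counter

def tf_idf(tokenized_docs):
    """Return one {term: weight} dict per document, following the formulas above."""
    n_docs = len(tokenized_docs)
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))                     # df(t): documents containing t
    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        total = len(doc)                        # total words in d
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

docs = [
    "the cat sat on the mat".split(),
    "the dog sat".split(),
    "the bird flew away".split(),
]
vectors = tf_idf(docs)
```

In this toy corpus "the" occurs in every document, so its IDF is log(3/3) = 0 and its weight vanishes, while "cat" (one document only) gets (1/6) · log(3). Each dict holds only the terms present in its document, which is what makes the representation sparse.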
Bag-of-Words treats all words equally — words like "the", "is", "and" have high frequency but low informational value. TF-IDF gives lower weights to common words and higher weights to rare, document-specific ones.
Frequency of term t in document d. Common variants: raw count, length-normalised count, log(1+tf), boolean, sublinear scaling.
Global factor penalising terms common across the corpus. Classic form: IDF(t) = log(N / df(t)); smoothed variants: log((N+1)/(df(t)+1))+1 (smooth IDF, scikit-learn) or log((N - df(t) + 0.5)/(df(t) + 0.5)) (probabilistic, BM25).
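The three IDF forms can be compared numerically. This sketch assumes the natural logarithm; the function names are hypothetical, and the formulas are copied from the text above.

```python
import math

def idf_classic(n_docs, df):
    return math.log(n_docs / df)

def idf_smooth(n_docs, df):
    # +1 in numerator and denominator, then +1 on the result
    return math.log((n_docs + 1) / (df + 1)) + 1

def idf_probabilistic(n_docs, df):
    return math.log((n_docs - df + 0.5) / (df + 0.5))

n_docs = 1000
for df in (1, 10, 500, 1000):
    print(df, idf_classic(n_docs, df), idf_smooth(n_docs, df),
          idf_probabilistic(n_docs, df))
```

Two differences are worth noting: the smoothed form never drops below 1, so a term present in every document keeps some weight, whereas the probabilistic form turns negative once a term appears in more than half of the documents.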
After computing TF·IDF, document vectors are typically L2-normalised to unit length, enabling cosine similarity comparison independent of document length.
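A small sketch of why unit-length vectors remove the effect of document length, using sparse {term: weight} dicts. The helpers `l2_normalize` and `dot` and the two example vectors are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a sparse {term: weight} vector to unit L2 length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    if norm == 0.0:
        return dict(vec)                # leave an all-zero vector unchanged
    return {term: w / norm for term, w in vec.items()}

def dot(a, b):
    """Sparse dot product; equals cosine similarity for unit-length inputs."""
    if len(a) > len(b):
        a, b = b, a                     # iterate over the shorter vector
    return sum(w * b.get(term, 0.0) for term, w in a.items())

short_doc = {"cat": 1.0, "mat": 2.0}
long_doc = {"cat": 10.0, "mat": 20.0}   # same proportions, ten times longer
similarity = dot(l2_normalize(short_doc), l2_normalize(long_doc))
```

The two vectors point in the same direction, so their similarity after normalisation is 1 (up to floating-point error) even though the raw weights differ by a factor of ten.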
IBM researcher Hans Peter Luhn proposes automatic document indexing based on word frequency — the foundation of the TF component.
"A Statistical Interpretation of Term Specificity and Its Application in Retrieval" introduces the idea that term specificity (IDF) should weight its document statistics.
"Term-weighting approaches in automatic text retrieval" systematises the family of TF·IDF formulas (SMART notation) still in use today.
Robertson et al. publish Okapi BM25 — a saturating TF variant with document length normalisation that displaces classic TF·IDF in full-text search engines.
Mikolov et al. release Word2Vec — dense semantic representations begin to displace TF·IDF in tasks requiring synonym understanding.
In Retrieval-Augmented Generation systems TF·IDF / BM25 return as sparse retrievers paired with dense embeddings (hybrid search).
Precomputed TF-IDF weights cannot be updated incrementally: each new document changes N, and therefore the IDF of every term. Dynamic corpora require periodic index rebuilding or approximate methods.
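The drift is easy to demonstrate. The sketch below keeps N and df as running counters and derives IDF on demand, which is one possible mitigation when raw counts are stored instead of final weights; it is an assumption of this example, not something the text prescribes. The class `RunningIdf` is hypothetical.

```python
import math
from collections import Counter

class RunningIdf:
    """Keeps N and df as running counters; IDF is derived when requested."""

    def __init__(self):
        self.n_docs = 0
        self.df = Counter()

    def add(self, tokens):
        self.n_docs += 1
        self.df.update(set(tokens))     # one increment per distinct term

    def idf(self, term):
        if self.df[term] == 0:
            raise KeyError(term)        # unseen term: classic IDF is undefined
        return math.log(self.n_docs / self.df[term])

stats = RunningIdf()
stats.add("the cat sat".split())
stats.add("the dog sat".split())
before = stats.idf("cat")               # log(2 / 1)
stats.add("a bird flew".split())        # does not mention "cat" at all
after = stats.idf("cat")                # log(3 / 1)
```

The third document never mentions "cat", yet the IDF of "cat" still changes, so any vector that was weighted and normalised with the old value is now stale.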
TF-IDF treats "car" and "automobile" as independent terms. For tasks requiring semantic matching (question answering, RAG) dense embeddings are a better choice.
Raw TF gives a word occurring 100 times 100× the weight of one occurring once, which rarely reflects relevance. BM25 fixes this via saturation (k1).
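The saturation effect can be seen in a simplified form of the BM25 term-frequency component. This sketch deliberately omits document length normalisation (equivalent to b = 0) and uses k1 = 1.2 as an assumed, commonly quoted default. The function name `bm25_tf` is hypothetical.

```python
def bm25_tf(tf, k1=1.2):
    """Saturating TF component of BM25, without length normalisation."""
    return tf * (k1 + 1) / (tf + k1)

for tf in (1, 2, 10, 100):
    print(tf, round(bm25_tf(tf), 3))
```

The value grows with tf but can never exceed k1 + 1 (2.2 here), so a word occurring 100 times ends up weighing a little over twice as much as a word occurring once, rather than 100 times as much.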
ngram_range=(1,3) without min_df / max_df can produce millions of features, most of which are typos or hapax legomena — the model overfits and the index bloats.
The query must go through the same tokenizer / stemmer / lowercasing as documents in the index. Different preprocessing = miss in the inverted index.
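A minimal sketch of the failure mode, using a toy inverted index. The analyzer here only lowercases and splits on non-alphanumeric characters (no stemming), and all names (`analyze`, `build_index`, `search`) are hypothetical.

```python
import re

def analyze(text):
    """Single analyzer shared by indexing and querying: lowercase, then split."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each analyzed term to the set of document ids containing it."""
    index = {}
    for doc_id, text in enumerate(docs):
        for term in analyze(text):
            index.setdefault(term, set()).add(doc_id)
    return index

def search(index, query, analyzer=analyze):
    """Return ids of documents matching any query term."""
    hits = set()
    for term in analyzer(query):
        hits |= index.get(term, set())
    return hits

docs = ["Cats chase mice.", "Dogs chase cats."]
index = build_index(docs)
matched = search(index, "CATS")                     # same analyzer: finds both docs
missed = search(index, "CATS", analyzer=str.split)  # no lowercasing: finds nothing
```

Routing both indexing and querying through one function, as `analyze` does here, is one simple way to keep the two paths from drifting apart.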