score(D,Q) = Σ IDF(qi) · ( f(qi,D)·(k1+1) ) / ( f(qi,D) + k1·(1 - b + b·|D|/avgdl) ). f(qi,D) = frequency of term qi in document D, |D| = length of D, avgdl = average document length. k1 (typically 1.2–2.0) controls TF saturation, b (typically 0.75) the strength of length normalisation. IDF uses the probabilistic variant.
In TF-IDF, term frequency grows linearly (100 occurrences = 100× weight) and does not consistently account for document length. BM25 saturates the contribution of additional occurrences and penalises unnaturally long documents.
The term ( f·(k1+1) ) / ( f + k1·... ) makes the term-frequency contribution grow asymptotically instead of linearly.
The factor (1 - b + b·|D|/avgdl) penalises documents longer than average, limiting their artificial advantage.
IDF(t) = log((N - df + 0.5)/(df + 0.5) + 1) — a variant derived from the probabilistic relevance model.
Official
Defaults k1=1.2, b=0.75 are not optimal for every corpus (e.g. short titles vs long articles).
BM25 still matches lexically; synonyms and paraphrases are not captured.
Robertson and Spärck Jones formulate the probabilistic retrieval model — the theoretical basis of BM25.
The full BM25 formula presented in the Okapi system at the TREC-3 conference.
Robertson and Zaragoza systematise the BM25 family (including BM25F for fields) in a comprehensive survey.
BEIR shows that BM25 remains competitive with dense retrievers in zero-shot retrieval.
Time complexity: O(|Q| · log N) z odwróconym indeksem. Space complexity: O(Σ nnz) odwrócony indeks.
Controls how quickly the term-frequency contribution saturates. Lower k1 = faster saturation.
Strength of document-length normalisation: 0 = none, 1 = full.
Traversing inverted-index posting lists is a CPU/IO-bound operation; no GPU benefit.