Architecture

Naive Bayes

Historical

How it works

1. Training: for each class C, compute the prior P(C) as the class frequency in the training set. For each feature x_i, estimate the conditional probability P(x_i|C) from counts (multinomial, Bernoulli) or from distribution parameters such as the per-class mean and variance (Gaussian).

2. Laplace smoothing: add a constant α (default 1) to every count, so the numerator gains α and the denominator gains α times the number of possible feature values. This avoids zero probabilities for features that were never seen with a class.

3. Prediction: for a new example x, apply the MAP rule: choose the class that maximises P(C)·∏P(x_i|C). In practice, log-sums are used to prevent underflow: log P(C) + Σ log P(x_i|C).

4. Distribution variants: multinomial (token counts, typical in NLP), Bernoulli (binary features), Gaussian (continuous features, assumed normally distributed within each class).
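The steps above can be sketched in a few lines of plain Python. This is a minimal illustration of the multinomial variant, not a reference implementation: the function names `train_multinomial_nb` and `predict`, the toy documents, and the choice to skip tokens outside the training vocabulary are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenised documents."""
    vocab = sorted({tok for doc in docs for tok in doc})
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)  # class -> token -> count
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)

    # Step 1: priors are class frequencies in the training set.
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}

    # Step 2: smoothed conditionals; alpha goes into the numerator and
    # alpha * |vocab| into the denominator (alpha must be > 0 here,
    # otherwise math.log fails on unseen tokens).
    log_likelihood = {}
    for c in class_counts:
        denom = sum(token_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            tok: math.log((token_counts[c][tok] + alpha) / denom) for tok in vocab
        }
    return log_prior, log_likelihood

def predict(doc, log_prior, log_likelihood):
    """Step 3: MAP rule in log space. Tokens outside the training vocabulary
    are skipped (an assumption of this sketch)."""
    scores = {
        c: log_prior[c]
        + sum(log_likelihood[c][tok] for tok in doc if tok in log_likelihood[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Toy data, invented for illustration only.
docs = [["cheap", "pills", "now"], ["meeting", "at", "noon"],
        ["cheap", "offer", "now"], ["lunch", "meeting", "today"]]
labels = ["spam", "ham", "spam", "ham"]

log_prior, log_likelihood = train_multinomial_nb(docs, labels)
prediction = predict(["cheap", "meeting", "now", "now"], log_prior, log_likelihood)
```

Working in log space turns the product of many small probabilities into a sum, which is why long documents do not underflow to zero.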

Problem solved

Classifying objects into categories requires a method that estimates the probability of class membership given the observed features. Modelling the full joint distribution of the features directly is intractable in high dimensions, because the number of parameters grows exponentially with the number of features. Naive Bayes addresses this by assuming that the features are conditionally independent given the class, which reduces the number of parameters to estimate from exponential to linear in the number of features.
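To make the exponential-versus-linear gap concrete, the following sketch counts free parameters for the special case of binary features, ignoring the class priors. The function names are hypothetical and the binary-feature setting is an assumption chosen to keep the arithmetic simple.

```python
def full_joint_params(n_features, n_classes):
    # One probability per combination of binary feature values, per class,
    # minus one per class because each distribution must sum to 1.
    return n_classes * (2 ** n_features - 1)

def naive_bayes_params(n_features, n_classes):
    # One parameter P(x_i = 1 | C) per feature, per class.
    return n_classes * n_features

for n in (10, 20, 30):
    print(n, full_joint_params(n, 2), naive_bayes_params(n, 2))
```

With two classes and 30 binary features, the full joint model needs over two billion parameters, while the naive factorisation needs 60.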

Key mechanisms

Bayes' theorem combined with the independence factorisation: P(C|x) ∝ P(C)·∏P(x_i|C)

Assumption of conditional feature independence given the class

Estimation of prior and conditional probabilities from training-data frequencies

Laplace / Lidstone smoothing for unseen features

MAP (maximum a posteriori) decision — pick the class with the highest posterior

Distribution variants: Multinomial (counts), Bernoulli (binary), Gaussian (continuous)
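For the Gaussian variant, the same mechanisms apply with per-class means and variances in place of counts. The sketch below is one possible way to write it in plain Python; the names `fit_gaussian_nb`, `log_gaussian`, `predict_gaussian_nb`, and the small `var_floor` added to every variance to avoid division by zero are assumptions of this example, as is the toy data.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Estimate the log prior plus per-feature mean and variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)

    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # Maximum-likelihood variance, with a small floor for constant features.
        variances = [
            sum((v - m) ** 2 for v in col) / n + var_floor
            for col, m in zip(columns, means)
        ]
        model[label] = (math.log(n / len(X)), means, variances)
    return model

def log_gaussian(x, mean, var):
    """Log density of a univariate normal distribution."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def predict_gaussian_nb(model, x):
    """MAP decision: log prior plus the sum of per-feature log densities."""
    scores = {
        label: log_prior
        + sum(log_gaussian(v, m, s) for v, m, s in zip(x, means, variances))
        for label, (log_prior, means, variances) in model.items()
    }
    return max(scores, key=scores.get)

# Toy two-feature data, invented for illustration only.
X = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [3.0, 0.5], [3.2, 0.4], [2.9, 0.7]]
y = ["a", "a", "a", "b", "b", "b"]
model = fit_gaussian_nb(X, y)
```

Only the likelihood term changes between variants; the prior, the independence factorisation, and the MAP decision stay the same.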

Strengths & limitations

Strengths

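✓ Very fast training and prediction: training is a single counting pass over the data, and prediction is linear in the number of features

✓ Works with relatively small training sets, since only priors and per-feature parameters are estimated

✓ Easy to implement and interpret

✓ Strong text-classification performance despite the naive assumption

✓ Native multi-class support

✓ Low memory footprint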

Limitations

✗ The feature-independence assumption is almost always violated

✗ Poorly calibrated posterior probabilities (estimates are pushed toward 0 or 1)

✗ Zero-frequency problem: requires smoothing

✗ Limited ability to model feature interactions

✗ Gaussian NB assumes each continuous feature is normally distributed within a class, which rarely holds

✗ Often less accurate than discriminative models (logistic regression, SVM) when data is abundant

Implementation

Implementation pitfalls

Zero probability: unseen words zero out the entire distribution · Medium

If a test word never appeared with a given class in training, the unsmoothed estimate P(word|class)=0 zeros out the entire probability product for that class. Smoothing (Laplace/add-k) is required; without it the classifier is useless on new text.
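The effect is easy to reproduce with a few numbers. The sketch below compares the unsmoothed and add-one estimates for a single class; the counts, the three-word vocabulary, and the helper name `smoothed_prob` are made up for illustration.

```python
import math

def smoothed_prob(count, total, vocab_size, alpha):
    """Add-alpha estimate of P(word | class); alpha = 0 gives the raw frequency."""
    return (count + alpha) / (total + alpha * vocab_size)

# Hypothetical counts for one class: "refund" never occurred in its training documents.
counts = {"free": 30, "offer": 20, "refund": 0}
total = sum(counts.values())
vocab_size = len(counts)
message = ["free", "offer", "refund"]

# Without smoothing, the single zero factor wipes out the whole product.
unsmoothed = math.prod(smoothed_prob(counts[w], total, vocab_size, 0.0) for w in message)

# With alpha = 1, every factor stays strictly positive.
smoothed = math.prod(smoothed_prob(counts[w], total, vocab_size, 1.0) for w in message)

print(unsmoothed, smoothed)
```

In log space the unsmoothed case is even more visible, because taking the logarithm of a zero probability fails outright rather than silently producing a zero score.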

Feature independence assumption rarely holds in language · Medium

Words in a sentence are strongly correlated ("not" before "good" changes the meaning, yet the model scores each word separately). Violating the independence assumption degrades probability calibration: the model may still pick the correct class, but with overconfident probabilities.
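A small calculation shows why correlated features hurt calibration more than accuracy. Suppose one piece of evidence favours the positive class with a likelihood ratio of 2, and the same evidence appears as several perfectly correlated features that the model treats as independent. The function name `posterior_positive` and the numbers are assumptions made for this sketch.

```python
def posterior_positive(likelihood_ratio, copies, prior_positive=0.5):
    """Posterior of the positive class when the same evidence is counted `copies` times."""
    prior_odds = prior_positive / (1 - prior_positive)
    # Independence means the likelihood ratios multiply, once per copy.
    posterior_odds = prior_odds * likelihood_ratio ** copies
    return posterior_odds / (1 + posterior_odds)

for copies in (1, 2, 5, 10):
    print(copies, round(posterior_positive(2.0, copies), 4))
```

The predicted class is the same for every value of `copies`, but the reported confidence climbs from about 0.67 toward 1 even though no new information was added.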

Evolution

Original paper · 1961 · Marvin E. Maron

Automatic Indexing: An Experimental Inquiry

1961

Marvin Maron publishes "Automatic Indexing: An Experimental Inquiry", an early use of a Bayesian classifier for automatic document categorization.

1973

Duda and Hart formalize Naive Bayes in the classic textbook "Pattern Classification and Scene Analysis".

1997

Domingos and Pazzani publish "On the Optimality of the Simple Bayesian Classifier under Zero-One Loss", explaining why NB works well despite the violated independence assumption.

1998

McCallum and Nigam compare multinomial vs Bernoulli NB for text classification — the paper becomes a standard NLP reference.

2002

Paul Graham publishes "A Plan for Spam", which helps popularise Bayesian spam filtering in email clients.