1. Training: for each class C, estimate the prior P(C) as the class frequency in the training set. For each feature x_i, estimate the conditional probability P(x_i|C) from counts (multinomial) or from distribution parameters (Gaussian).
2. Laplace smoothing: add a constant α (default 1) to each count in the numerator, and α times the number of possible feature values (the vocabulary size for text) to the denominator, so that unseen features do not get zero probability.
3. Prediction: for a new example x, apply the MAP rule: choose the class that maximises P(C)·∏ P(x_i|C). In practice, sums of logarithms are used to prevent underflow: log P(C) + Σ log P(x_i|C).
4. Distribution variants: multinomial (token counts, typical in NLP), Bernoulli (binary features), Gaussian (continuous features, assumes normality).
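The training and prediction steps above can be sketched for the multinomial variant in plain Python. This is a minimal illustration, not a reference implementation: the function names, the toy documents, and the choice to skip tokens outside the training vocabulary at prediction time are all assumptions made for the example. The smoothed estimate used is (count + α) / (total + α·|V|), where |V| is the vocabulary size.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log-priors and smoothed log-likelihoods from tokenised documents."""
    vocab = sorted({tok for doc in docs for tok in doc})
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)  # class -> token -> count
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)

    n_docs = len(docs)
    # Prior: class frequency in the training set.
    log_prior = {c: math.log(n / n_docs) for c, n in class_counts.items()}

    log_likelihood = {}
    for c in class_counts:
        total = sum(token_counts[c].values())
        denom = total + alpha * len(vocab)  # smoothing term in the denominator
        log_likelihood[c] = {
            tok: math.log((token_counts[c][tok] + alpha) / denom) for tok in vocab
        }
    return log_prior, log_likelihood

def predict(doc, log_prior, log_likelihood):
    """MAP rule in log space; tokens never seen in training are skipped (an assumption)."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][tok] for tok in doc if tok in log_likelihood[c]
        )
    return max(scores, key=scores.get)

# Hypothetical toy corpus, invented for this example.
docs = [
    ["cheap", "pills", "buy"],
    ["buy", "cheap", "now"],
    ["meeting", "agenda", "today"],
    ["project", "meeting", "notes"],
]
labels = ["spam", "spam", "ham", "ham"]

log_prior, log_likelihood = train_multinomial_nb(docs, labels)
prediction = predict(["cheap", "meeting", "buy"], log_prior, log_likelihood)
print(prediction)  # spam
```

With equal priors, the message scores 3/15 · 1/15 · 3/15 under "spam" and 1/15 · 3/15 · 1/15 under "ham", so the MAP rule picks "spam".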
Classifying objects into categories requires a method that estimates the probability of class membership given observed features. Traditional approaches needed to model the full joint distribution of features, which is computationally intractable in high dimensions. Naive Bayes addresses this by assuming that features are conditionally independent given the class, which reduces the number of parameters to estimate from exponential to linear in the number of features.
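A rough parameter count makes the exponential-versus-linear claim concrete. The sketch below assumes binary features and counts only the class-conditional parameters (priors are ignored); the function names are hypothetical. A full joint model needs one free probability per feature configuration per class, while Naive Bayes needs one per feature per class.

```python
def joint_param_count(n_features, n_classes):
    # 2**n configurations per class, minus one because probabilities sum to 1.
    return n_classes * (2 ** n_features - 1)

def naive_bayes_param_count(n_features, n_classes):
    # One Bernoulli parameter P(x_i = 1 | C) per feature per class.
    return n_classes * n_features

for n in (10, 20, 30):
    print(n, joint_param_count(n, 2), naive_bayes_param_count(n, 2))
# 10 features: 2046 vs 20
# 20 features: 2097150 vs 40
# 30 features: 2147483646 vs 60
```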
If a word in the test text never appeared with a given class in training, the unsmoothed estimate P(word|class) = 0 zeros out the entire probability product for that class. Smoothing (Laplace/add-k) is therefore required; without it the classifier is useless on new text.
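The zero-probability problem is easy to reproduce with a handful of counts. The tokens, vocabulary, and helper function below are made up for illustration; α = 0 corresponds to no smoothing and α = 1 to Laplace smoothing.

```python
from collections import Counter

# Hypothetical token counts for one class ("spam").
spam_tokens = ["buy", "cheap", "pills", "buy", "cheap", "now"]
# "meeting" is in the vocabulary (seen in another class) but never in spam.
vocab = {"buy", "cheap", "pills", "now", "meeting"}
counts = Counter(spam_tokens)

def likelihood(token, alpha):
    """Add-alpha estimate of P(token | spam)."""
    return (counts[token] + alpha) / (len(spam_tokens) + alpha * len(vocab))

message = ["buy", "cheap", "meeting"]

unsmoothed = 1.0
smoothed = 1.0
for tok in message:
    unsmoothed *= likelihood(tok, alpha=0.0)
    smoothed *= likelihood(tok, alpha=1.0)

print(unsmoothed)  # 0.0, the single unseen word wipes out the product
print(smoothed)    # small but non-zero (9/1331)
```

Without smoothing, the two strongly spam-like words contribute nothing once "meeting" multiplies the product by zero; with α = 1 the product is 3/11 · 3/11 · 1/11.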
Words in a sentence are strongly correlated ("not" before "good" changes semantics). Violating the independence assumption degrades probability calibration: the model may classify correctly but with wrong confidences.
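One way to see the calibration effect is an extreme case of correlation: the same piece of evidence counted several times. The sketch below uses assumed likelihoods (0.6 under spam, 0.4 under ham) and equal priors; the function name and the numbers are illustrative only. Because Naive Bayes treats each copy as independent evidence, the posterior drifts towards 1 while the predicted class stays the same.

```python
import math

def posterior_spam(lik_spam, lik_ham, copies, prior_spam=0.5):
    """Naive Bayes posterior P(spam | x) when one feature is repeated `copies` times."""
    s = math.log(prior_spam) + copies * math.log(lik_spam)
    h = math.log(1.0 - prior_spam) + copies * math.log(lik_ham)
    m = max(s, h)  # subtract the max before exponentiating, for numerical stability
    return math.exp(s - m) / (math.exp(s - m) + math.exp(h - m))

p1 = posterior_spam(0.6, 0.4, copies=1)
p3 = posterior_spam(0.6, 0.4, copies=3)
p10 = posterior_spam(0.6, 0.4, copies=10)

print(round(p1, 3), round(p3, 3), round(p10, 3))
```

A single observation gives a posterior of 0.6. Ten perfectly correlated copies carry no additional information, yet the reported confidence rises above 0.98.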
In text classification (e.g. 20 Newsgroups, Reuters-21578) multinomial Naive Bayes typically reaches 70–85% accuracy, trailing logistic regression and SVMs by a few points on larger datasets. In spam filtering it historically achieved >95% precision, kickstarting the era of Bayesian email filters. It is still used as an NLP baseline in research papers.