1. Find the hyperplane w·x + b = 0 maximising the margin 2/‖w‖ subject to y_i(w·x_i + b) ≥ 1. 2. Reformulate as a quadratic programming (QP) problem (or its Lagrangian dual), where support vectors — points on the margin — determine the solution. 3. Soft-margin (parameter C): allow margin violations via slack variables ξ_i ≥ 0. Large C → narrow margin, fewer training errors; small C → wide margin, more errors tolerated. 4. Kernel trick: replace the dot product with a kernel function K(x_i, x_j) = φ(x_i)·φ(x_j) without computing φ explicitly. Common kernels: RBF K = exp(−γ‖x−x'‖²), polynomial, sigmoid. 5. Prediction: sign(Σ α_i y_i K(x_i, x) + b), summing only over support vectors (α_i > 0). 6. Training: the SMO (Sequential Minimal Optimization) algorithm decomposes the QP into two-variable sub-problems solved analytically.
Linear classifiers find any separating hyperplane but provide no generalization guarantee — infinitely many valid planes exist. SVM solves this by selecting the maximum-margin hyperplane (maximizing distance to the nearest points of each class), minimizing the risk of error on unseen data. Non-linear decision boundaries are addressed by the kernel trick, which maps data into a higher-dimensional space without explicit transformation.
Standard SVM solvers (SMO, libsvm) have quadratic or cubic complexity w.r.t. sample count — for n>100k training is impractical. Alternatives: SGD-SVM, LinearSVC (O(n)).
Kernel choice (RBF, poly, linear) and parameters C, γ strongly affect results — no default values working in all cases. Grid search + CV is costly for large datasets.
Before the deep-learning era SVM dominated text classification (20 Newsgroups, Reuters-21578, RCV1) with 85–92% accuracy, clearly above Naive Bayes. On MNIST digit recognition a linear SVM reaches ~92% and an RBF-kernel SVM ~98.6%. In ImageNet 2010–2011 SVM was the standard until AlexNet (2012) was introduced. On small datasets (under 10k samples) SVM remains competitive with neural networks.