One specifies a parametric model f_θ, a loss function L measuring prediction quality on data, and an optimization algorithm (most commonly a variant of stochastic gradient descent) that searches for parameters θ minimizing L on the training set. Generalization — the model's ability to perform well on unseen data — is evaluated on validation and test sets and controlled through regularization, data augmentation, and capacity tuning. In supervised learning the data are (input, label) pairs; in unsupervised learning only inputs; in self-supervised learning labels are constructed automatically from the data itself (e.g. next-token prediction); in reinforcement learning an agent learns a policy that maximizes cumulative reward through interaction with an environment.
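The recipe above (model, loss, optimizer) can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the linear model, the noise-free toy data, and the names `predict`, `mse`, and `sgd_fit` are all assumptions made for the example.

```python
import random

def predict(theta, x):
    """Parametric model f_theta(x) = w * x + b (an assumed toy model)."""
    w, b = theta
    return w * x + b

def mse(theta, data):
    """Loss L: mean squared error over (input, label) pairs."""
    return sum((predict(theta, x) - y) ** 2 for x, y in data) / len(data)

def sgd_fit(data, lr=0.05, epochs=200, seed=0):
    """Plain stochastic gradient descent, one example per update."""
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)              # visit examples in random order
        for x, y in data:
            err = (w * x + b) - y      # residual for this example
            w -= lr * 2 * err * x      # d/dw of err**2
            b -= lr * 2 * err          # d/db of err**2
    return w, b

# Hypothetical supervised data generated from y = 3x - 1 without noise.
train = [(k / 10, 3 * (k / 10) - 1) for k in range(-10, 11)]
theta = sgd_fit(train)
```

On this toy problem `theta` should end up close to `(3, -1)`. Real training would also track the loss on a held-out validation set rather than only on `train`.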
Many tasks — image and speech recognition, machine translation, robot control, recommendations — are effectively impossible to capture with hand-written rules because the rules are too complex, variable, or implicit even for experts. ML replaces manual rule engineering with pattern induction from large datasets.
A set of examples from which the model learns patterns. Quality, quantity, and representativeness of data are critical to model performance.
A parametric function f_θ mapping inputs to predictions. Can range from linear regression and decision trees to deep neural networks.
A scalar measure of discrepancy between model predictions and target outputs. Defines the optimization objective.
An algorithm that updates model parameters to minimize the loss function (e.g. SGD, Adam, AdamW, L-BFGS).
Partitioning of data into train, validation, and test splits with performance metrics (accuracy, F1, AUC, perplexity, etc.) used to assess generalization.
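As a rough sketch of the evaluation component, the snippet below shuffles a dataset into train, validation, and test splits and computes two of the metrics named above. The function names, the 70/15/15 proportions, and the fixed seed are assumptions chosen for illustration; time-ordered or grouped data would need a different splitting strategy.

```python
import random

def split(examples, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle once, then cut into disjoint train/validation/test lists."""
    items = list(examples)
    random.Random(seed).shuffle(items)
    n_test = round(len(items) * test_frac)
    n_val = round(len(items) * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

train_set, val_set, test_set = split(range(100))
```

The validation split is typically used for model selection and tuning, while the test split is kept aside for a final, one-time estimate of generalization.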
Information from the test set or from the future leaks into training (e.g. through bad splits, dataset-wide normalization, non-fold-aware target encoding). Produces artificially inflated metrics that collapse in production.
The model fits noise in the training data and loses generalization ability.
Production data deviates from the training distribution (covariate shift, label shift, concept drift), causing model degradation over time.
When one class dominates the data, the model can reach high accuracy by always predicting the majority class while ignoring the rare cases.
Optimizing a metric misaligned with the business objective (e.g. accuracy on imbalanced data, MSE when quantiles matter) yields models that score well but are useless in deployment.
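Two of the pitfalls above are easy to show on toy numbers. The sketch below, built on made-up values and hypothetical helper names (`fit_scaler`, `transform`), first computes normalization statistics from the training split only, which is one way to avoid the dataset-wide normalization leak, and then shows how a majority-class predictor scores high accuracy with zero recall on imbalanced labels.

```python
from statistics import mean, pstdev

def fit_scaler(train_values):
    """Compute normalization statistics from the training split only."""
    mu = mean(train_values)
    sigma = pstdev(train_values) or 1.0   # guard against zero variance
    return mu, sigma

def transform(values, scaler):
    """Apply previously fitted statistics to any split."""
    mu, sigma = scaler
    return [(v - mu) / sigma for v in values]

# Leakage-safe normalization: test values never influence the statistics.
train_x = [1.0, 2.0, 3.0, 4.0]
test_x = [10.0, 12.0]
scaler = fit_scaler(train_x)
train_z = transform(train_x, scaler)
test_z = transform(test_x, scaler)

# Class imbalance: 95 negatives, 5 positives, always predict the majority.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100
acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = (sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
          / sum(t == 1 for t in y_true))
```

Here accuracy comes out at 95% while recall on the rare class is 0, which is why accuracy alone is a poor objective on imbalanced data.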
In 1959, Arthur Samuel publishes work on a self-improving checkers-playing program at IBM, popularizing the concept of machine learning.
In 1986, Rumelhart, Hinton, and Williams popularize the backpropagation algorithm, enabling training of deeper neural networks.
In 1995, Cortes and Vapnik publish the support vector machine (SVM) paper; SVMs become one of the dominant ML methods of the 1990s and 2000s.
In 2001, Leo Breiman formalizes Random Forests, a versatile ensemble method that becomes a mainstay of classical ML.
In 2006, Hinton et al. show that deep networks can be trained effectively via layer-wise pretraining, opening the deep learning era.
Krizhevsky, Sutskever, and Hinton win ILSVRC 2012 by a large margin with a GPU-trained deep CNN, an inflection point for deep learning in computer vision.
In 2017, Vaswani et al. publish "Attention Is All You Need", introducing the Transformer architecture that becomes the foundation of modern ML in language and beyond.
In 2020, OpenAI releases GPT-3 (175B parameters), showing that sufficiently large language models exhibit few-shot learning abilities.
In late 2022, the release of ChatGPT moves ML from labs into daily use by hundreds of millions of people and triggers an industry-wide race around generative AI.
Coefficient controlling parameter update step size. Too large causes divergence; too small causes slow convergence.
Number of parameters, depth, or width of the model. Determines the trade-off between underfitting and overfitting.
L1/L2 coefficients, dropout, weight decay — counteract overfitting to training data.
Number of examples per gradient step. Affects training stability, generalization, and GPU memory utilization.
How many times the algorithm passes through the full training set. Too few leaves the model underfit; too many can lead to overfitting.
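The effect of the update step size can be seen on a one-dimensional toy objective. The sketch below runs fixed-step gradient descent on f(x) = x², an objective chosen purely for illustration; the function name and the three step sizes are assumptions, and the thresholds that count as "too large" or "too small" depend entirely on the problem.

```python
def gradient_descent(lr, steps=50, x0=1.0):
    """Minimize f(x) = x**2 (gradient 2x) with a fixed learning rate."""
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x        # each step multiplies x by (1 - 2 * lr)
    return x

too_small = gradient_descent(lr=0.001)   # barely moves from x0 in 50 steps
reasonable = gradient_descent(lr=0.1)    # shrinks steadily toward the minimum at 0
too_large = gradient_descent(lr=1.1)     # overshoots with growing magnitude
```

With these settings the small step size leaves x close to its starting value, the moderate one drives it near zero, and the large one makes |x| grow at every step, which is the divergence described above.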
Most modern ML, especially deep learning, relies on massive matrix multiplications that GPUs with tensor cores perform orders of magnitude faster than CPUs.
Google's TPUs are purpose-built for the tensor operations that dominate ML training and inference.
For classical ML (trees, regression, SVM, small networks) and lightweight inference, CPUs with SIMD/AVX instructions remain practical and common.
FPGAs are sometimes used for specialized low-latency inference (e.g. trading, edge) but are not mainstream.