Other

Machine learning (ML)

1959 · Active · Published: 11 June 2026 · Updated: 11 June 2026

Key innovation

Instead of being explicitly programmed, systems learn to perform tasks by fitting model parameters to data through optimization of an error or likelihood objective.

How it works

One specifies a parametric model f_θ, a loss function L measuring prediction quality on data, and an optimization algorithm (most commonly a variant of stochastic gradient descent) that searches for parameters θ minimizing L on the training set. Generalization — the model's ability to perform well on unseen data — is evaluated on validation and test sets and controlled through regularization, data augmentation, and capacity tuning. In supervised learning the data are (input, label) pairs; in unsupervised learning only inputs; in self-supervised learning labels are constructed automatically from the data itself (e.g. next-token prediction); in reinforcement learning an agent learns a policy that maximizes cumulative reward through interaction with an environment.
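The loop described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the one-dimensional linear model, the synthetic data (y = 3x - 1 plus noise), the learning rate, and the function names `predict`, `mse`, and `sgd` are all assumptions made for the example.

```python
import random

def predict(theta, x):
    """Parametric model f_theta(x) = w * x + b."""
    w, b = theta
    return w * x + b

def mse(theta, data):
    """Loss L: mean squared error over (input, label) pairs."""
    return sum((predict(theta, x) - y) ** 2 for x, y in data) / len(data)

def sgd(train, lr=0.05, epochs=200, seed=0):
    """Plain stochastic gradient descent, one example per update."""
    rng = random.Random(seed)
    examples = list(train)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(examples)
        for x, y in examples:
            residual = (w * x + b) - y
            # Gradient of the squared error (residual ** 2) w.r.t. w and b.
            w -= lr * 2 * residual * x
            b -= lr * 2 * residual
    return w, b

# Synthetic supervised data, invented for this sketch: y = 3x - 1 + noise.
data_rng = random.Random(42)
xs = [data_rng.uniform(-1, 1) for _ in range(200)]
data = [(x, 3 * x - 1 + data_rng.gauss(0, 0.1)) for x in xs]
train, test = data[:150], data[150:]  # held-out test set for generalization

theta = sgd(train)
print("learned (w, b):", theta)
print("train MSE:", mse(theta, train), "test MSE:", mse(theta, test))
```

Comparing the train and test error at the end is the simplest form of the generalization check: a test error far above the train error would suggest overfitting.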

Problem solved

Many tasks — image and speech recognition, machine translation, robot control, recommendations — are effectively impossible to capture with hand-written rules because the rules are too complex, variable, or implicit even for experts. ML replaces manual rule engineering with pattern induction from large datasets.

Components

Training data · Source of learning signal

A set of examples from which the model learns patterns. Quality, quantity, and representativeness of data are critical to model performance.

Model · Hypothesis learned from data

A function f_θ mapping inputs to predictions, with parameters θ learned from data. Models range from linear regression and decision trees to deep neural networks.

Loss function · Optimization criterion

A scalar measure of discrepancy between model predictions and target outputs. Defines the optimization objective.

Optimizer · Learning mechanism

An algorithm that updates model parameters to minimize the loss function (e.g. SGD, Adam, AdamW, L-BFGS).

Evaluation procedure · Generalization measurement

Partitioning of data into train, validation, and test splits with performance metrics (accuracy, F1, AUC, perplexity, etc.) used to assess generalization.
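As a rough sketch of such a procedure, the snippet below shuffles a dataset once, cuts it into disjoint train, validation, and test lists, and defines two of the metrics named above. The helper names (`split`, `accuracy`, `f1`) and the 70/15/15 proportions are assumptions for illustration. A random shuffle is also an assumption: time-ordered data would normally be split by time instead.

```python
import random

def split(examples, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle once, then cut into disjoint train / validation / test lists."""
    items = list(examples)
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_frac)
    n_val = int(len(items) * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

train, val, test = split(range(100))
print(len(train), len(val), len(test))
```

The validation list is used for tuning and early stopping. The test list is touched only once, for the final estimate.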

Implementation

Reference implementations

scikit-learn

Python · scikit-learn developers

Official

PyTorch

Python / C++ · PyTorch Foundation

Official

TensorFlow

Python / C++ · Google

Implementation pitfalls

Data leakage · Critical

Information from the test set or from the future leaks into training (e.g. through bad splits, dataset-wide normalization, or target encoding that is not fold-aware). This produces artificially inflated metrics that collapse in production.

Fix: Split the data first, then fit every transformation on the training portion only (e.g. inside a scikit-learn pipeline, so that cross-validation refits it per fold); verify that no feature encodes future information.
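The normalization case can be illustrated with a small standard-library sketch. The helpers `fit_scaler` and `transform` and the toy numbers are hypothetical; the point is only that the scaling statistics come from the training split and are then reused unchanged on the test split.

```python
from statistics import mean, pstdev

def fit_scaler(values):
    """Estimate standardization statistics from the given values only."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # guard against zero variance
    return mu, sigma

def transform(values, scaler):
    """Apply previously fitted statistics without re-estimating them."""
    mu, sigma = scaler
    return [(v - mu) / sigma for v in values]

# Toy split, invented for this sketch; the test point is an outlier.
train = [1.0, 2.0, 3.0, 4.0]
test = [100.0]

# Leaky: statistics computed on train + test let the test point shift the mean.
leaky_scaler = fit_scaler(train + test)

# Leak-free: fit on the training split, reuse the same statistics on test.
scaler = fit_scaler(train)
train_scaled = transform(train, scaler)
test_scaled = transform(test, scaler)

print("leaky mean:", leaky_scaler[0], "train-only mean:", scaler[0])
```

In the leaky variant the single test value moves the mean from 2.5 to 22, so the training features already carry information about the test set before any model is fit.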

Overfitting · High

The model fits noise in the training data and loses generalization ability.

Fix: Regularization (L1/L2, dropout, weight decay), early stopping on a validation set, data augmentation, reducing model capacity.
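Early stopping is the easiest of these to show in isolation. The helper below is a hypothetical sketch that takes a list of per-epoch validation losses and reports which epoch to keep and when training would stop; the `patience` and `min_delta` parameters are illustrative names, and a real training loop would evaluate the loss incrementally rather than from a precomputed list.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses.

    Training stops once the loss has failed to improve by more than
    min_delta for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    # Never triggered: training ran through every epoch.
    return best_epoch, len(val_losses) - 1

# Validation loss falls, then rises as the model starts fitting noise.
losses = [1.0, 0.8, 0.7, 0.72, 0.75, 0.9, 1.1]
best_epoch, stop_epoch = early_stopping_epoch(losses)
print(best_epoch, stop_epoch)
```

In practice the parameters saved at `best_epoch` are the ones restored and deployed, not the ones from the final epoch.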

Distribution shift · High

Production data deviates from the training distribution (covariate shift, label shift, concept drift), causing model degradation over time.

Fix: Monitor metrics and feature distributions in production, retrain regularly, detect drift, validate on fresh data.
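One common way to monitor a single numeric feature is the Population Stability Index (PSI), which compares binned proportions between a reference sample and a current sample. The sketch below is one possible implementation under stated assumptions: equal-width bins taken from the reference range, out-of-range values clipped into the edge bins, and an `eps` floor for empty bins. The thresholds sometimes quoted for PSI (around 0.1 and 0.25) are rules of thumb, not guarantees.

```python
import math

def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index of `current` against `reference`."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back if the reference is constant

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clip out-of-range values
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))

# Invented data: a uniform feature, and the same feature shifted by +3.
reference = [i / 100 for i in range(1000)]
shifted = [v + 3 for v in reference]

print("no shift:", psi(reference, reference))
print("shifted:", psi(reference, shifted))
```

A check like this covers covariate shift in one feature only. Label shift and concept drift need labeled production data or proxy metrics to detect.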

Class imbalance · High

When one class dominates the data, a model can reach high accuracy simply by predicting the majority class, while missing most or all of the rare cases.

Fix: Resampling (over/under, SMOTE), class-weighted loss, imbalance-aware metrics (F1, PR-AUC, minority-class recall).
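The gap between accuracy and minority-class recall, and one way to derive class weights, can be shown with a toy example. The 95/5 label split is invented, and `balanced_class_weights` is a hypothetical helper using the inverse-frequency heuristic n_samples / (n_classes * class_count); other weighting schemes are equally valid.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

def recall(y_true, y_pred, positive):
    """Share of actual positives that the predictions recover."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    actual = sum(1 for t in y_true if t == positive)
    return hits / actual if actual else 0.0

# Invented labels: 95 negatives, 5 positives, and a model that always says 0.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
minority_recall = recall(y_true, y_pred, positive=1)
weights = balanced_class_weights(y_true)

print("accuracy:", accuracy, "minority recall:", minority_recall)
print("class weights:", weights)
```

Multiplying each example's loss by its class weight makes an error on the rare class cost roughly as much in total as errors on the common class.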

Wrong evaluation metrics · Medium

Optimizing a metric misaligned with the business objective (e.g. accuracy on imbalanced data, MSE when quantiles matter) yields models that score well but are useless in deployment.

Fix: Choose metrics driven by the cost of errors in the application; analyze confusion matrix and calibration; use task-specific metrics.
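The "MSE when quantiles matter" case can be made concrete with the pinball (quantile) loss. In the sketch below the three target values and the two constant forecasts are made up; they are chosen so that MSE and the 0.9-quantile pinball loss rank the forecasts in opposite order.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

def pinball_loss(y_true, y_pred, q):
    """Quantile loss: under-prediction costs q, over-prediction costs 1 - q."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        total += q * diff if diff >= 0 else (q - 1) * diff
    return total / len(y_true)

# Invented targets with one large value, e.g. demand with an occasional spike.
y_true = [10.0, 12.0, 50.0]
forecast_mean = [24.0, 24.0, 24.0]  # constant forecast at the sample mean
forecast_high = [50.0, 50.0, 50.0]  # conservative forecast covering the spike

print("MSE:", mse(y_true, forecast_mean), mse(y_true, forecast_high))
print("pinball q=0.9:",
      pinball_loss(y_true, forecast_mean, 0.9),
      pinball_loss(y_true, forecast_high, 0.9))
```

MSE prefers the mean forecast, while the pinball loss at q = 0.9 prefers the conservative one. If under-prediction is the expensive error in the application, selecting models by MSE would pick the wrong forecast.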

Evolution

Original paper · 1959 · IBM Journal of Research and Development · Arthur L. Samuel

Some Studies in Machine Learning Using the Game of Checkers

1959

Samuel coins the term "machine learning"

Inflection point

Arthur Samuel publishes work on a self-improving checkers-playing program at IBM, popularizing the concept of machine learning.

1986

Backpropagation in neural networks

Inflection point

Rumelhart, Hinton, and Williams popularize the backpropagation algorithm, making it practical to train multi-layer neural networks.

1995

Support Vector Machines (SVM)

Cortes and Vapnik publish the SVM paper, which becomes one of the dominant ML methods of the 1990s and 2000s.

2001

Random Forests

Leo Breiman formalizes Random Forests, a versatile ensemble of decision trees that becomes one of the most widely used classical ML methods.

2006

Deep learning renaissance

Hinton et al. show that deep networks can be trained effectively via layer-wise pretraining, opening the deep learning era.

2012

AlexNet wins ImageNet

Inflection point

Krizhevsky, Sutskever, and Hinton win ILSVRC 2012 by a large margin with a GPU-trained deep CNN — an inflection point for deep learning in computer vision.

2017

Transformer architecture

Inflection point

Vaswani et al. publish "Attention Is All You Need", introducing the Transformer architecture that becomes the foundation of modern ML in language and beyond.

2020

Scaling language models: GPT-3

OpenAI releases GPT-3 (175B parameters), showing that sufficiently large language models exhibit few-shot learning abilities.

2022

ChatGPT and mass adoption

Inflection point

The release of ChatGPT moves ML from labs into daily use by hundreds of millions of people and triggers an industry-wide race around generative AI.