Self-supervised learning (SSL) defines a pretext task whose labels are derived from the input data itself. Common families: (1) generative / predictive: part of the input is hidden and the model learns to reconstruct or predict it (BERT, GPT, MAE); (2) contrastive: two augmented views of the same sample are pulled toward similar representations while views of different samples are pushed apart (SimCLR, MoCo); (3) self-distillation: a student network learns to match the outputs of a teacher network without labels (BYOL, DINO). After pretraining, the representations are transferred to downstream tasks via fine-tuning, linear probing, or prompting.
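To make "the label is a function of the input" concrete, here is a minimal sketch of how two predictive pretext tasks derive their targets from raw tokens. The function names, the `[MASK]` symbol handling, and the example sentence are illustrative assumptions; real recipes (BERT's masking procedure, for instance) add further details that are omitted here.

```python
import random

def next_token_pairs(tokens):
    """Next-token prediction: the target at each position is the following token."""
    return tokens[:-1], tokens[1:]

def mask_tokens(tokens, mask_ratio=0.15, mask_symbol="[MASK]", seed=0):
    """Masked prediction: hide a random subset and keep the originals as targets.

    Simplified on purpose: every selected token is replaced by mask_symbol.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_ratio))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    corrupted = list(tokens)
    targets = {}
    for p in positions:
        targets[p] = tokens[p]      # the label comes from the input itself
        corrupted[p] = mask_symbol  # the model only sees the corrupted version
    return corrupted, targets

tokens = "the cat sat on the mat".split()
inputs, labels = next_token_pairs(tokens)
corrupted, targets = mask_tokens(tokens, mask_ratio=0.3)
```

In both cases no human annotation is involved: the targets are positions of the original sequence that the model is not allowed to see.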
Classical supervised learning requires huge amounts of manually labeled data, which is expensive and does not scale to every domain. SSL leverages effectively unlimited unlabeled data (web text, video, images, sensor streams) to learn general-purpose representations.
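Without sufficient negatives (in contrastive methods) or an asymmetry such as a stop-gradient (in self-distillation methods), the model may collapse to a constant representation that trivially satisfies the objective.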
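The collapse failure mode can be illustrated with a small sketch. It assumes a positive-only objective (mean of 1 minus cosine similarity between paired views) and uses hand-made embeddings instead of a real encoder; `positive_only_loss` and `mean_feature_std` are hypothetical helper names. A constant encoder reaches the minimum loss, and the per-dimension standard deviation across the batch is one simple diagnostic for spotting this.

```python
import math
import random

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def positive_only_loss(views_a, views_b):
    """Mean (1 - cosine similarity) over positive pairs; no negatives involved."""
    return sum(1.0 - cosine(a, b) for a, b in zip(views_a, views_b)) / len(views_a)

def mean_feature_std(embeddings):
    """Average per-dimension standard deviation across the batch (0 means collapse)."""
    n, d = len(embeddings), len(embeddings[0])
    total = 0.0
    for j in range(d):
        col = [e[j] for e in embeddings]
        mu = sum(col) / n
        total += math.sqrt(sum((x - mu) ** 2 for x in col) / n)
    return total / d

rng = random.Random(0)
batch = 8
# A collapsed encoder (simulated): every input maps to the same vector.
constant = [1.0, -2.0, 0.5]
collapsed_a = [list(constant) for _ in range(batch)]
collapsed_b = [list(constant) for _ in range(batch)]
# A healthy encoder (simulated): embeddings differ from sample to sample.
healthy = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(batch)]

collapsed_loss = positive_only_loss(collapsed_a, collapsed_b)  # essentially zero
collapsed_std = mean_feature_std(collapsed_a)                  # zero spread
healthy_std = mean_feature_std(healthy)                        # clearly positive
```

The collapsed encoder gets a near-perfect loss while carrying no information about the input, which is why negatives or a stop-gradient are needed to rule this solution out.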
In contrastive SSL, the choice of augmentations (cropping, color jitter) can affect results as much as, or more than, the choice of architecture.
Web-scraped data may contain evaluation benchmarks or unwanted content, which contaminates evaluation and can compromise model safety.
Mikolov et al. show that context prediction (CBOW / Skip-gram) on unlabeled text yields useful word representations.
Doersch et al. publish one of the first widely cited SSL works in vision: predicting the relative position of image patches as a pretext task.
Devlin et al. introduce MLM + NSP as a universal Transformer pretraining recipe; SSL becomes the dominant NLP paradigm.
Chen et al. (SimCLR) and He et al. (MoCo) demonstrate that contrastive learning on augmented image views yields representations that match or approach those from supervised ImageNet pretraining.
A position piece by Yann LeCun and Ishan Misra framing SSL as the foundation of general intelligence.
He et al. show that simply masking 75% of image patches and reconstructing pixels yields strong visual representations — the vision analog of BERT.
Meta releases DINOv2: SSL on 142M images yields general-purpose representations competitive with specialized supervised models.
Type of pretext task: masked language modeling, next-token prediction, contrastive, self-distillation, masked image modeling.
Fraction of tokens / patches hidden in the masking task.
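As a rough illustration of the mask ratio, the sketch below randomly splits image patches into visible and masked sets. The 224x224 image size and 16x16 patch size are assumed example values, and `random_patch_mask` is a hypothetical helper, not the API of any particular library.

```python
import random

def random_patch_mask(num_patches, mask_ratio, seed=0):
    """Randomly choose which patch indices are hidden from the model."""
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    order = list(range(num_patches))
    rng.shuffle(order)
    masked = sorted(order[:num_masked])   # to be reconstructed
    visible = sorted(order[num_masked:])  # what the model actually sees
    return visible, masked

# Assumed setup: a 224x224 image cut into 16x16 patches gives a 14x14 grid.
num_patches = (224 // 16) ** 2
visible, masked = random_patch_mask(num_patches, mask_ratio=0.75)
```

With a ratio of 0.75 only a quarter of the patches remain visible, so the ratio controls both the difficulty of the pretext task and how much of the input has to be processed.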
Strategy for generating multiple views of a sample — critical for contrastive methods.
In contrastive methods, large batches provide more negative examples and substantially affect quality.
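The link between batch size and the number of negatives can be seen in a minimal NT-Xent-style (InfoNCE) loss, sketched here in plain Python. The temperature value and the toy 2-D embeddings are assumptions chosen for illustration; a real implementation would operate on encoder outputs with a tensor library.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def nt_xent(z_a, z_b, temperature=0.5):
    """NT-Xent-style loss for N positive pairs (z_a[i], z_b[i]).

    Each of the 2N embeddings treats its partner as the positive and the
    remaining 2N - 2 embeddings in the batch as negatives.
    """
    z = [normalize(v) for v in z_a + z_b]
    n = len(z_a)
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the other view of the same sample
        logits = [
            sum(a * b for a, b in zip(z[i], z[k])) / temperature
            for k in range(2 * n) if k != i
        ]
        pos_logit = sum(a * b for a, b in zip(z[i], z[pos])) / temperature
        log_denominator = math.log(sum(math.exp(l) for l in logits))
        total += log_denominator - pos_logit
    return total / (2 * n)

def negatives_per_anchor(batch_size):
    """Number of in-batch negatives each embedding is contrasted against."""
    return 2 * batch_size - 2

# Toy embeddings (assumed): views of the same sample point in similar directions.
aligned_a = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned_b = [[0.9, 0.1], [0.1, 0.9], [-0.9, -0.1]]
# Breaking the pairing makes the "positives" dissimilar, so the loss goes up.
shuffled_b = [aligned_b[1], aligned_b[2], aligned_b[0]]

loss_aligned = nt_xent(aligned_a, aligned_b)
loss_shuffled = nt_xent(aligned_a, shuffled_b)
```

In this formulation a batch of 256 samples gives each anchor 510 negatives, while a batch of 4096 gives 8190, which is one reason batch size matters so much for this family of methods.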
The SSL paradigm itself does not dictate the execution mode: it can be applied to dense (BERT, GPT) or sparse mixture-of-experts (MoE) models. Pretraining is typically done in dense mode.
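SSL pretraining scales massively via data-parallel and model-parallel approaches. For masked and next-token objectives the loss is local (per-token / per-sample), so computation is highly parallelizable; contrastive objectives additionally need embeddings from the rest of the batch to serve as negatives.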
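Why a per-sample loss parallelizes so easily can be shown with a small simulation. It assumes a toy one-parameter model with squared error, equal-size shards, and a batch size divisible by the number of workers; `data_parallel_grad` is a hypothetical helper that mimics averaging gradients across workers (the role an all-reduce plays in practice).

```python
def per_sample_grad(w, x, y):
    """Gradient of the squared error (w * x - y) ** 2 with respect to w."""
    return 2.0 * (w * x - y) * x

def mean_grad(w, samples):
    """Gradient of the mean loss over a set of samples."""
    return sum(per_sample_grad(w, x, y) for x, y in samples) / len(samples)

def data_parallel_grad(w, samples, num_workers):
    """Each simulated worker averages its own shard; the results are then averaged."""
    shard_size = len(samples) // num_workers  # assumes an even split
    shards = [samples[i * shard_size:(i + 1) * shard_size] for i in range(num_workers)]
    return sum(mean_grad(w, shard) for shard in shards) / num_workers

samples = [(float(i), 3.0 * i + 1.0) for i in range(8)]
w = 0.5
full = mean_grad(w, samples)                             # single-device gradient
sharded = data_parallel_grad(w, samples, num_workers=4)  # simulated 4 workers
```

Because the total loss is a mean of independent per-sample terms, the sharded gradient equals the full-batch gradient, so workers only need to exchange gradients, not data.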
SSL pretraining requires massive fp16/bf16 FLOPs — GPUs with Tensor Cores (A100/H100) are the standard.
TPU v4/v5 are used by Google for large-scale SSL pretraining (PaLM, Gemini).