Training

Self-Supervised Learning (SSL)

Status: Active. Published: 9 June 2026. Updated: 9 June 2026.

Key innovation

Learning representations from raw, unlabeled data by deriving the supervisory signal from the data itself, so no manual labels are required.

How it works

SSL defines a pretext task whose labels are computed from the input data itself. Common families:

(1) Generative / predictive: part of the input is hidden or withheld and the model learns to predict it, for example masked tokens (BERT), the next token (GPT), or masked image patches (MAE).
(2) Contrastive: two augmented views of the same sample are pulled toward similar representations while views of different samples are pushed apart (SimCLR, MoCo).
(3) Self-distillation: a student network learns to match the output of a teacher network, typically a moving average of the student, without labels (BYOL, DINO).

After pretraining, the representations are transferred to downstream tasks via fine-tuning, linear probing, or prompting.
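The two most common pretext tasks can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used by any of the cited models: `mask_tokens`, `info_nce_loss`, the `[MASK]` placeholder, and the default ratio and temperature are all names and values assumed for this example. The first function shows how a masked-prediction label is read off the input itself; the second shows an InfoNCE-style contrastive loss of the kind used in contrastive methods.

```python
import math
import random

MASK = "[MASK]"  # hypothetical placeholder token for this sketch


def mask_tokens(tokens, mask_ratio=0.15, seed=0):
    """Hide a random subset of tokens; the hidden tokens become the labels."""
    rng = random.Random(seed)
    n_masked = max(1, round(len(tokens) * mask_ratio))
    positions = sorted(rng.sample(range(len(tokens)), n_masked))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]  # the label comes from the input itself
        masked[pos] = MASK
    return masked, targets


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """Cross-entropy of picking the positive out of positive + negatives."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(
        sum(math.exp(value - max_logit) for value in logits)
    )
    return log_sum_exp - logits[0]
```

The loss is small when the anchor is closer to its positive view than to any negative, and grows when a negative is the closest vector. Real systems compute the same quantity over whole batches on an accelerator rather than one anchor at a time.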

Problem solved

Classical supervised learning requires large amounts of manually labeled data, which is expensive to produce and unavailable in many domains. SSL leverages effectively unlimited unlabeled data (web text, video, images, sensor streams) to learn general-purpose representations.

Implementation

Reference implementations

PyTorch Lightning Bolts — SSL

Implementation pitfalls

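Representational collapse (severity: High)

Without enough negatives (contrastive methods) or an asymmetry such as a stop-gradient (negative-free methods), the encoder can collapse to a constant representation that trivially satisfies the objective.

Fix: Large batches with many negatives (SimCLR), a momentum encoder with a queue of negatives (MoCo), or a predictor head plus stop-gradient (BYOL/SimSiam).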
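As a rough illustration of the momentum-encoder idea, the sketch below updates a teacher as an exponential moving average of the student and adds a simple collapse check. The function names, the flat-list weight representation, and the `threshold` value are assumptions made for this example; `looks_collapsed` is a hypothetical diagnostic, not something prescribed by the cited papers.

```python
import math


def ema_update(teacher, student, momentum=0.99):
    """Move teacher weights slowly toward the student.

    The teacher is never trained by gradient descent; it only tracks the
    student, which is what the stop-gradient expresses in practice.
    """
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


def per_dimension_std(embeddings):
    """Standard deviation of each embedding dimension across a batch."""
    n = len(embeddings)
    dims = len(embeddings[0])
    stds = []
    for d in range(dims):
        column = [row[d] for row in embeddings]
        mean = sum(column) / n
        stds.append(math.sqrt(sum((x - mean) ** 2 for x in column) / n))
    return stds


def looks_collapsed(embeddings, threshold=1e-4):
    """Hypothetical check: flag a batch in which every dimension barely varies."""
    return all(s < threshold for s in per_dimension_std(embeddings))
```

Logging the per-dimension standard deviation of a batch of embeddings during pretraining is one inexpensive way to notice collapse early, since a collapsed encoder maps every input to nearly the same vector.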

Augmentations are quality-critical (severity: Medium)

In contrastive SSL the choice of augmentations (cropping, color jitter) can affect results as much as, or more than, the choice of architecture.

Fix: Stick to vetted augmentation pipelines from reference papers (SimCLR, DINO).
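To make the role of augmentations concrete, here is a toy sketch of how two views of one image might be produced for a contrastive objective. It works on a grayscale image stored as a nested list and uses only random cropping and horizontal flipping; `two_views` and its helpers are hypothetical names, and the vetted pipelines from the reference papers add further steps such as color jitter.

```python
import random


def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random position."""
    top = rng.randint(0, len(image) - crop_h)
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


def horizontal_flip(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def two_views(image, crop_h, crop_w, seed=0):
    """Return two independently augmented views of the same image."""
    rng = random.Random(seed)
    views = []
    for _ in range(2):
        view = random_crop(image, crop_h, crop_w, rng)
        if rng.random() < 0.5:  # flip half of the time
            view = horizontal_flip(view)
        views.append(view)
    return views
```

The two returned views form a positive pair; views drawn from other images in the batch act as negatives. If the augmentations are too weak the pretext task becomes trivial, which is one reason published pipelines are a safer starting point than ad hoc ones.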

Pretraining data contamination (severity: High)

Web-scraped data may contain evaluation benchmarks or unwanted content, which inflates evaluation scores and can compromise model safety.

Fix: Decontamination filters, deduplication, blocklists, and reproducible data manifests.
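A minimal sketch of two of these fixes, exact deduplication and n-gram decontamination, might look like the following. The function names, the whitespace-and-lowercase normalization, and the default n-gram length are assumptions chosen for illustration; production pipelines typically add near-duplicate detection and more careful tokenization.

```python
import hashlib


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def ngrams(text, n):
    """Set of word n-grams in the normalized text."""
    words = normalize(text).split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def filter_corpus(documents, benchmark_texts, n=8):
    """Drop exact duplicates and documents that overlap a benchmark item."""
    benchmark_ngrams = set()
    for text in benchmark_texts:
        benchmark_ngrams |= ngrams(text, n)

    seen_hashes = set()
    kept = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after normalization
        seen_hashes.add(digest)
        if ngrams(doc, n) & benchmark_ngrams:
            continue  # shares at least one n-gram with a benchmark item
        kept.append(doc)
    return kept
```

Writing the kept document hashes to a manifest file alongside the filter settings is one way to make the resulting dataset reproducible.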

Evolution

2013

Word2Vec — distributional embeddings from unlabeled text

Inflection point

Mikolov et al. show that context prediction (CBOW / Skip-gram) on unlabeled text yields useful word representations.

Word2Vec (concept)

2015

Context Prediction in vision (Doersch et al.)

One of the first widely cited SSL works in vision: predicting the relative position of image patches as a pretext task.

2018

BERT — Masked Language Modeling becomes the NLP pretraining standard

Inflection point

Devlin et al. introduce MLM + NSP as a universal Transformer pretraining recipe; SSL becomes the dominant NLP paradigm.

BERT (concept)

2020

SimCLR / MoCo — contrastive SSL for vision

Chen et al. (SimCLR) and He et al. (MoCo) demonstrate that contrastive learning on augmented image views yields representations that approach or match supervised pretraining under ImageNet linear evaluation.

2021

"Self-Supervised Learning: The Dark Matter of Intelligence" (LeCun, Misra)

A position piece by Yann LeCun and Ishan Misra arguing that SSL is one of the most promising routes to giving AI systems background knowledge and a form of common sense.

Self-Supervised Learning: The Dark Matter of Intelligence (blog post)

2021

MAE — Masked Autoencoders for vision

He et al. show that simply masking 75% of image patches and reconstructing pixels yields strong visual representations — the vision analog of BERT.

Masked Autoencoders Are Scalable Vision Learners (paper)

2023

DINOv2 — general-purpose visual SSL features

Meta releases DINOv2: SSL on 142M images yields general-purpose representations competitive with specialized supervised models.

DINOv2: Learning Robust Visual Features without Supervision (paper)

Sources

Self-Supervised Learning: The Dark Matter of Intelligence (Blog, Meta AI)

A Survey on Self-Supervised Learning: Algorithms, Applications, and Future Trends (Paper, arXiv)

A Simple Framework for Contrastive Learning of Visual Representations (SimCLR) (Paper, arXiv)

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Paper, arXiv)

Masked Autoencoders Are Scalable Vision Learners (MAE) (Paper, arXiv)

