Training a model on massive unlabeled corpora using self-supervised objectives (e.g., next-token prediction, masked language modeling) to learn general-purpose representations before fine-tuning for specific tasks.
The model receives an input fragment with partially hidden or shifted information (next-token prediction in GPT, masked language modeling in BERT, contrastive learning in CLIP, next-frame prediction in world models). A loss function measures reconstruction/prediction quality. Training runs on GPU/TPU clusters for weeks or months over trillions of tokens. The pretrained model becomes a foundation that can be further fine-tuned, instruction-tuned, RLHF-aligned, or LoRA-adapted for specific applications.
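As a concrete illustration, here is a minimal sketch of the next-token prediction objective in PyTorch; the model is assumed to map token IDs to per-position vocabulary logits (names and shapes are illustrative, not taken from any specific codebase):

```python
import torch
import torch.nn.functional as F

def next_token_loss(model, tokens):
    """Causal LM objective: predict token t+1 from tokens <= t.

    tokens: LongTensor of shape (batch, seq_len).
    model:  assumed to map token IDs to logits of shape
            (batch, seq_len - 1, vocab_size) -- an assumption of this sketch.
    """
    inputs, targets = tokens[:, :-1], tokens[:, 1:]   # shift by one position
    logits = model(inputs)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),          # flatten batch and time
        targets.reshape(-1),
    )
```

Masked language modeling, contrastive, and next-frame objectives follow the same pattern: corrupt or withhold part of the raw data and train the model to recover it.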
Traditional supervised learning required enormous hand-labeled datasets per task, which did not scale. Self-supervised pretraining solves this by learning from raw unlabeled data — practically unlimited in supply — and transferring that knowledge to many downstream tasks with minimal supervised fine-tuning.
Raw data corpus — source of data for self-supervised training
Massive unlabeled dataset (web crawl, code, books, video, robot telemetry). Typical scale: 10¹²–10¹³ tokens for LLMs.
Loss function without human labels
Predictive task that uses the data structure as the training signal — next-token prediction, masked language modeling, contrastive loss, next-frame prediction.
Backbone holding the representations learned during pretraining
Most often a Transformer (encoder-only, decoder-only, or encoder-decoder); also Diffusion Models in generative video and images.
Training infrastructure
Thousands of GPUs/TPUs running in parallel for weeks/months. Pretraining a GPT-4-class LLM typically requires 10²⁵+ FLOPs.
Fully parallel
Pretraining parallelizes fully across data, tensor, and pipeline parallelism combined. Gradient synchronization is the main bottleneck on very large clusters.
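A minimal data-parallel sketch using PyTorch DistributedDataParallel, assuming a `torchrun` launch with one process per GPU; tensor and pipeline parallelism require model-specific sharding and are omitted here:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Data-parallel skeleton; assumes a torchrun launch, which sets the
# environment variables that init_process_group reads. Sketch only.
dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(4096, 4096).cuda()   # stand-in for a Transformer
model = DDP(model, device_ids=[local_rank])  # all-reduces gradients each step;
                                             # this sync is the scaling bottleneck
```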
Dense (all paths active)
In standard dense pretraining, all parameters are updated at every step. MoE variants introduce sparse activation, so each token updates only its routed experts plus the shared layers.
Corpus size (tokens)
Number of tokens in the training corpus. Scale: 10⁹ (small models) to 10¹³+ (frontier LLMs).
Model size (parameters)
Number of model parameters. Chinchilla scaling laws suggest an optimal tokens-to-params ratio of about 20:1.
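Worked arithmetic for the 20:1 rule (figures illustrative):

```python
# Chinchilla-style compute-optimal token budget: ~20 tokens per parameter.
params = 70e9                  # e.g. a 70B-parameter model
tokens = 20 * params           # compute-optimal corpus size
print(f"{tokens:.2e} tokens")  # 1.40e+12, i.e. about 1.4T tokens
```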
Self-supervised objective type
Choice of task: causal LM (GPT), masked LM (BERT), contrastive (CLIP), denoising (T5), next-frame (world models).
Compute budget (FLOPs)
Total floating-point operations used in training. GPT-3 ≈ 3·10²³; GPT-4 is estimated at ~2·10²⁵; frontier runs (2025+) are estimated around 10²⁶.
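These budgets follow from the standard approximation C ≈ 6·N·D training FLOPs for a dense Transformer with N parameters trained on D tokens; for example, GPT-3:

```python
# C ~ 6 * N * D: roughly 2ND FLOPs forward plus 4ND backward per token.
N = 175e9              # GPT-3 parameters
D = 300e9              # GPT-3 training tokens
C = 6 * N * D
print(f"{C:.2e} FLOPs")  # 3.15e+23, matching the figure quoted above
```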
Data quality filtering
Deduplication, quality classification, and toxicity filtering pipeline. Determines the effective fraction of useful tokens in the corpus.
Benchmark data (MMLU, HellaSwag) leaking into the pretraining corpus artificially inflates evaluation scores.
Decontamination pipeline — remove benchmark n-grams from the training corpus and evaluate on fresh datasets (held-out, post-training).
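A minimal sketch of n-gram decontamination; the 13-gram window follows common practice (GPT-3 used 13-grams), but the matching policy here is a simplification:

```python
def ngrams(text, n=13):
    """Lowercased word n-grams of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(document, benchmark_ngrams, n=13):
    """Flag a training document sharing any n-gram with a benchmark item."""
    return not ngrams(document, n).isdisjoint(benchmark_ngrams)

# Usage: build the reference set once, then filter the corpus.
benchmark_items = ["...benchmark questions and answers..."]  # e.g. MMLU items
benchmark_ngrams = set()
for item in benchmark_items:
    benchmark_ngrams |= ngrams(item)
corpus = [d for d in ["...web document..."]
          if not is_contaminated(d, benchmark_ngrams)]
```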
With large learning rates and fp16, training loss can spike and corrupt weights. Recovery requires rolling back to a checkpoint that may be days old.
Mixed precision (bfloat16), gradient clipping, learning rate warmup, frequent checkpointing, gradient-statistics monitoring.
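A sketch of how these measures combine in a training step; the model, data, and all hyperparameters are stand-ins:

```python
import torch

# Illustrative stability measures: bf16 weights, LR warmup, gradient
# clipping, and periodic checkpointing. All hyperparameters are examples.
model = torch.nn.Linear(1024, 1024, dtype=torch.bfloat16)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
loader = [torch.randn(8, 1024, dtype=torch.bfloat16) for _ in range(10)]

def lr_at(step, base_lr=3e-4, warmup=2000):
    # Linear warmup; real runs typically follow this with cosine decay.
    return base_lr * min(1.0, step / warmup)

for step, batch in enumerate(loader, start=1):
    for group in opt.param_groups:
        group["lr"] = lr_at(step)
    loss = model(batch).float().pow(2).mean()   # stand-in loss, computed in fp32
    loss.backward()
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    opt.step()
    opt.zero_grad()
    if step % 1000 == 0:                        # frequent checkpoints
        torch.save(model.state_dict(), f"ckpt_{step}.pt")
```

Monitoring `grad_norm` over time is the gradient-statistics check mentioned above: a sudden jump often precedes a loss spike.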
Training too large a model on too little data (pre-Chinchilla) wastes compute and underperforms a smaller model on a larger corpus.
Apply Chinchilla scaling laws (~20 tokens/param) or newer ones (Llama 3 trained at >100 tokens/param for inference efficiency).
Raw web crawl contains duplicates, spam, low-quality, and toxic content. Without filtering, the result is a model weaker than one trained on a 10× smaller clean corpus.
Deduplication pipeline (MinHash, exact match), quality classification (FastText, Wikipedia-style classifier), toxicity filtering.
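A self-contained MinHash sketch for near-duplicate detection; signature size, shingle length, and threshold are illustrative, and production pipelines add LSH banding for scalability:

```python
import hashlib

def minhash(text, num_perm=64, shingle=5):
    """MinHash signature: the minimum hash per seeded 'permutation'."""
    words = text.lower().split()
    shingles = {" ".join(words[i:i + shingle])
                for i in range(len(words) - shingle + 1)} or {text.lower()}
    signature = []
    for seed in range(num_perm):
        signature.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles
        ))
    return signature

def jaccard_estimate(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Documents whose estimated similarity exceeds a threshold (e.g. 0.8)
# are treated as near-duplicates, and all but one copy is dropped.
```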
GENESIS · Source paper: Improving Language Understanding by Generative Pre-Training
Word2Vec — pretraining of word embeddings
Breakthrough: Mikolov et al. show that self-supervised pretraining (skip-gram, CBOW) yields general-purpose word representations.
GPT-1 and BERT — pretraining + fine-tuning as a paradigm
Breakthrough: OpenAI GPT (causal LM) and Google BERT (masked LM) establish the standard recipe: large-scale pretraining followed by small task-specific fine-tuning.
GPT-3 — pretraining produces models capable of in-context learning
Breakthrough: 175B parameters trained on 300B tokens demonstrate that pretrained knowledge alone solves many tasks without fine-tuning.
CLIP — multimodal contrastive pretraining
OpenAI CLIP unifies images and text in one embedding space via contrastive pretraining on 400M image-text pairs.
Chinchilla — optimal tokens-to-parameters ratio
Breakthrough: DeepMind shows that prior LLMs were undertrained — compute-optimal training needs roughly 20 tokens per parameter.
Llama 2 — frontier-scale open-weight pretraining
Meta releases weights of a model trained on 2T tokens, democratizing access to large pretrained models.
Robotics foundation models — pretraining for VLA
Breakthrough: Pi-Zero (Physical Intelligence), Gemini Robotics, and RT-2 apply pretraining on multimodal + robot data as the foundation of VLAs.
Frontier-scale pretraining — 10²⁶ FLOPs
GPT-5, Gemini 3, Claude Opus 4, and Grok 4 reach scales requiring clusters of 100k+ H100/B200 GPUs.
LLM pretraining is the dominant workload for H100/B200/GB200 GPUs — fp16/bf16/fp8 GEMM ops are their primary design target.
Google TPU v4/v5/Trillium are designed around pretraining Gemini and earlier models — high systolic-array throughput and Inter-Chip Interconnect (ICI).
CPUs can train small R&D models, but frontier-scale pretraining is infeasible on CPUs due to limited tensor-ops throughput.
Transformer is a neural network architecture proposed by Vaswani et al. in "Attention Is All You Need" (NeurIPS 2017). It replaced earlier approaches based on recurrent (RNN, LSTM) and convolutional (CNN) networks in sequential tasks. The key element is the multi-head self-attention mechanism, which allows every position in a sequence to directly participate in computations involving every other position, enabling the model to learn long-range dependencies in a constant number of sequential steps (rather than the linear number required by RNNs). The architecture consists of encoder and decoder blocks (or encoder-only / decoder-only variants) containing: multi-head attention layers, feed-forward networks, residual connections, and layer normalization (LayerNorm). Sequence positions are encoded via positional encoding (sinusoidal or learned). The Transformer has become the foundation of LLMs (GPT, BERT, T5, LLaMA, Claude, Gemini), Vision Transformers (ViT), multimodal models (CLIP, Flamingo), and tabular foundation models (TabPFN). The main limitation — quadratic attention complexity with respect to sequence length (O(n²)) — is an active research direction (FlashAttention, sliding window, linear attention, SSM).
A Large Language Model (LLM) is a class of machine learning models based on the Transformer architecture, trained on large text datasets via autoregressive language modeling (next-token prediction). These models have billions of parameters and can generate coherent text, answer questions, write code, translate languages, and perform many other language-cognitive tasks without task-specific fine-tuning. The term covers models such as GPT, LLaMA, Gemini, Claude, and Mistral. Most modern LLMs are instruction-tuned (SFT + RLHF) after the pre-training phase.
Instruction Tuning (also called instruction fine-tuning or supervised fine-tuning, SFT) is a post-pretraining technique for language models. A pretrained model is fine-tuned on a curated dataset of examples, where each example consists of a natural language instruction describing a task, an optional input context, and the expected output. The training objective is standard supervised learning: cross-entropy loss over the target output tokens, with loss masked on the instruction/input portions. The key finding, established by Wei et al. (2021) in the FLAN paper, is that training on a sufficiently large and diverse set of instruction-formatted tasks improves zero-shot generalization to unseen task types. This generalization scales with the number of task clusters and the model size. Instruction Tuning is distinct from RLHF (Reinforcement Learning from Human Feedback): it uses only supervised learning on demonstration data, without a reward model or RL optimization. In practice, instruction tuning is often the first stage in a post-training pipeline, followed optionally by RLHF or direct preference optimization (DPO). Common dataset formats include the Alpaca three-field format (instruction, input, output) and the multi-turn conversation format used in chat models.
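A minimal sketch of the loss masking described above, using PyTorch's convention that label -100 is ignored by cross-entropy; the prompt/response boundary is assumed known:

```python
import torch
import torch.nn.functional as F

def sft_labels(input_ids, prompt_len):
    """Mask instruction/input tokens so loss covers only the response."""
    labels = input_ids.clone()
    labels[:, :prompt_len] = -100        # the ignore_index of F.cross_entropy
    return labels

def sft_loss(logits, labels):
    # Shift so position t predicts token t+1, as in causal LM training.
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )
```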
Reinforcement Learning from Human Feedback (RLHF) is a multi-stage training pipeline used to align language models and other AI systems with human preferences and intent. The approach was formally introduced for deep RL in Christiano et al. (2017), and scaled to large language models in Ouyang et al. (2022) (InstructGPT), where it became the primary alignment technique for systems such as ChatGPT, Claude, and Gemini. The standard RLHF pipeline for LLMs consists of three sequential stages:

1. Supervised Fine-Tuning (SFT): A pretrained language model is fine-tuned on a curated dataset of high-quality (prompt, response) pairs produced by human annotators, yielding a base aligned policy π_SFT.

2. Reward Model Training: Human annotators compare pairs of model responses to the same prompt and express preferences (which response is better). These pairwise comparisons are used to train a scalar reward model r_φ(x, y), typically using a Bradley-Terry model as the preference objective: loss = -E[log σ(r(x, y_w) - r(x, y_l))], where y_w is the preferred and y_l the rejected response.

3. RL Fine-Tuning via PPO: The SFT-initialized policy π_θ is optimized with Proximal Policy Optimization (PPO) to maximize the reward from r_φ, subject to a KL divergence penalty that prevents the policy from drifting too far from π_SFT: Objective(x, y) = r_φ(x, y) − β · KL(π_θ(y|x) || π_SFT(y|x)). The KL penalty with coefficient β is critical to prevent reward hacking.

During PPO training, four models are needed simultaneously: the active policy, a frozen reference policy (π_SFT), the reward model, and a value/critic network. This makes RLHF computationally expensive, requiring substantial GPU memory. A key limitation is reward hacking: since the reward model is a proxy for human preferences trained on finite data, the policy can find ways to exploit its imperfections — generating outputs that score highly on the reward model but are degenerate or low-quality. The KL penalty is the primary mitigation mechanism. Direct Preference Optimization (DPO, Rafailov et al., 2023) was proposed as a simplification derived from the same objective that eliminates the explicit reward model and RL training loop, replacing them with a single supervised loss directly on preference pairs.
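A sketch of the stage-2 reward-model objective in the Bradley-Terry form quoted above (tensor shapes are assumptions of this sketch):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen, r_rejected):
    """Bradley-Terry pairwise loss: -log sigmoid(r(x, y_w) - r(x, y_l)).

    r_chosen / r_rejected: scalar rewards for the preferred and rejected
    responses, each of shape (batch,) -- outputs of the reward model r_phi.
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```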
LoRA (Low-Rank Adaptation of Large Language Models) is a parameter-efficient fine-tuning (PEFT) technique proposed by Hu et al. (2021). Instead of updating all parameters of a pretrained model, LoRA freezes the original weight matrix W₀ and learns the weight change ΔW as a low-rank decomposition ΔW = BA, where B ∈ ℝ^(d×r), A ∈ ℝ^(r×k), and r ≪ min(d, k) is the rank. The adapted weight is W' = W₀ + (α/r)·BA, where α is a scaling hyperparameter. B is initialized to zero and A with random Gaussian values, ensuring the initial adapted output is identical to the pretrained model. During training, W₀ is frozen and only A and B are updated via gradient descent. After training, the scaled BA can be merged into W₀ (W = W₀ + (α/r)·BA), eliminating any inference latency relative to the original model. Trainable parameters per adapted layer are r·(d + k) instead of d·k, a reduction factor of d·k / (r·(d + k)) — 64× for a typical transformer layer with r=8 and d=k=1024. LoRA was originally applied to the query (Wq) and value (Wv) projection matrices in transformer self-attention, though practitioners often apply it to all linear layers for maximum performance. Key PEFT context: LoRA belongs to the reparametrization-based PEFT category, alongside adapters and prefix tuning. Its main advantage over adapter layers is zero additional inference latency after weight merging. Common variants include QLoRA (4-bit quantized base model with LoRA adapters), AdaLoRA (adaptive rank allocation via SVD), DoRA (weight-decomposed adaptation of direction and magnitude), and rsLoRA (rank-stabilized scaling α/√r).
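A minimal LoRA linear layer following the formulas above; this is a sketch, not the reference implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Computes W0 x + (alpha / r) * B(Ax), with W0 frozen."""

    def __init__(self, d_in, d_out, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad = False               # freeze W0
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)   # Gaussian init
        self.B = nn.Parameter(torch.zeros(d_out, r))         # zero init: BA = 0 at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

    def merge(self):
        """Fold the scaled BA into W0 for zero-overhead inference."""
        self.base.weight.data += self.scale * (self.B @ self.A)
```

The zero-initialized B guarantees the adapted layer starts out exactly equal to the frozen base layer, matching the initialization property described above.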
Scaling Laws are empirical regularities discovered by Kaplan et al. (2020) at OpenAI, describing how the performance of language models changes predictably with model size (parameter count N), dataset size (D), and compute budget (C). Cross-entropy loss scales as power laws with each of these three variables across many orders of magnitude. The study showed that architectural configuration (depth, width) has minimal impact at fixed N and C, that larger models are significantly more sample-efficient, and that optimally efficient training requires very large models on a relatively modest amount of data with early stopping. Hoffmann et al. (Chinchilla, 2022) refined these laws, showing that earlier models (including GPT-3) were massively undertrained and that optimal N and D should scale equally.
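The Chinchilla analysis fits a parametric loss whose form makes the equal-scaling conclusion concrete (exponents quoted approximately from Hoffmann et al., 2022):

```latex
% Chinchilla parametric loss; fitted exponents are approximate.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\qquad \alpha \approx 0.34, \quad \beta \approx 0.28
% Minimizing L under the compute constraint C \approx 6ND gives
% N_{\mathrm{opt}} \propto C^{a}, \quad D_{\mathrm{opt}} \propto C^{b},
% \quad a \approx b \approx 0.5,
% i.e. parameters and tokens should grow in equal proportion with compute.
```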
Self-attention is a computational mechanism introduced in the Transformer architecture (Vaswani et al., 2017). For each token in the input sequence, it computes a contextual representation as a weighted sum of the values (V) of all tokens, where the weights come from a softmax over the scaled dot products between that token's query (Q) and the keys (K) of all tokens (QKᵀ/√d_k). This allows every token to directly attend to information from any other position in the sequence, regardless of distance, overcoming the limitations of recurrent neural networks in modeling long-range dependencies.
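A direct transcription of single-head scaled dot-product attention, without masking or learned projections, for illustration:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for tensors of shape (batch, seq, d_k)."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k**0.5   # (batch, seq, seq)
    weights = F.softmax(scores, dim=-1)           # each row sums to 1
    return weights @ V                            # weighted sum of values
```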
Supervised Fine-Tuning (SFT) is a post-training stage in which a pre-trained language model is further optimized on a labeled set of (prompt, response) pairs. Each pair contains an instruction or question and a reference response written by a human or filtered automatically. The model minimizes cross-entropy loss on the response tokens. SFT is the first stage of the RLHF pipeline (Ouyang et al., 2022) and is critical for teaching the model to follow instructions. SFT alone can significantly improve model usability without requiring reinforcement learning. The method is used in InstructGPT, ChatGPT, Llama-2-Chat, and many other models.
In-Context Learning (ICL) is the ability of large language models to perform a new task from a handful of examples (called demonstrations or shots) given directly in the prompt, without modifying model weights. The concept was formalized by Brown et al. (2020) in the GPT-3 paper "Language Models are Few-Shot Learners" as a capability that emerges and strengthens with scale, demonstrated most strikingly at the 175B-parameter scale. In ICL, the prompt contains k (input, output) pairs demonstrating the task, followed by a new query input. Conditioned on these examples, the model produces output following the demonstration pattern. The number of examples k defines variants: zero-shot (k=0, natural-language task description only), one-shot (k=1), and few-shot (k=2–32, typically 4–8). Brown et al. showed that GPT-3 175B achieves competitive performance against fine-tuned models on many NLP tasks — using few-shot prompting alone.

The underlying mechanism of ICL remains an active research topic. Main hypotheses: (1) ICL implements implicit gradient descent in attention activation space (Akyürek et al. 2022, von Oswald et al. 2023); (2) models perform pattern matching over distributions of patterns seen during pretraining (Xie et al. 2022 — Bayesian inference framework); (3) ICL relies on induction heads — attention structures forming during pretraining (Olsson et al. 2022, Anthropic). Empirically, demonstration quality, ordering, and even labels significantly affect performance (Min et al. 2022).

ICL is the foundation of a broader family of prompt-engineering techniques: Chain-of-Thought (Wei et al. 2022) extends ICL with reasoning chains in demonstrations, instruction tuning (FLAN, T0) strengthens zero-shot ICL, and Retrieval-Augmented Generation dynamically selects demonstrations from a knowledge base. ICL became the dominant paradigm for using LLMs from 2022–2024, before being supplemented by instruction-tuned models requiring fewer or no examples.
Video pretraining is a family of self-supervised learning methods in which a model learns visual representations from raw video sequences. Instead of manually labeled data, the model optimizes next-frame prediction, masked video modeling, or latent feature prediction objectives. This approach is central to robotic foundation models, enabling scalable acquisition of interaction physics from large datasets such as Open-X-Embodiment.
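A sketch of the next-frame objective named above; the model interface and tensor layout are assumptions of this sketch:

```python
import torch
import torch.nn.functional as F

def next_frame_loss(model, frames):
    """Predict frame t+1 from frames <= t.

    frames: (batch, time, channels, height, width) video clip.
    model:  assumed to map the context frames to predictions with the
            same shape as the target frames -- illustrative only.
    """
    context, target = frames[:, :-1], frames[:, 1:]
    prediction = model(context)
    return F.mse_loss(prediction, target)
```

Masked video modeling follows the same recipe with masked spatiotemporal patches in place of the shifted-frame split.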
| Title | Publisher | Type |
|---|---|---|
| Improving Language Understanding by Generative Pre-Training (GPT-1) | OpenAI | scientific article |
| BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | Google AI / arXiv | scientific article |
| Language Models are Few-Shot Learners (GPT-3) | OpenAI / arXiv | scientific article |
| Training Compute-Optimal Large Language Models (Chinchilla) | DeepMind / arXiv | scientific article |
| Learning Transferable Visual Models From Natural Language Supervision (CLIP) | OpenAI / arXiv | scientific article |