Training a model on massive unlabeled corpora using self-supervised objectives (e.g., next-token prediction, masked language modeling) to learn general-purpose representations before fine-tuning for specific tasks.
The model receives an input fragment with partially hidden or shifted information (next-token prediction in GPT, masked language modeling in BERT, contrastive learning in CLIP, next-frame prediction in world models). A loss function measures reconstruction/prediction quality. Training runs on GPU/TPU clusters for weeks or months over trillions of tokens. The pretrained model becomes a foundation that can be further fine-tuned, instruction-tuned, RLHF-aligned, or LoRA-adapted for specific applications.
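As a concrete illustration, here is a minimal sketch of the next-token prediction objective in PyTorch; the model is assumed to map token IDs to per-position vocabulary logits (names and shapes are illustrative, not taken from any specific codebase):

```python
import torch
import torch.nn.functional as F

def next_token_loss(model, tokens):
    """Causal LM objective: predict token t+1 from tokens <= t.

    tokens: LongTensor of shape (batch, seq_len).
    model:  assumed to map token IDs to logits of shape
            (batch, seq_len - 1, vocab_size) -- an assumption of this sketch.
    """
    inputs, targets = tokens[:, :-1], tokens[:, 1:]   # shift by one position
    logits = model(inputs)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),          # flatten batch and time
        targets.reshape(-1),
    )
```

Masked language modeling, contrastive, and next-frame objectives follow the same pattern: corrupt or withhold part of the raw data and train the model to recover it.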
Traditional supervised learning required enormous hand-labeled datasets per task, which did not scale. Self-supervised pretraining solves this by learning from raw unlabeled data — practically unlimited in supply — and transferring that knowledge to many downstream tasks with minimal supervised fine-tuning.
Raw data corpus — source of data for self-supervised training
Massive unlabeled dataset (web crawl, code, books, video, robot telemetry). Typical scale: 10¹²–10¹³ tokens for LLMs.
Loss function without human labels
Predictive task that uses the data structure as the training signal — next-token prediction, masked language modeling, contrastive loss, next-frame prediction.
Backbone holding the representations learned during pretraining
Most often a Transformer (encoder-only, decoder-only, or encoder-decoder); also Diffusion Models in generative video and images.
Training infrastructure
Thousands of GPUs/TPUs running in parallel for weeks/months. Pretraining a GPT-4-class LLM typically requires 10²⁵+ FLOPs.
Fully parallel
Pretraining parallelizes fully across data, tensor, and pipeline parallelism combined. Gradient synchronization is the main bottleneck on very large clusters.
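A minimal data-parallel sketch using PyTorch DistributedDataParallel, assuming a `torchrun` launch with one process per GPU; tensor and pipeline parallelism require model-specific sharding and are omitted here:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Data-parallel skeleton; assumes a torchrun launch, which sets the
# environment variables that init_process_group reads. Sketch only.
dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(4096, 4096).cuda()   # stand-in for a Transformer
model = DDP(model, device_ids=[local_rank])  # all-reduces gradients each step;
                                             # this sync is the scaling bottleneck
```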
Dense (all paths active)
In standard dense pretraining, all parameters are updated at every step. MoE variants introduce sparse activation, so each token updates only its routed experts plus the shared layers.
Corpus size (tokens)
Number of tokens in the training corpus. Scale: 10⁹ (small models) to 10¹³+ (frontier LLMs).
Model size (parameters)
Number of model parameters. Chinchilla scaling laws suggest an optimal tokens-to-params ratio of about 20:1.
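Worked arithmetic for the 20:1 rule (figures illustrative):

```python
# Chinchilla-style compute-optimal token budget: ~20 tokens per parameter.
params = 70e9                  # e.g. a 70B-parameter model
tokens = 20 * params           # compute-optimal corpus size
print(f"{tokens:.2e} tokens")  # 1.40e+12, i.e. about 1.4T tokens
```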
Self-supervised objective type
Choice of task: causal LM (GPT), masked LM (BERT), contrastive (CLIP), denoising (T5), next-frame (world models).
Compute budget (FLOPs)
Total floating-point operations used in training. GPT-3 ≈ 3·10²³; GPT-4 is estimated at ~2·10²⁵; frontier runs (2025+) are estimated around 10²⁶.
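These budgets follow from the standard approximation C ≈ 6·N·D training FLOPs for a dense Transformer with N parameters trained on D tokens; for example, GPT-3:

```python
# C ~ 6 * N * D: roughly 2ND FLOPs forward plus 4ND backward per token.
N = 175e9              # GPT-3 parameters
D = 300e9              # GPT-3 training tokens
C = 6 * N * D
print(f"{C:.2e} FLOPs")  # 3.15e+23, matching the figure quoted above
```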
Data quality filtering
Deduplication, quality classification, and toxicity filtering pipeline. Determines the effective fraction of useful tokens in the corpus.
Benchmark data (MMLU, HellaSwag) leaking into the pretraining corpus artificially inflates evaluation scores.
Decontamination pipeline — remove benchmark n-grams from the training corpus and evaluate on fresh datasets (held-out, post-training).
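A minimal sketch of n-gram decontamination; the 13-gram window follows common practice (GPT-3 used 13-grams), but the matching policy here is a simplification:

```python
def ngrams(text, n=13):
    """Lowercased word n-grams of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(document, benchmark_ngrams, n=13):
    """Flag a training document sharing any n-gram with a benchmark item."""
    return not ngrams(document, n).isdisjoint(benchmark_ngrams)

# Usage: build the reference set once, then filter the corpus.
benchmark_items = ["...benchmark questions and answers..."]  # e.g. MMLU items
benchmark_ngrams = set()
for item in benchmark_items:
    benchmark_ngrams |= ngrams(item)
corpus = [d for d in ["...web document..."]
          if not is_contaminated(d, benchmark_ngrams)]
```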
With large learning rates and fp16, training loss can spike and corrupt weights. Recovery requires rolling back to a checkpoint that may be days old.
Mixed precision (bfloat16), gradient clipping, learning rate warmup, frequent checkpointing, gradient-statistics monitoring.
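A sketch of how these measures combine in a training step; the model, data, and all hyperparameters are stand-ins:

```python
import torch

# Illustrative stability measures: bf16 weights, LR warmup, gradient
# clipping, and periodic checkpointing. All hyperparameters are examples.
model = torch.nn.Linear(1024, 1024, dtype=torch.bfloat16)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
loader = [torch.randn(8, 1024, dtype=torch.bfloat16) for _ in range(10)]

def lr_at(step, base_lr=3e-4, warmup=2000):
    # Linear warmup; real runs typically follow this with cosine decay.
    return base_lr * min(1.0, step / warmup)

for step, batch in enumerate(loader, start=1):
    for group in opt.param_groups:
        group["lr"] = lr_at(step)
    loss = model(batch).float().pow(2).mean()   # stand-in loss, computed in fp32
    loss.backward()
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    opt.step()
    opt.zero_grad()
    if step % 1000 == 0:                        # frequent checkpoints
        torch.save(model.state_dict(), f"ckpt_{step}.pt")
```

Monitoring `grad_norm` over time is the gradient-statistics check mentioned above: a sudden jump often precedes a loss spike.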
Training too large a model on too little data (pre-Chinchilla) wastes compute and underperforms a smaller model on a larger corpus.
Apply Chinchilla scaling laws (~20 tokens/param) or newer ones (Llama 3 trained at >100 tokens/param for inference efficiency).
Raw web crawl contains duplicates, spam, low-quality, and toxic content. Without filtering, the result is a model weaker than one trained on a 10× smaller clean corpus.
Deduplication pipeline (MinHash, exact match), quality classification (FastText, Wikipedia-style classifier), toxicity filtering.
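A self-contained MinHash sketch for near-duplicate detection; signature size, shingle length, and threshold are illustrative, and production pipelines add LSH banding for scalability:

```python
import hashlib

def minhash(text, num_perm=64, shingle=5):
    """MinHash signature: the minimum hash per seeded 'permutation'."""
    words = text.lower().split()
    shingles = {" ".join(words[i:i + shingle])
                for i in range(len(words) - shingle + 1)} or {text.lower()}
    signature = []
    for seed in range(num_perm):
        signature.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles
        ))
    return signature

def jaccard_estimate(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Documents whose estimated similarity exceeds a threshold (e.g. 0.8)
# are treated as near-duplicates, and all but one copy is dropped.
```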
GENESIS · Source paper: Improving Language Understanding by Generative Pre-Training
Word2Vec — pretraining of word embeddings
Breakthrough: Mikolov et al. show that self-supervised pretraining (skip-gram, CBOW) yields general-purpose word representations.
GPT-1 and BERT — pretraining + fine-tuning as a paradigm
Breakthrough: OpenAI GPT (causal LM) and Google BERT (masked LM) establish the standard recipe: large-scale pretraining followed by small task-specific fine-tuning.
GPT-3 — pretraining produces models capable of in-context learning
Breakthrough: 175B parameters trained on 300B tokens demonstrate that pretrained knowledge alone solves many tasks without fine-tuning.
CLIP — multimodal contrastive pretraining
OpenAI CLIP unifies images and text in one embedding space via contrastive pretraining on 400M image-text pairs.
Chinchilla — optimal tokens-to-parameters ratio
Breakthrough: DeepMind shows that prior LLMs were undertrained — compute-optimal training needs roughly 20 tokens per parameter.
Llama 2 — frontier-scale open-weight pretraining
Meta releases weights of a model trained on 2T tokens, democratizing access to large pretrained models.
Robotics foundation models — pretraining for VLA
Breakthrough: Pi-Zero (Physical Intelligence), Gemini Robotics, and RT-2 apply pretraining on multimodal + robot data as the foundation of VLAs.
Frontier-scale pretraining — 10²⁶ FLOPs
GPT-5, Gemini 3, Claude Opus 4, and Grok 4 reach scales requiring clusters of 100k+ H100/B200 GPUs.
LLM pretraining is the dominant workload for H100/B200/GB200 GPUs — fp16/bf16/fp8 GEMM ops are their primary design target.
Google TPU v4/v5/Trillium are designed around pretraining Gemini and earlier models — high systolic-array throughput and Inter-Chip Interconnect (ICI).
CPUs can train small R&D models, but frontier-scale pretraining is infeasible on CPUs due to limited tensor-ops throughput.
Transformer is a neural network architecture proposed by Vaswani et al. in "Attention Is All You Need" (NeurIPS 2017). It replaced earlier approaches based on recurrent (RNN, LSTM) and convolutional (CNN) networks in sequential tasks. The key element is the multi-head self-attention mechanism, which allows every position in a sequence to directly participate in computations involving every other position, enabling the model to learn long-range dependencies in a constant number of sequential steps (rather than the linear number required by RNNs). The architecture consists of encoder and decoder blocks (or encoder-only / decoder-only variants) containing: multi-head attention layers, feed-forward networks, residual connections, and layer normalization (LayerNorm). Sequence positions are encoded via positional encoding (sinusoidal or learned). The Transformer has become the foundation of LLMs (GPT, BERT, T5, LLaMA, Claude, Gemini), Vision Transformers (ViT), multimodal models (CLIP, Flamingo), and tabular foundation models (TabPFN). The main limitation — quadratic attention complexity with respect to sequence length (O(n²)) — is an active research direction (FlashAttention, sliding window, linear attention, SSM).
A Large Language Model (LLM) is a class of machine learning models based on the Transformer architecture, trained on large text datasets via autoregressive language modeling (next-token prediction). These models have billions of parameters and can generate coherent text, answer questions, write code, translate languages, and perform many other language-cognitive tasks without task-specific fine-tuning. The term covers models such as GPT, LLaMA, Gemini, Claude, and Mistral. Most modern LLMs are instruction-tuned (SFT + RLHF) after the pre-training phase.
Instruction Tuning (also called instruction fine-tuning or supervised fine-tuning, SFT) is a post-pretraining technique for language models. A pretrained model is fine-tuned on a curated dataset of examples, where each example consists of a natural language instruction describing a task, an optional input context, and the expected output. The training objective is standard supervised learning: cross-entropy loss over the target output tokens, with loss masked on the instruction/input portions. The key finding, established by Wei et al. (2021) in the FLAN paper, is that training on a sufficiently large and diverse set of instruction-formatted tasks improves zero-shot generalization to unseen task types. This generalization scales with the number of task clusters and the model size. Instruction Tuning is distinct from RLHF (Reinforcement Learning from Human Feedback): it uses only supervised learning on demonstration data, without a reward model or RL optimization. In practice, instruction tuning is often the first stage in a post-training pipeline, followed optionally by RLHF or direct preference optimization (DPO). Common dataset formats include the Alpaca three-field format (instruction, input, output) and the multi-turn conversation format used in chat models.
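A minimal sketch of the loss masking described above, using PyTorch's convention that label -100 is ignored by cross-entropy; the prompt/response boundary is assumed known:

```python
import torch
import torch.nn.functional as F

def sft_labels(input_ids, prompt_len):
    """Mask instruction/input tokens so loss covers only the response."""
    labels = input_ids.clone()
    labels[:, :prompt_len] = -100        # the ignore_index of F.cross_entropy
    return labels

def sft_loss(logits, labels):
    # Shift so position t predicts token t+1, as in causal LM training.
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )
```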
Reinforcement Learning from Human Feedback (RLHF) is a multi-stage training pipeline used to align language models and other AI systems with human preferences and intent. The approach was formally introduced for deep RL in Christiano et al. (2017), and scaled to large language models in Ouyang et al. (2022) (InstructGPT), where it became the primary alignment technique for systems such as ChatGPT, Claude, and Gemini. The standard RLHF pipeline for LLMs consists of three sequential stages:

1. Supervised Fine-Tuning (SFT): A pretrained language model is fine-tuned on a curated dataset of high-quality (prompt, response) pairs produced by human annotators, yielding a base aligned policy π_SFT.

2. Reward Model Training: Human annotators compare pairs of model responses to the same prompt and express preferences (which response is better). These pairwise comparisons are used to train a scalar reward model r_φ(x, y), typically using a Bradley-Terry model as the preference objective: loss = -E[log σ(r(x, y_w) - r(x, y_l))], where y_w is the preferred and y_l the rejected response.

3. RL Fine-Tuning via PPO: The SFT-initialized policy π_θ is optimized with Proximal Policy Optimization (PPO) to maximize the reward from r_φ, subject to a KL divergence penalty that prevents the policy from drifting too far from π_SFT: Objective(x, y) = r_φ(x, y) − β · KL(π_θ(y|x) || π_SFT(y|x)). The KL penalty with coefficient β is critical to prevent reward hacking.

During PPO training, four models are needed simultaneously: the active policy, a frozen reference policy (π_SFT), the reward model, and a value/critic network. This makes RLHF computationally expensive, requiring substantial GPU memory. A key limitation is reward hacking: since the reward model is a proxy for human preferences trained on finite data, the policy can find ways to exploit its imperfections — generating outputs that score highly on the reward model but are degenerate or low-quality. The KL penalty is the primary mitigation mechanism. Direct Preference Optimization (DPO, Rafailov et al., 2023) was proposed as a simplification derived from the same objective that eliminates the explicit reward model and RL training loop, replacing them with a single supervised loss directly on preference pairs.
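A sketch of the stage-2 reward-model objective in the Bradley-Terry form quoted above (tensor shapes are assumptions of this sketch):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen, r_rejected):
    """Bradley-Terry pairwise loss: -log sigmoid(r(x, y_w) - r(x, y_l)).

    r_chosen / r_rejected: scalar rewards for the preferred and rejected
    responses, each of shape (batch,) -- outputs of the reward model r_phi.
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```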
LoRA (Low-Rank Adaptation of Large Language Models) is a parameter-efficient fine-tuning (PEFT) technique proposed by Hu et al. (2021). Instead of updating all parameters of a pretrained model, LoRA freezes the original weight matrix W₀ and learns the weight change ΔW as a low-rank decomposition ΔW = BA, where B ∈ ℝ^(d×r), A ∈ ℝ^(r×k), and r ≪ min(d, k) is the rank. The adapted weight is W' = W₀ + (α/r)·BA, where α is a scaling hyperparameter. B is initialized to zero and A with random Gaussian values, ensuring the initial adapted output is identical to the pretrained model. During training, W₀ is frozen and only A and B are updated via gradient descent. After training, the scaled BA can be merged into W₀ (W = W₀ + (α/r)·BA), eliminating any inference latency relative to the original model. Trainable parameters per adapted layer are r·(d + k) instead of d·k, a reduction factor of d·k / (r·(d + k)) — 64× for a typical transformer layer with r=8 and d=k=1024. LoRA was originally applied to the query (Wq) and value (Wv) projection matrices in transformer self-attention, though practitioners often apply it to all linear layers for maximum performance. Key PEFT context: LoRA belongs to the reparametrization-based PEFT category, alongside adapters and prefix tuning. Its main advantage over adapter layers is zero additional inference latency after weight merging. Common variants include QLoRA (4-bit quantized base model with LoRA adapters), AdaLoRA (adaptive rank allocation via SVD), DoRA (weight-decomposed adaptation of direction and magnitude), and rsLoRA (rank-stabilized scaling α/√r).
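A minimal LoRA linear layer following the formulas above; this is a sketch, not the reference implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Computes W0 x + (alpha / r) * B(Ax), with W0 frozen."""

    def __init__(self, d_in, d_out, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad = False               # freeze W0
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)   # Gaussian init
        self.B = nn.Parameter(torch.zeros(d_out, r))         # zero init: BA = 0 at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

    def merge(self):
        """Fold the scaled BA into W0 for zero-overhead inference."""
        self.base.weight.data += self.scale * (self.B @ self.A)
```

The zero-initialized B guarantees the adapted layer starts out exactly equal to the frozen base layer, matching the initialization property described above.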
Scaling Laws are empirical regularities discovered by Kaplan et al. (2020) at OpenAI, describing how the performance of language models changes predictably with model size (parameter count N), dataset size (D), and compute budget (C). Cross-entropy loss scales as power laws with each of these three variables across many orders of magnitude. The study showed that architectural configuration (depth, width) has minimal impact at fixed N and C, that larger models are significantly more sample-efficient, and that optimally efficient training requires very large models on a relatively modest amount of data with early stopping. Hoffmann et al. (Chinchilla, 2022) refined these laws, showing that earlier models (including GPT-3) were massively undertrained and that optimal N and D should scale equally.
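The Chinchilla analysis fits a parametric loss whose form makes the equal-scaling conclusion concrete (exponents quoted approximately from Hoffmann et al., 2022):

```latex
% Chinchilla parametric loss; fitted exponents are approximate.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\qquad \alpha \approx 0.34, \quad \beta \approx 0.28
% Minimizing L under the compute constraint C \approx 6ND gives
% N_{\mathrm{opt}} \propto C^{a}, \quad D_{\mathrm{opt}} \propto C^{b},
% \quad a \approx b \approx 0.5,
% i.e. parameters and tokens should grow in equal proportion with compute.
```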
Self-attention is a computational mechanism introduced in the Transformer architecture (Vaswani et al., 2017). For each token in the input sequence, it computes a contextual representation as a weighted sum of the values (V) of all tokens, where the weights come from a softmax over the scaled dot products between that token's query (Q) and the keys (K) of all tokens (QKᵀ/√d_k). This allows every token to directly attend to information from any other position in the sequence, regardless of distance, overcoming the limitations of recurrent neural networks in modeling long-range dependencies.
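A direct transcription of single-head scaled dot-product attention, without masking or learned projections, for illustration:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for tensors of shape (batch, seq, d_k)."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k**0.5   # (batch, seq, seq)
    weights = F.softmax(scores, dim=-1)           # each row sums to 1
    return weights @ V                            # weighted sum of values
```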
Supervised Fine-Tuning (SFT) is a post-training stage in which a pre-trained language model is further optimized on a labeled set of (prompt, response) pairs. Each pair contains an instruction or question and a reference response written by a human or filtered automatically. The model minimizes cross-entropy loss on the response tokens. SFT is the first stage of the RLHF pipeline (Ouyang et al., 2022) and is critical for teaching the model to follow instructions. SFT alone can significantly improve model usability without requiring reinforcement learning. The method is used in InstructGPT, ChatGPT, Llama-2-Chat, and many other models.
In-Context Learning (ICL) is the ability of large language models to perform a new task from a handful of examples (called demonstrations or shots) given directly in the prompt, without modifying model weights. The concept was formalized by Brown et al. (2020) in the GPT-3 paper "Language Models are Few-Shot Learners" as a capability that emerges and strengthens with scale, demonstrated most strikingly at the 175B-parameter scale. In ICL, the prompt contains k (input, output) pairs demonstrating the task, followed by a new query input. Conditioned on these examples, the model produces output following the demonstration pattern. The number of examples k defines variants: zero-shot (k=0, natural-language task description only), one-shot (k=1), and few-shot (k=2–32, typically 4–8). Brown et al. showed that GPT-3 175B achieves competitive performance against fine-tuned models on many NLP tasks — using few-shot prompting alone.

The underlying mechanism of ICL remains an active research topic. Main hypotheses: (1) ICL implements implicit gradient descent in attention activation space (Akyürek et al. 2022, von Oswald et al. 2023); (2) models perform pattern matching over distributions of patterns seen during pretraining (Xie et al. 2022 — Bayesian inference framework); (3) ICL relies on induction heads — attention structures forming during pretraining (Olsson et al. 2022, Anthropic). Empirically, demonstration quality, ordering, and even labels significantly affect performance (Min et al. 2022).

ICL is the foundation of a broader family of prompt-engineering techniques: Chain-of-Thought (Wei et al. 2022) extends ICL with reasoning chains in demonstrations, instruction tuning (FLAN, T0) strengthens zero-shot ICL, and Retrieval-Augmented Generation dynamically selects demonstrations from a knowledge base. ICL became the dominant paradigm for using LLMs from 2022–2024, before being supplemented by instruction-tuned models requiring fewer or no examples.
Video pretraining is a family of self-supervised learning methods in which a model learns visual representations from raw video sequences. Instead of manually labeled data, the model optimizes next-frame prediction, masked video modeling, or latent feature prediction objectives. This approach is central to robotic foundation models, enabling scalable acquisition of interaction physics from large datasets such as Open-X-Embodiment.
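A sketch of the next-frame objective named above; the model interface and tensor layout are assumptions of this sketch:

```python
import torch
import torch.nn.functional as F

def next_frame_loss(model, frames):
    """Predict frame t+1 from frames <= t.

    frames: (batch, time, channels, height, width) video clip.
    model:  assumed to map the context frames to predictions with the
            same shape as the target frames -- illustrative only.
    """
    context, target = frames[:, :-1], frames[:, 1:]
    prediction = model(context)
    return F.mse_loss(prediction, target)
```

Masked video modeling follows the same recipe with masked spatiotemporal patches in place of the shifted-frame split.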
| Title | Publisher | Type |
|---|---|---|
| Improving Language Understanding by Generative Pre-Training (GPT-1) | OpenAI | scientific article |
| BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | Google AI / arXiv | scientific article |
| Language Models are Few-Shot Learners (GPT-3) | OpenAI / arXiv | scientific article |
| Training Compute-Optimal Large Language Models (Chinchilla) | DeepMind / arXiv | scientific article |
| Learning Transferable Visual Models From Natural Language Supervision (CLIP) | OpenAI / arXiv | scientific article |