1) Pretraining: the model learns general representations on a very large, diverse corpus, typically via self-supervised objectives (e.g., next-token prediction, masked language modeling, contrastive learning). 2) Adaptation: the same model is adapted to specific tasks via fine-tuning, instruction tuning, RLHF, prompting, or parameter-efficient adapters such as LoRA. Scaling parameters, data, and compute is associated with "emergent capabilities": abilities not observed in smaller models.
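As an illustration of the next-token prediction objective named above, the following minimal sketch computes the average negative log-likelihood of the true next tokens. The function name `next_token_loss`, the three-token vocabulary, and the probabilities are hypothetical values chosen for illustration; a real model would produce the distributions itself.

```python
import math

def next_token_loss(probs_per_step, target_ids):
    """Average negative log-likelihood of the true next token at each step."""
    total = 0.0
    for probs, target in zip(probs_per_step, target_ids):
        total += -math.log(probs[target])
    return total / len(target_ids)

# Toy example: each row is a (hypothetical) predicted distribution over 3 tokens.
predicted = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
targets = [0, 1]  # index of the token that actually came next
loss = next_token_loss(predicted, targets)
print(loss)
```

The loss falls as the model assigns more probability to the tokens that actually follow, which is the signal pretraining optimizes.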
Removes the need to train a separate model from scratch for each task: one large, general model adapts to many applications at low marginal cost.
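To make "low marginal cost" concrete for adapter-based adaptation, here is a rough sketch of the parameter count of a LoRA-style low-rank update on a single weight matrix. It assumes the standard factorization of the update into two matrices of shapes (d_out x rank) and (rank x d_in); the layer size 4096 and rank 8 are illustrative assumptions, not values from any particular model.

```python
def lora_trainable_params(d_out, d_in, rank):
    """Parameters in the two low-rank factors: (d_out x rank) plus (rank x d_in)."""
    return rank * (d_out + d_in)

# Hypothetical 4096 x 4096 projection adapted with rank 8.
full = 4096 * 4096
lora = lora_trainable_params(4096, 4096, rank=8)
fraction = lora / full
print(f"full: {full}, adapter: {lora}, fraction: {fraction:.4%}")
```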
Web-scale corpora contain test data from known benchmarks, toxic content, and factual errors. Without deduplication and filtering, metrics are overstated and the model memorizes specific examples.
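A minimal sketch of exact deduplication, assuming documents are plain strings and that lowercasing plus whitespace collapsing is an acceptable normalization. Production pipelines typically add near-duplicate detection as well; the helper names below are hypothetical.

```python
import hashlib

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())

def deduplicate(docs):
    """Keep the first occurrence of each document, compared by normalized hash."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

docs = ["Hello  world", "hello world", "Something else"]
print(deduplicate(docs))
```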
Foundation models trained on internet data often include evaluation sets within the pretraining corpus, inflating scores on MMLU, HumanEval, GSM8K, etc.
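One common way to look for this kind of contamination is n-gram overlap between benchmark items and corpus documents. The sketch below assumes whitespace tokenization and an 8-gram window, both arbitrary choices for illustration; the example strings are invented and not taken from any real benchmark.

```python
def ngrams(text, n):
    """Set of lowercase whitespace-token n-grams in the text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(benchmark_item, corpus_doc, n=8):
    """True if any n-gram of the benchmark item also occurs in the corpus document."""
    return bool(ngrams(benchmark_item, n) & ngrams(corpus_doc, n))

# Invented example strings.
question = "if a train travels 60 miles in 2 hours what is its average speed"
leaked = "forum post: if a train travels 60 miles in 2 hours what is its average speed answer 30"
clean = "the train left the station at noon and arrived two hours later"

print(is_contaminated(question, leaked))
print(is_contaminated(question, clean))
```

Documents flagged this way can be dropped from the pretraining corpus, or the affected benchmark items can be excluded when reporting scores.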
Pretraining a foundation model costs USD 10M to 100M+ and requires clusters of 1000+ accelerators. Hardware failures, loss spikes, and checkpoint restarts are the rule, not the exception.
A foundation model trained on next-token prediction is not automatically helpful/harmless/honest. It requires post-training (SFT + RLHF/DPO) to behave according to user intent.
BERT (Google) and GPT (OpenAI) established the 'pretrain-then-adapt' paradigm as the NLP standard.
GPT-3 demonstrated that scale gives rise to few-shot capabilities without task-specific fine-tuning.
Bommasani et al. formalize the paradigm and introduce the name "foundation model".
Extension of the paradigm beyond text to image, video, and audio.
Google DeepMind brings the paradigm to robotics by combining a VLM with manipulation.
Open-weight models become competitive with closed counterparts.
Google DeepMind introduces RT-2, combining a VLM (PaLI-X) with robotic manipulation, the first widely adopted robotics foundation model.
DeepMind shows that most contemporary foundation models were undertrained: for a fixed compute budget, a smaller model trained on more data outperforms a larger model trained on less.
Number of trainable parameters. Parameter scale is one of three scaling-law dimensions (alongside data and compute) and correlates with the emergence of new capabilities.
Number of tokens (or samples) seen during pretraining. Chinchilla scaling laws (Hoffmann et al., 2022) suggest ~20 tokens per parameter as compute-optimal.
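The ~20 tokens-per-parameter rule of thumb can be applied directly. The sketch below treats the ratio as a fixed constant, which is a simplification of the fitted scaling laws; the function name is hypothetical.

```python
TOKENS_PER_PARAM = 20  # approximate compute-optimal ratio, treated as a constant here

def compute_optimal_tokens(n_params):
    """Approximate compute-optimal number of training tokens for a model size."""
    return TOKENS_PER_PARAM * n_params

tokens_70b = compute_optimal_tokens(70e9)
print(f"{tokens_70b:.2e} tokens for a 70B-parameter model")
```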
Total compute (in FLOPs) spent on pretraining. The third scaling-law dimension; together with parameters and data, it determines the compute-optimal model.
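A rough sketch of how the three dimensions relate, assuming the widely used approximation C ≈ 6·N·D for dense Transformers (N parameters, D tokens) and a fixed ratio of about 20 tokens per parameter. Neither assumption is exact, and the function names are hypothetical.

```python
import math

def training_flops(n_params, n_tokens):
    """Approximate training compute, assuming C ~ 6 * N * D."""
    return 6 * n_params * n_tokens

def compute_optimal_split(flops_budget, tokens_per_param=20):
    """Solve C = 6 * N * D with D = tokens_per_param * N for N and D."""
    n_params = math.sqrt(flops_budget / (6 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

budget = training_flops(70e9, 1.4e12)
n_params, n_tokens = compute_optimal_split(budget)
print(f"budget {budget:.2e} FLOPs -> {n_params:.2e} params, {n_tokens:.2e} tokens")
```

Under these assumptions, both the compute-optimal parameter count and the token count grow with the square root of the compute budget.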
Composition and proportions of data sources in the pretraining corpus (e.g. web, code, books, multimodal). Determines the model's capability profile.
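A data mix is often expressed as sampling weights per source. The proportions below are invented for illustration and do not describe any real model; the sketch only shows how such weights might drive which source each training document is drawn from.

```python
import random

# Hypothetical proportions; they must sum to 1.
mix = {"web": 0.6, "code": 0.2, "books": 0.15, "multimodal": 0.05}

def sample_sources(mix, k, seed=0):
    """Draw k source names with probability proportional to their mix weight."""
    rng = random.Random(seed)
    names = list(mix)
    weights = [mix[name] for name in names]
    return rng.choices(names, weights=weights, k=k)

batch = sample_sources(mix, k=1000)
for name in mix:
    print(name, batch.count(name))
```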
Maximum sequence length supported by the model. Has grown over generations, from 512 tokens (BERT) to 1M+ tokens (Gemini 1.5, Llama 4).
Input/output modalities supported by the model (text, image, audio, video, robotic actions). A foundation model may be unimodal or multimodal.
A foundation model is a paradigm, not a specific architecture; the execution paradigm reflects the most common realization (dense Transformer). Specific foundation models may instead use sparse mixture-of-experts (MoE) architectures.
Pretraining foundation models requires massive parallelism (data + model + pipeline parallelism) across clusters of thousands of accelerators.
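As a sketch of how the parallelism degrees combine, the total number of accelerators is usually the product of the data-, model- (tensor-), and pipeline-parallel degrees. The cluster size and degrees below are hypothetical, and the function name is an assumption for illustration.

```python
def data_parallel_replicas(world_size, tensor_parallel, pipeline_parallel):
    """Number of data-parallel model replicas that fit on the cluster."""
    model_parallel = tensor_parallel * pipeline_parallel
    if world_size % model_parallel != 0:
        raise ValueError("world_size must be divisible by tensor_parallel * pipeline_parallel")
    return world_size // model_parallel

# Hypothetical cluster: 2048 accelerators, 8-way tensor and 8-way pipeline parallelism.
replicas = data_parallel_replicas(2048, tensor_parallel=8, pipeline_parallel=8)
print(replicas)
```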
Pretraining foundation models requires massive mixed-precision matrix operations (BF16/FP16/FP8), the native domain of tensor-core GPUs (NVIDIA H100/B200, AMD MI300).
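One reason precision matters is memory: the bytes per parameter scale directly with the number format. The sketch below estimates the memory for the weights alone, assuming a hypothetical 70B-parameter model; real training also needs memory for gradients, optimizer state, and activations, which this ignores.

```python
# Storage size of one value in each format, in bytes.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2, "fp8": 1}

def weight_memory_gb(n_params, dtype):
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

for dtype in ("fp32", "bf16", "fp8"):
    print(dtype, weight_memory_gb(70e9, dtype), "GB")
```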
Google TPUs (v4/v5p/Trillium) were designed from the start for pretraining large models; PaLM, Gemini, and many Google foundation models were trained on TPUs.