Robots Atlas

Supervised Fine-Tuning

Enables adaptation of large pre-trained language models to specific tasks and instruction-following behavior using relatively small, labeled datasets of demonstrations.

Category
Abstraction level: Operation level

Chatbots and language assistants · Instruction model fine-tuning · First stage of RLHF · Domain specialization of models

The SFT dataset contains (prompt p, response y) pairs. The loss is the negative log-likelihood of the response given the prompt, L = -Σ_t log P(y_t | p, y_<t), and the model is trained with gradient descent on these pairs, typically with a small learning rate. Parameter-efficient techniques such as LoRA or QLoRA are often used to reduce compute and memory costs. Data may come from human annotators (e.g. FLAN, Dolly) or be synthetically generated by a stronger model.
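A minimal sketch of this loss, assuming a Hugging Face-style causal LM; the model name, prompt, and response are illustrative placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; any causal LM with a compatible tokenizer works the same way.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Explain photosynthesis in one sentence.\n"
response = "Plants convert sunlight, water, and CO2 into sugars and oxygen."

prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
response_ids = tokenizer(response, return_tensors="pt").input_ids
input_ids = torch.cat([prompt_ids, response_ids], dim=1)

# Mask the prompt tokens with -100 so the loss L = -sum_t log P(y_t | p, y_<t)
# is computed only over the response tokens.
labels = input_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100

outputs = model(input_ids=input_ids, labels=labels)
loss = outputs.loss   # mean negative log-likelihood of the response tokens
loss.backward()       # a full SFT run would follow with an optimizer step at a small learning rate
```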

Pre-trained models are good at text completion but not at following user instructions, answering questions in chat format, or generating safe and helpful responses.

Common pitfalls

Catastrophic forgetting
HIGH

SFT on a narrow dataset can cause the model to forget previously learned capabilities. Use diverse datasets or regularization.

Overfitting on small SFT dataset
MEDIUM

With too few examples or too many epochs, the model memorizes demonstrations rather than generalizing.

Data quality is critical
HIGH

Noisy, inconsistent, or biased SFT data is directly reflected in model behavior. Quality > quantity.

GENESIS · Source paper

Training language models to follow instructions with human feedback
2022 · NeurIPS 2022 · Long Ouyang, Jeff Wu, Xu Jiang et al.
2019

Pre-training + fine-tuning paradigm (GPT-1, BERT)

breakthrough

Radford et al. and Devlin et al. establish the pre-train/fine-tune paradigm.

2021

FLAN - SFT on instruction datasets

Wei et al. show that fine-tuning on diverse instruction datasets improves zero-shot performance.

2022

InstructGPT - SFT as stage 1 of RLHF

breakthrough

Ouyang et al. formalize SFT as the first step before reward modeling and PPO.

2023

LoRA and QLoRA - efficient SFT

Hu et al. (LoRA) and Dettmers et al. (QLoRA) enable SFT on consumer hardware by training only low-rank adapters.
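A minimal sketch of the adapter approach, assuming the Hugging Face peft library; the base model and target modules are illustrative and vary by architecture:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder base model

lora_config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling applied to the adapter output
    target_modules=["c_attn"],  # GPT-2 attention projection; differs per architecture
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter matrices are trainable
```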

Commonly used with

RLHF

Reinforcement Learning from Human Feedback (RLHF) is a multi-stage training pipeline used to align language models and other AI systems with human preferences and intent. The approach was formally introduced for deep RL by Christiano et al. (2017) and scaled to large language models by Ouyang et al. (2022) (InstructGPT), where it became the primary alignment technique for systems such as ChatGPT, Claude, and Gemini.

The standard RLHF pipeline for LLMs consists of three sequential stages:

1. Supervised Fine-Tuning (SFT): A pretrained language model is fine-tuned on a curated dataset of high-quality (prompt, response) pairs produced by human annotators, yielding a base aligned policy π_SFT.

2. Reward Model Training: Human annotators compare pairs of model responses to the same prompt and express preferences (which response is better). These pairwise comparisons are used to train a scalar reward model r_φ(x, y), typically with a Bradley-Terry preference objective: loss = -E[log σ(r_φ(x, y_w) - r_φ(x, y_l))], where y_w is the preferred and y_l the rejected response.

3. RL Fine-Tuning via PPO: The SFT-initialized policy π_θ is optimized with Proximal Policy Optimization (PPO) to maximize the reward from r_φ, subject to a KL-divergence penalty that prevents the policy from drifting too far from π_SFT: Objective(x, y) = r_φ(x, y) − β · KL(π_θ(y|x) || π_SFT(y|x)). The KL penalty with coefficient β is critical to prevent reward hacking.

During PPO training, four models are needed simultaneously: the active policy, a frozen reference policy (π_SFT), the reward model, and a value/critic network. This makes RLHF computationally expensive, requiring substantial GPU memory. A key limitation is reward hacking: since the reward model is a proxy for human preferences trained on finite data, the policy can find ways to exploit its imperfections, generating outputs that score highly on the reward model but are degenerate or low-quality. The KL penalty is the primary mitigation mechanism. Direct Preference Optimization (DPO, Rafailov et al., 2023) was proposed as a mathematically equivalent simplification of RLHF that eliminates the explicit reward model and RL training loop, replacing them with a single supervised loss directly on preference pairs.
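A minimal sketch of the two objectives named above (the Bradley-Terry reward-model loss and the KL-penalized PPO reward) in PyTorch; the tensor names and the β value are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry objective: loss = -E[log sigmoid(r(x, y_w) - r(x, y_l))]
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

def kl_penalized_reward(reward: torch.Tensor,
                        logprob_policy: torch.Tensor,
                        logprob_ref: torch.Tensor,
                        beta: float = 0.1) -> torch.Tensor:
    # Objective(x, y) = r_phi(x, y) - beta * KL(pi_theta || pi_SFT), with the KL term
    # estimated per sample from the summed token log-probabilities of the two policies.
    kl_estimate = logprob_policy - logprob_ref
    return reward - beta * kl_estimate
```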

Instruction Tuning

Instruction Tuning (also called instruction fine-tuning or supervised fine-tuning, SFT) is a post-pretraining technique for language models. A pretrained model is fine-tuned on a curated dataset of examples, where each example consists of a natural language instruction describing a task, an optional input context, and the expected output. The training objective is standard supervised learning: cross-entropy loss over the target output tokens, with the loss masked on the instruction and input portions.

The key finding, established by Wei et al. (2021) in the FLAN paper, is that training on a sufficiently large and diverse set of instruction-formatted tasks improves zero-shot generalization to unseen task types. This generalization scales with the number of task clusters and with model size.

Instruction Tuning is distinct from RLHF (Reinforcement Learning from Human Feedback): it uses only supervised learning on demonstration data, without a reward model or RL optimization. In practice, instruction tuning is often the first stage in a post-training pipeline, followed optionally by RLHF or Direct Preference Optimization (DPO). Common dataset formats include the Alpaca three-field format (instruction, input, output) and the multi-turn conversation format used in chat models.
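A minimal sketch of the Alpaca three-field format, showing how one (instruction, input, output) record becomes a prompt/target pair; the template wording follows the commonly published Alpaca prompt and is illustrative:

```python
def format_alpaca(example: dict) -> tuple[str, str]:
    """Return (prompt, target); the loss is later masked on the prompt portion."""
    if example.get("input"):
        prompt = (
            "Below is an instruction that describes a task, paired with an input that "
            "provides further context. Write a response that appropriately completes "
            "the request.\n\n"
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n### Response:\n"
        )
    else:
        prompt = (
            "Below is an instruction that describes a task. Write a response that "
            "appropriately completes the request.\n\n"
            f"### Instruction:\n{example['instruction']}\n\n### Response:\n"
        )
    return prompt, example["output"]

# Illustrative usage with a made-up record.
prompt, target = format_alpaca({
    "instruction": "Summarize the text.",
    "input": "SFT adapts pretrained models using labeled demonstrations.",
    "output": "SFT fine-tunes pretrained models on labeled demonstrations.",
})
```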
