Robots Atlas

Instruction Tuning

Fine-tuning a large pretrained language model on NLP tasks framed as natural-language instructions significantly improves zero-shot performance on unseen tasks.

Category · Abstraction level · Operation level
01

Instruction Dataset

Training dataset of (instruction, [input], expected output) examples covering diverse task types. Quality, variety, and number of tasks directly affect the model's generalization ability.

Modular

A curated collection of (instruction, optional input, output) triples covering diverse task types. The breadth of task clusters and template diversity are key factors in how well the tuned model generalizes to unseen tasks. Common formats include the Alpaca format and conversation-style chat templates.

  • Multi-task instruction dataset
  • Human demonstration dataset
  • Synthetic instruction dataset
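
To make the record structure concrete, here is a minimal sketch of a single training record in the Alpaca-style (instruction, input, output) format mentioned above; the field names follow the Alpaca convention, the content is invented, and the rendered prompt string is just one common way to serialize such a record.

    # Illustrative Alpaca-style instruction-tuning record (content is invented).
    example = {
        "instruction": "Classify the sentiment of the following movie review as positive or negative.",
        "input": "The plot was thin, but the performances kept me glued to the screen.",
        "output": "positive",
    }

    # One common way to render the record as a single training string.
    prompt = (
        "Below is an instruction that describes a task, paired with an input.\n\n"
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Input:\n{example['input']}\n\n"
        f"### Response:\n{example['output']}"
    )
    print(prompt)
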
02

Pretrained Base Model

Pre-trained language model fine-tuned on an instruction dataset. The quality and size of the base model determine the upper bound of instruction tuning effectiveness.

Modular

The pretrained language model (typically a causal LM or seq2seq model) that serves as the starting point for fine-tuning. The base model's scale is a key factor: instruction tuning has been shown to provide greater generalization benefits at larger model sizes (Wei et al. 2021).
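
As an illustration, a hedged sketch of loading such a base checkpoint with the Hugging Face transformers library; the model name is only a placeholder and any sufficiently capable pretrained causal LM could be substituted.

    # Sketch: load a pretrained causal LM and its tokenizer as the starting point for SFT.
    # The checkpoint name is a placeholder, not a recommendation.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base_model_name = "meta-llama/Meta-Llama-3-8B"
    tokenizer = AutoTokenizer.from_pretrained(base_model_name)
    model = AutoModelForCausalLM.from_pretrained(
        base_model_name,
        torch_dtype=torch.bfloat16,  # mixed precision is typical for fine-tuning
    )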

03

Supervised Fine-Tuning Objective

Training objective that minimizes cross-entropy loss on the response tokens, with the loss masked on instruction tokens so that only output-token predictions contribute to the gradient updates.

Standard token-level cross-entropy loss computed only on the target response tokens. Loss is masked (set to -100 or equivalent) on the instruction and input tokens, so the model only learns to predict the output. Training uses standard gradient descent with backpropagation through the model weights.
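
A minimal PyTorch-style sketch of this masked objective, assuming the prompt (instruction plus optional input) and the response have already been tokenized into ID tensors; the -100 label value matches the index that PyTorch's cross_entropy ignores.

    # Sketch: compute loss only on response tokens by masking prompt labels with -100.
    import torch
    import torch.nn.functional as F

    def build_sft_labels(prompt_ids, response_ids):
        """Concatenate prompt and response; mask prompt positions so they carry no loss."""
        input_ids = torch.cat([prompt_ids, response_ids])
        labels = input_ids.clone()
        labels[: prompt_ids.numel()] = -100  # ignored by cross_entropy below
        return input_ids, labels

    def sft_loss(logits, labels):
        """Standard next-token cross-entropy, restricted to unmasked (response) positions."""
        shift_logits = logits[:, :-1, :]   # predict token t+1 from tokens up to t
        shift_labels = labels[:, 1:]
        return F.cross_entropy(
            shift_logits.reshape(-1, shift_logits.size(-1)),
            shift_labels.reshape(-1),
            ignore_index=-100,             # prompt tokens contribute no gradient
        )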

04

Instruction Template

Text format or template that converts examples from original datasets into natural-language instruction form. Template diversity improves model generalization.

Modular

A natural language prompt template that frames each training example as an instruction. FLAN used 10 distinct templates per dataset. Templates typically include: a verb-based task description, the input context (if any), and a prompt for the output. Template diversity is an important factor in zero-shot generalization.
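
For illustration, a small sketch of such a template for a natural language inference task; the wording is loosely modeled on FLAN-style verbalization rather than copied from any official template.

    # Sketch: verbalize a raw NLI example as a natural-language instruction.
    # The template wording is illustrative, not an official FLAN template.
    def nli_template(premise, hypothesis):
        return (
            f"Premise: {premise}\n"
            f"Hypothesis: {hypothesis}\n"
            "Does the premise entail the hypothesis? Answer yes, no, or maybe."
        )

    print(nli_template(
        "A man is playing a guitar on stage.",
        "A musician is performing.",
    ))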

Parallelism

Fully parallel

Instruction tuning is a standard supervised fine-tuning procedure. Training examples are independent and can be processed in parallel across GPUs/TPUs using data parallelism and tensor parallelism. No sequential dependencies exist between training examples.

Number and diversity of task types

Critical
  • 62 datasets / 12 task clusters (FLAN 2021): the original FLAN paper covered 62 NLP datasets grouped into 12 task clusters
  • 1,836 tasks (Flan 2022 / Flan-T5): scaling instruction fine-tuning, Chung et al. 2022

The number and diversity of task types included in the instruction dataset. Ablation studies in Wei et al. (2021) and Chung et al. (2022) show that more task clusters systematically improve zero-shot generalization on unseen tasks.

Model scale

Critical
  • 8B parameters: common scale for open instruction-tuned models (e.g., Llama-3-8B-Instruct)
  • 137B parameters: scale used in the original FLAN experiments (LaMDA-PT)

The number of parameters in the pretrained base model. Wei et al. (2021) found that the generalization benefits of instruction tuning grow with model scale, while models of roughly 8B parameters and below showed little benefit on held-out tasks.

Number of training examples

Standard
  • ~13,000: InstructGPT SFT dataset (Ouyang et al. 2022)
  • ~52,000: Stanford Alpaca dataset
  • ~1,000,000+: large-scale instruction datasets (e.g., the FLAN mixture)

The total number of (instruction, output) examples used for fine-tuning. Instruction tuning can be effective with relatively small datasets (thousands to hundreds of thousands) compared to pretraining.

Chain-of-Thought Data Inclusion

Standard
  • No CoT: standard task-instruction pairs only
  • With CoT examples: a mix of standard instructions and step-by-step reasoning examples

Whether chain-of-thought (CoT) examples are included in the instruction tuning mixture. Chung et al. (2022) found that including CoT data significantly improves reasoning capabilities and zero-shot CoT performance without degrading other benchmarks.
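
For illustration, a hedged sketch contrasting a direct-answer record with a chain-of-thought record; the content is invented and the exact formatting of rationales varies across datasets.

    # Illustrative records: a direct-answer example vs. a chain-of-thought example.
    standard_example = {
        "instruction": "What is 17 * 6?",
        "output": "102",
    }

    cot_example = {
        "instruction": "What is 17 * 6? Let's think step by step.",
        "output": "17 * 6 = 17 * (5 + 1) = 85 + 17 = 102. The answer is 102.",
    }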

Learning rate

Standard
  • 1e-5 to 3e-5: typical range for full fine-tuning of large LLMs
  • 2e-4 to 1e-3: typical range for LoRA/PEFT-based instruction tuning

The step size for gradient updates during SFT. Instruction tuning typically uses smaller learning rates than pretraining to avoid catastrophic forgetting of pretrained knowledge.
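
A hedged sketch of how these hyperparameters might be expressed with Hugging Face TrainingArguments; the specific values are typical choices drawn from the ranges above, not recommendations for any particular model.

    # Sketch: typical SFT hyperparameters expressed as Hugging Face TrainingArguments.
    from transformers import TrainingArguments

    training_args = TrainingArguments(
        output_dir="sft-checkpoints",
        learning_rate=2e-5,                  # full fine-tuning range (1e-5 to 3e-5)
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,
        lr_scheduler_type="cosine",
        warmup_ratio=0.03,
        bf16=True,                           # mixed precision on modern GPUs
    )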

Common pitfalls

Catastrophic forgetting of pretrained knowledge
HIGH

Instruction tuning on a small or narrow dataset can cause the model to lose pretrained capabilities (e.g., in-context learning, reasoning on tasks not represented in the fine-tuning distribution). This is a well-documented risk of SFT on small, low-diversity datasets.

Use a sufficiently large and diverse instruction dataset covering many task types. Include a small proportion of pretraining-style data in the fine-tuning mixture (pretraining regularization, as in InstructGPT PPO-ptx). Use PEFT methods (LoRA) to limit parameter updates.
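
A minimal sketch of the PEFT route using the peft library's LoraConfig; the checkpoint name is a placeholder and the rank and target modules are common choices rather than prescriptions.

    # Sketch: wrap a causal LM with LoRA adapters so only a small set of weights is trained.
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")  # placeholder checkpoint
    lora_config = LoraConfig(
        r=16,                                  # adapter rank (common choice)
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],   # attention projections in Llama-style models
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()         # confirms only adapter weights are trainable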

Insufficient task diversity
HIGH

Training on too few task types or task clusters limits zero-shot generalization. Wei et al. (2021) ablation studies showed that generalization on unseen tasks consistently improves as the number of task clusters in training increases.

Include examples from as many distinct task types as possible. Use multiple prompt templates per task to increase template diversity. Include chain-of-thought examples when reasoning is required.

Low quality instruction data
CRITICAL

The quality of instruction-response pairs directly determines model behavior. Noisy, inconsistent, or factually incorrect responses in training data cause the model to learn undesirable behaviors including hallucination.

Use human-curated or carefully filtered instruction datasets. Apply quality filtering to synthetic datasets generated by LLMs. Prefer diverse, high-quality demonstrations over large quantities of lower-quality examples.

Incorrect loss masking on instruction tokens
MEDIUM

If loss is computed on both instruction and response tokens, the model learns to predict the instruction tokens too, which wastes model capacity and can produce worse instruction-following behavior. Loss should only be computed on the target response tokens.

Apply a loss mask (e.g., -100 label index) to all instruction/input tokens so that gradient updates only come from response token predictions.

Inconsistent instruction template formatting
MEDIUM

Inconsistent use of prompt templates, special tokens, or chat templates across training examples confuses the model and degrades instruction-following quality.

Choose a single prompt template consistent with the target model's expected format (e.g., the model's official chat template) and apply it uniformly across all training examples.
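
A sketch of rendering every training example through the target model's own chat template via the tokenizer's apply_chat_template method, assuming the chosen checkpoint ships such a template; the model name and message contents are placeholders.

    # Sketch: format a training example with the model's built-in chat template,
    # so every example uses the same special tokens and role markers.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")  # placeholder
    messages = [
        {"role": "user", "content": "Summarize the following paragraph in one sentence: ..."},
        {"role": "assistant", "content": "The paragraph argues that ..."},
    ]
    formatted = tokenizer.apply_chat_template(messages, tokenize=False)
    print(formatted)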

GENESIS · Source paper

Finetuned Language Models Are Zero-Shot Learners
2022 · ICLR 2022 · Jason Wei, Maarten Bosma, Vincent Y. Zhao et al.
2021

FLAN: first formal large-scale definition of instruction tuning (Wei et al.)

breakthrough

Wei et al. from Google Research published 'Finetuned Language Models Are Zero-Shot Learners' (arXiv September 2021, ICLR 2022), introducing the term 'instruction tuning' and demonstrating that fine-tuning a 137B model on 62 NLP tasks verbalized as natural language instructions substantially improves zero-shot performance on unseen tasks.

2022

InstructGPT: instruction tuning with human feedback (Ouyang et al., OpenAI)

breakthrough

Ouyang et al. (OpenAI) published 'Training language models to follow instructions with human feedback' (NeurIPS 2022), combining supervised instruction tuning with RLHF to produce InstructGPT models that are significantly preferred by human evaluators over base GPT-3 despite having far fewer parameters.

2022

Scaling instruction fine-tuning: Flan-T5 and Flan-PaLM (Chung et al.)

breakthrough

Chung et al. published 'Scaling Instruction-Finetuned Language Models' (arXiv October 2022), demonstrating that scaling the number of instruction tasks to 1,836, adding chain-of-thought data, and using mixed prompting strategies dramatically improves performance across PaLM and T5 model families. Flan-T5 checkpoints were publicly released.

2023

Stanford Alpaca: instruction tuning of open-source models using synthetic data

Stanford CRFM released Alpaca (March 2023), an instruction-tuned LLaMA 7B model trained on ~52k examples generated by GPT-3 text-davinci-003, demonstrating that effective instruction tuning is achievable with synthetic datasets at low cost for smaller open-source models.

GPU Tensor Cores · PRIMARY

Instruction tuning is computationally equivalent to supervised fine-tuning of a large language model. GPU Tensor Cores are the dominant hardware for this workload, supporting mixed-precision (BF16/FP16) matrix multiplications required for transformer fine-tuning.

Full fine-tuning of large models (≥7B parameters) requires multi-GPU setups with adequate VRAM. PEFT methods (LoRA, QLoRA) reduce VRAM requirements substantially, enabling fine-tuning on single consumer or data center GPUs.

TPU · GOOD

Large-scale instruction tuning experiments (e.g., Flan-PaLM 540B, Flan-T5) were conducted on TPU pods at Google. TPUs provide high throughput for large-scale SFT and are well-supported by JAX/Flax and PyTorch XLA.

TPU training requires frameworks with XLA support. Most open-source instruction tuning tooling (Hugging Face TRL, Axolotl) is primarily designed for GPU-based training.

Commonly used with

RLHF

Reinforcement Learning from Human Feedback (RLHF) is a multi-stage training pipeline used to align language models and other AI systems with human preferences and intent. The approach was formally introduced for deep RL in Christiano et al. (2017) and scaled to large language models in Ouyang et al. (2022) (InstructGPT), where it became the primary alignment technique for systems such as ChatGPT, Claude, and Gemini. The standard RLHF pipeline for LLMs consists of three sequential stages:

1. Supervised Fine-Tuning (SFT): a pretrained language model is fine-tuned on a curated dataset of high-quality (prompt, response) pairs produced by human annotators, yielding a base aligned policy π_SFT.

2. Reward Model Training: human annotators compare pairs of model responses to the same prompt and express preferences (which response is better). These pairwise comparisons are used to train a scalar reward model r_φ(x, y), typically with a Bradley-Terry preference objective: loss = -E[log σ(r(x, y_w) - r(x, y_l))], where y_w is the preferred and y_l the rejected response.

3. RL Fine-Tuning via PPO: the SFT-initialized policy π_θ is optimized with Proximal Policy Optimization (PPO) to maximize the reward from r_φ, subject to a KL-divergence penalty that prevents the policy from drifting too far from π_SFT: Objective(x, y) = r_φ(x, y) - β · KL(π_θ(y|x) || π_SFT(y|x)). The KL penalty with coefficient β is critical to prevent reward hacking.

During PPO training, four models are needed simultaneously: the active policy, a frozen reference policy (π_SFT), the reward model, and a value/critic network. This makes RLHF computationally expensive, requiring substantial GPU memory.

A key limitation is reward hacking: since the reward model is a proxy for human preferences trained on finite data, the policy can find ways to exploit its imperfections, generating outputs that score highly on the reward model but are degenerate or low-quality. The KL penalty is the primary mitigation mechanism.

Direct Preference Optimization (DPO, Rafailov et al., 2023) was proposed as a mathematically equivalent simplification of RLHF that eliminates the explicit reward model and RL training loop, replacing them with a single supervised loss computed directly on preference pairs.
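
To make the reward-model stage concrete, a minimal PyTorch sketch of the Bradley-Terry pairwise loss written above, assuming scalar reward scores for the chosen and rejected responses have already been computed; the numbers in the usage line are arbitrary.

    # Sketch: Bradley-Terry pairwise loss for reward model training.
    # r_chosen / r_rejected hold the scalar rewards r(x, y_w) and r(x, y_l) for each pair.
    import torch
    import torch.nn.functional as F

    def reward_model_loss(r_chosen, r_rejected):
        # loss = -E[log sigmoid(r(x, y_w) - r(x, y_l))]
        return -F.logsigmoid(r_chosen - r_rejected).mean()

    loss = reward_model_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.4, 0.9]))
    print(loss)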
