The SFT dataset consists of (prompt p, response y) pairs. The loss is the negative log-likelihood of the response tokens, L = -sum_t log P(y_t | p, y_<t), where y_<t denotes the response tokens before position t. The model is trained by gradient descent on these pairs, typically with a small learning rate. Parameter-efficient techniques such as LoRA or QLoRA are often used to reduce memory and compute costs. Data may be written or curated by humans (e.g. Dolly, FLAN) or generated synthetically by a stronger model.
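The loss can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the per-token probabilities P(y_t | p, y_<t) have already been produced by the model; the function name `sft_loss`, the `response_mask` argument, and the example numbers are hypothetical and only serve to show that prompt positions are excluded from the sum.

```python
import math

def sft_loss(token_probs, response_mask):
    """Negative log-likelihood summed over response tokens only.

    token_probs[i]   : model probability of the i-th target token given all
                       preceding tokens (prompt and earlier response tokens)
    response_mask[i] : 1 if position i belongs to the response y,
                       0 if it belongs to the prompt p
    """
    if len(token_probs) != len(response_mask):
        raise ValueError("token_probs and response_mask must have equal length")
    return -sum(math.log(p) for p, m in zip(token_probs, response_mask) if m)

# Hypothetical sequence: two prompt positions followed by three response positions.
probs = [0.9, 0.8, 0.5, 0.25, 0.5]
mask = [0, 0, 1, 1, 1]
loss = sft_loss(probs, mask)  # equals -(log 0.5 + log 0.25 + log 0.5) = log 16
```

In a real training loop the same masking is usually applied to a cross-entropy loss over logits, but the quantity being minimized is the one shown here.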
Pre-trained models are good at text completion but not at following user instructions, answering questions in chat format, or generating safe and helpful responses.
SFT on a narrow dataset can cause the model to forget previously learned capabilities (catastrophic forgetting). Mitigate this with diverse datasets or regularization.
With too few examples or too many epochs, the model memorizes demonstrations rather than generalizing.
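One widely used guard against this kind of overfitting is early stopping on a held-out validation set. The notes do not prescribe it, so the sketch below is only an assumption about how one might pick the epoch to keep; the function name `best_epoch`, the `patience` value, and the loss values are illustrative.

```python
def best_epoch(val_losses, patience=2):
    """Return the index of the epoch with the lowest validation loss,
    stopping the scan once the loss has failed to improve for
    `patience` consecutive epochs."""
    best_idx = 0
    for i, loss in enumerate(val_losses):
        if loss < val_losses[best_idx]:
            best_idx = i          # new best checkpoint
        elif i - best_idx >= patience:
            break                 # no improvement for `patience` epochs
    return best_idx

# Hypothetical validation losses: improvement, then memorization sets in.
val_losses = [1.20, 0.95, 0.90, 0.93, 0.97, 1.05]
stop_at = best_epoch(val_losses)
```

Here the validation loss bottoms out at the third epoch (index 2) and rises afterwards, which is the typical signature of the model memorizing its demonstrations.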
Noisy, inconsistent, or biased SFT data is directly reflected in model behavior. Quality > quantity.
Radford et al. (GPT) and Devlin et al. (BERT) establish the pre-train/fine-tune paradigm.
Wei et al. (FLAN) show that fine-tuning on diverse instruction datasets improves zero-shot performance.
Ouyang et al. (InstructGPT) formalize SFT as the first stage of the RLHF pipeline, before reward modeling and PPO.
Hu et al. (LoRA) and Dettmers et al. (QLoRA) enable SFT on consumer hardware by training only low-rank adapters.
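The saving from low-rank adapters is easy to quantify. The sketch below assumes the standard LoRA formulation, in which a frozen weight matrix W of shape d_out x d_in receives a trainable update B @ A with B of shape d_out x r and A of shape r x d_in; the layer size 4096 and rank 8 are illustrative choices, not values from the notes.

```python
def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters for one weight matrix:
    full fine-tuning (all of W) versus a LoRA update B @ A."""
    full = d_out * d_in              # every entry of W is trainable
    lora = rank * (d_out + d_in)     # entries of B (d_out x r) plus A (r x d_in)
    return full, lora

# Hypothetical 4096 x 4096 projection with rank-8 adapters.
full, lora = lora_param_counts(4096, 4096, 8)
ratio = full // lora
```

For this layer the adapter trains 65,536 parameters instead of 16,777,216, a 256-fold reduction, which is why gradients and optimizer state shrink enough to fit on smaller GPUs.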
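Fine-tuning large models requires GPUs for gradient computation. As a rough guide, a single A100 80GB can handle 7B models when paired with memory-saving methods such as LoRA, QLoRA, or optimizer offloading, since full fine-tuning with Adam generally needs more memory than that; 70B+ models call for multi-GPU setups.
TPU v4/v5 pods are used by Google and other large organizations for SFT on 100B+ models, owing to their high HBM bandwidth.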
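A back-of-the-envelope estimate helps put these hardware figures in context. The sketch assumes full fine-tuning with mixed-precision Adam, which is commonly costed at about 16 bytes per parameter; the function name is hypothetical, and activations and framework overhead are deliberately left out.

```python
def full_finetune_state_gb(n_params, bytes_per_param=16):
    """Rough memory for weights, gradients, and Adam optimizer state.

    Assumed breakdown for mixed-precision Adam, per parameter:
    2 (fp16 weights) + 2 (fp16 gradients) + 4 (fp32 master weights)
    + 4 + 4 (fp32 Adam first and second moments) = 16 bytes.
    Activations and framework overhead are NOT included.
    """
    return n_params * bytes_per_param / 1e9

gb_7b = full_finetune_state_gb(7e9)    # about 112 GB
gb_70b = full_finetune_state_gb(70e9)  # about 1120 GB
```

Under this assumption the training state of a 7B model alone exceeds 80 GB, so single-GPU setups for 7B models usually depend on adapters, quantization, or offloading, and 70B+ models need their state sharded across many devices.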