Mathematical derivation (the paper's key contribution): RLHF maximises E[r(x,y)] - β·KL(π||π_ref), where r is the reward model, π_ref is the reference model (usually SFT), β is the KL-regularisation strength. The optimal policy of this maximisation has a closed form: π*(y|x) = (1/Z(x))·π_ref(y|x)·exp(r(x,y)/β). Inverting: r(x,y) = β·log(π*(y|x)/π_ref(y|x)) + β·log Z(x). Substituting this into the Bradley-Terry preference model P(y_w > y_l) = σ(r(x,y_w) - r(x,y_l)), Z(x) cancels and we get a DIRECT loss without a reward model:
L_DPO = -E_(x,y_w,y_l) [ log σ( β·log(π_θ(y_w|x)/π_ref(y_w|x)) - β·log(π_θ(y_l|x)/π_ref(y_l|x)) ) ]
Here π_θ is the trained policy (the LLM) and π_ref is the frozen SFT model. Training: standard backprop on batches of (prompt, chosen, rejected) triples; π_θ starts from π_ref; β sets how strongly departure from π_ref is penalised (typically 0.01–0.5). Inference: a plain LLM, no runtime overhead. Empirically, 1–3 epochs on high-quality preference data are enough, versus days of RL with PPO.
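The cancellation of Z(x) can be written out step by step. The block below is only a worked restatement of the formulas already given, using the same symbols; nothing in it goes beyond the closed form and the Bradley-Terry model stated above.

```latex
\begin{align*}
r(x,y_w) - r(x,y_l)
  &= \Big[\beta \log\frac{\pi^*(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} + \beta \log Z(x)\Big]
   - \Big[\beta \log\frac{\pi^*(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)} + \beta \log Z(x)\Big] \\
  &= \beta \log\frac{\pi^*(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}
   - \beta \log\frac{\pi^*(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)} \\[4pt]
P(y_w \succ y_l \mid x)
  &= \sigma\!\Big(\beta \log\frac{\pi^*(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}
   - \beta \log\frac{\pi^*(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\Big)
\end{align*}
```

Both rewards share the same prompt x, so the β·log Z(x) terms are identical and drop out of the difference. Replacing π* with the trainable π_θ and taking the negative log-likelihood of the observed preferences gives L_DPO.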
Classical RLHF has three steps: (1) supervised fine-tuning, (2) training a separate reward model on preference data, (3) optimising the LLM policy via PPO against that reward model. Each step adds cost, risks instability (PPO can be finicky), and accumulates errors: the reward model overfits and is exploited by the policy (reward hacking), and the RL optimisation itself requires careful tuning of the KL penalty and learning rate, plus ongoing quality monitoring. DPO removes steps (2) and (3): one supervised training run on (y_w, y_l) pairs replaces the whole RL pipeline, removes the explicit reward model as a target for reward hacking (the policy can still overfit the preference data), and dramatically lowers the entry barrier (a regular ML team instead of RL specialists).
The LLM updated by the DPO loss. Initialised from π_ref. After training it is the final product: it is served like any other LLM at inference, with no DPO-specific components.
Usually the SFT model, i.e. the starting version of π_θ before DPO. Stays frozen throughout training and serves as the denominator in the log-ratio: log(π_θ/π_ref). Implementation-wise: a frozen copy of the same model in eval mode with gradients disabled, which adds a second forward pass per batch.
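The log-ratio log(π_θ/π_ref) is computed per sequence: per-token log-probabilities are summed over the response tokens only, because both models are conditioned on the prompt x. A minimal sketch in plain Python follows; the function names, the mask layout, and the toy numbers are illustrative assumptions, not taken from any particular library.

```python
from typing import Sequence


def sequence_logprob(token_logprobs: Sequence[float], response_mask: Sequence[int]) -> float:
    """Sum per-token log-probabilities over response tokens only (mask == 1)."""
    if len(token_logprobs) != len(response_mask):
        raise ValueError("token_logprobs and response_mask must have the same length")
    return sum(lp for lp, m in zip(token_logprobs, response_mask) if m)


def log_ratio(
    policy_token_logprobs: Sequence[float],
    ref_token_logprobs: Sequence[float],
    response_mask: Sequence[int],
) -> float:
    """log pi_theta(y|x) - log pi_ref(y|x) for one (prompt, response) sequence."""
    return sequence_logprob(policy_token_logprobs, response_mask) - sequence_logprob(
        ref_token_logprobs, response_mask
    )


# Toy numbers (made up): 2 prompt tokens followed by 3 response tokens.
mask = [0, 0, 1, 1, 1]
policy_lp = [-1.0, -2.0, -0.5, -0.25, -0.25]
ref_lp = [-1.0, -2.0, -0.5, -0.5, -0.5]

example_log_ratio = log_ratio(policy_lp, ref_lp, mask)  # (-1.0) - (-1.5)
```

Because π_ref never changes, its sequence log-probabilities for a fixed dataset are constants and could be computed once and cached instead of being recomputed every step.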
Sigmoid binary cross-entropy on the difference of log-ratios for the (chosen, rejected) pair. The direct equivalent of a Bradley-Terry classifier, without a separate reward model.
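A minimal per-pair sketch of this loss, written in plain Python so the arithmetic is visible. It assumes the four sequence log-probabilities have already been computed; the function and argument names are hypothetical, and a real training loop would use a tensor library with autograd.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_pair_loss(
    policy_chosen_lp: float,
    policy_rejected_lp: float,
    ref_chosen_lp: float,
    ref_rejected_lp: float,
    beta: float = 0.1,
):
    """Return (loss, margin) for one (chosen, rejected) pair.

    margin is the difference of implicit rewards beta * log(pi_theta / pi_ref).
    """
    chosen_reward = beta * (policy_chosen_lp - ref_chosen_lp)
    rejected_reward = beta * (policy_rejected_lp - ref_rejected_lp)
    margin = chosen_reward - rejected_reward
    return -log_sigmoid(margin), margin


# At initialisation pi_theta equals pi_ref, so the margin is 0 and the loss is log 2.
loss_init, margin_init = dpo_pair_loss(-12.0, -15.0, -12.0, -15.0)

# After the policy raises the chosen and lowers the rejected log-prob by 1 nat each.
loss_later, margin_later = dpo_pair_loss(-11.0, -16.0, -12.0, -15.0)
```

The starting value of log 2 is a useful sanity check: a first-step loss far from it suggests that π_θ and π_ref are not the same checkpoint.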
A set of response pairs with preference labels — y_w (winning) preferred over y_l (losing) for a given prompt x. May come from humans, LLM-as-judge, or be synthetic.
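One way to represent such a dataset is a record per (prompt, chosen, rejected) triple, with a cheap filter for pairs that carry no preference signal. The sketch below is illustrative only: the `PreferencePair` class, the `drop_degenerate` helper, and the sample rows are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class PreferencePair:
    prompt: str    # x
    chosen: str    # y_w, the preferred response
    rejected: str  # y_l, the dispreferred response


def drop_degenerate(pairs: Iterable[PreferencePair]) -> List[PreferencePair]:
    """Drop pairs with an empty field or with identical chosen/rejected text."""
    kept = []
    for p in pairs:
        if not p.prompt.strip() or not p.chosen.strip() or not p.rejected.strip():
            continue
        if p.chosen.strip() == p.rejected.strip():
            continue
        kept.append(p)
    return kept


raw = [
    PreferencePair("Explain KL divergence.", "It measures how one distribution differs from another.", "idk"),
    PreferencePair("Say hello.", "Hello!", "Hello!"),  # identical responses
    PreferencePair("Summarise the text.", "", "A short summary."),  # empty chosen
]
clean = drop_degenerate(raw)
```

This only catches structural problems; it does not detect pairs whose quality difference is unclear, which still need a judge score or human review.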
DPO tolerates high learning rates much worse than SFT — π_θ quickly drifts from the π_ref distribution and the model loses coherence/generation quality. A frequently reported issue.
(y_w, y_l) pairs with unclear quality difference inject noise that DPO directly learns. "Noisy" preference data is worse for DPO than for RLHF (the reward model filters some noise).
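π_ref MUST be the starting model of π_θ (usually SFT), not a different checkpoint. Otherwise the log-ratio is already non-zero at initialisation, the KL regularisation anchors the policy to the wrong distribution, and training can break the model.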
DPO overfits much faster than SFT. After 3–5 epochs quality starts dropping even though loss keeps decreasing (overfitting in preference space).
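Since the training loss can keep falling while quality degrades, a common guard is to track a held-out metric and stop when it stalls. The sketch below assumes held-out preference accuracy (the share of pairs whose implicit reward margin is positive) as that metric, which is only a proxy for generation quality; the helper names and the patience value are hypothetical.

```python
from typing import List, Sequence


def preference_accuracy(margins: Sequence[float]) -> float:
    """Fraction of held-out pairs where chosen has the higher implicit reward."""
    if not margins:
        raise ValueError("need at least one margin")
    return sum(1 for m in margins if m > 0) / len(margins)


def should_stop(eval_history: List[float], patience: int = 2) -> bool:
    """True when the last `patience` evaluations did not beat the earlier best."""
    if len(eval_history) <= patience:
        return False
    best_before = max(eval_history[:-patience])
    return max(eval_history[-patience:]) <= best_before


acc = preference_accuracy([0.3, -0.1, 0.2, 0.05])  # 3 of 4 margins are positive

still_improving = should_stop([0.62, 0.68, 0.70])  # latest evaluations are the best so far
plateaued = should_stop([0.62, 0.68, 0.67, 0.66])  # no improvement in the last 2 evaluations
```

Sampling a few generations at each evaluation is a useful complement, since loss of coherence may not show up in pairwise accuracy.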
Proximal Policy Optimization — the RL algorithm that becomes the default RLHF optimiser.
OpenAI publishes the full RLHF pipeline (SFT → reward model → PPO) applied to GPT-3 (InstructGPT). The LLM alignment standard for the next 18 months.
Anthropic replaces human preference labels with LLM-as-judge. One legacy of this idea will later be synthetic preference data for DPO.
Rafailov, Sharma, Mitchell, Ermon, Manning, Finn publish DPO (arXiv:2305.18290, NeurIPS 2023). They prove a formal equivalence: the optimal policy of KL-constrained RLHF parametrises its own implicit reward model. The result — a single supervised loss replaces the whole RLHF pipeline.
Hugging Face Zephyr 7B and Allen AI Tulu 2 demonstrate that SFT+DPO yields RLHF-competitive alignment quality at a fraction of the cost. The open-source community adopts en masse.
Azar et al. (IPO) and Ethayarajh et al. (KTO) introduce variants addressing overfitting and the pair requirement. Hong et al. (ORPO) merge SFT+DPO into a single loss. Meng et al. (SimPO) drop π_ref. DPO becomes part of a broader "direct preference optimisation" field.
Meta Llama 3, Mistral, and Alibaba Qwen use DPO (or its variants) as the main chat-alignment mechanism. RLHF with PPO remains mainly in OpenAI/Anthropic internal pipelines.
Time complexity: O(2 · T · |θ|) per step (two forward passes per pair). Space complexity: O(2 · |θ|) parameters (π_θ + π_ref), or O(|θ| + |LoRA|) with an adapter.
For 7B+ models, keeping both π_θ and π_ref in VRAM at once is the main constraint. Hence the popularity of DPO+LoRA and variants without π_ref (SimPO, ORPO).
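A back-of-the-envelope calculation makes the constraint concrete. The sketch below counts weights only and assumes bf16 storage (2 bytes per parameter) and a hypothetical 40M-parameter LoRA adapter; gradients, optimiser state, and activations come on top and are not modelled here.

```python
BYTES_PER_PARAM_BF16 = 2  # assumption: weights stored in bf16


def weights_gb(n_params: float, bytes_per_param: int = BYTES_PER_PARAM_BF16) -> float:
    """Memory for model weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


n_params = 7e9      # a 7B model
lora_params = 40e6  # hypothetical adapter size

# Full DPO: separate copies of pi_theta and pi_ref.
full_dpo_gb = 2 * weights_gb(n_params)

# DPO + LoRA: one frozen base copy serves as pi_ref, base + adapter acts as pi_theta.
lora_dpo_gb = weights_gb(n_params) + weights_gb(lora_params)

print(f"full DPO weights: {full_dpo_gb:.2f} GB")
print(f"DPO + LoRA weights: {lora_dpo_gb:.2f} GB")
```

Under these assumptions the adapter route roughly halves the weight footprint, which matches the O(|θ| + |LoRA|) figure.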
Controls how far policy π_θ can drift from reference model π_ref. Too small β = drift from π_ref and quality degradation; too large = no alignment. Practical range 0.01–0.5; the original paper uses 0.1 by default and 0.5 for summarisation.
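The effect of β can be seen directly in the closed-form optimum π*(y|x) ∝ π_ref(y|x)·exp(r(x,y)/β) on a toy problem with three candidate responses. The reference probabilities and rewards below are made up for illustration, and the helper name is hypothetical.

```python
import math
from typing import Dict


def optimal_policy(ref: Dict[str, float], reward: Dict[str, float], beta: float) -> Dict[str, float]:
    """pi*(y) = pi_ref(y) * exp(r(y) / beta) / Z over a finite response set."""
    # Subtract the max exponent for numerical stability; it cancels in the normalisation.
    max_scaled = max(reward[y] / beta for y in ref)
    unnorm = {y: p * math.exp(reward[y] / beta - max_scaled) for y, p in ref.items()}
    z = sum(unnorm.values())
    return {y: w / z for y, w in unnorm.items()}


ref = {"a": 0.5, "b": 0.3, "c": 0.2}     # pi_ref prefers "a"
reward = {"a": 0.0, "b": 1.0, "c": 2.0}  # the reward prefers "c"

small_beta = optimal_policy(ref, reward, beta=0.1)   # weak regularisation
large_beta = optimal_policy(ref, reward, beta=10.0)  # strong regularisation

for name, pol in (("beta=0.1", small_beta), ("beta=10", large_beta)):
    print(name, {y: round(p, 3) for y, p in pol.items()})
```

With small β almost all probability mass moves to the highest-reward response; with large β the optimum stays close to π_ref.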
Frozen base model (usually SFT on the same instruction set). Determines the starting point and the KL-regularisation range. The choice of π_ref affects final quality more than β.
The single strongest factor in the outcome. (Chosen, rejected) pairs must have clearly different quality; noise in preferences hurts the model. Typical datasets: Anthropic HH-RLHF, UltraFeedback, Nectar, OpenAssistant.
DPO overfits faster than SFT — typically 1–3 epochs suffice. The paper recommends 1 epoch on large datasets.
Drastically lower than in SFT — typically 1e-7 to 5e-6. Too high = π_θ drifts from π_ref and loses generative capability (mode collapse).
After DPO a family of variants emerged: vanilla DPO (sigmoid), IPO (regularised against overfitting), KTO (pointwise preferences instead of pairs), ORPO (joint SFT+DPO), SimPO (without π_ref).
DPO modifies only the training loss function, not the model structure. The π_θ policy remains a dense network; π_ref is frozen but also dense.
The DPO loss is a standard supervised cross-entropy with two forward passes (chosen + rejected) per pair. It scales like SFT via DDP/FSDP, ZeRO, tensor parallelism — without special RL constraints. Inference after DPO is a plain LLM forward pass.
Standard supervised training on GPU — same profile as SFT. Two forward passes (chosen + rejected) per pair; π_ref is frozen, so only inference for it.
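TRL and Axolotl are PyTorch-based, so running them on TPU depends on PyTorch/XLA rather than JAX, and that path should be treated as less well-trodden than GPU. With no specific RL requirements, porting the DPO loss to a JAX training stack is straightforward.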
DPO is a pure loss modification — works on any hardware supporting SFT.