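Forward KL, KL(π_T || π_S), gives SFT-like matching of the teacher distribution and is mass-covering (the student tends to keep higher entropy); reverse KL, KL(π_S || π_T), gives an RL-like policy gradient on the student's own samples and is mode-seeking (the student tends to concentrate probability mass, lowering entropy).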
Number of teacher trajectories generated per prompt (Best-of-N in BRTS). N=1 is standard OPD; N>1 reduces the variance of the training signal.
Time complexity: O(T · L) per batch step.
Teacher signal is computed for every token of the generated sequence.
Student rollout is sequential (autoregressive), but teacher evaluation can be parallelized.
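The asymmetry between rollout and scoring can be sketched as follows. This is a minimal illustration, not a real inference stack: `student_next_token` and `teacher_logprob` are hypothetical toy stand-ins for model calls, and the thread pool only demonstrates that the teacher's per-position scores do not depend on each other. In a causal transformer the same effect is normally obtained with a single teacher-forced forward pass over the whole sequence.

```python
from concurrent.futures import ThreadPoolExecutor

def student_next_token(prefix):
    # Hypothetical stand-in for sampling from the student (deterministic toy rule).
    return (sum(prefix) + len(prefix)) % 5

def teacher_logprob(prefix, token):
    # Hypothetical stand-in for the teacher's log-probability of `token` given `prefix`.
    return -0.1 * ((sum(prefix) + token) % 5 + 1)

def rollout(prompt, num_new_tokens):
    # Sequential: each new token depends on all previously generated tokens.
    seq = list(prompt)
    for _ in range(num_new_tokens):
        seq.append(student_next_token(seq))
    return seq

def score_with_teacher(prompt, seq):
    # Parallelizable: once the sequence is fixed, every prefix is already known,
    # so each position can be scored independently of the others.
    positions = range(len(prompt), len(seq))
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda i: teacher_logprob(seq[:i], seq[i]), positions))

prompt = [1, 2]
seq = rollout(prompt, 4)
scores = score_with_teacher(prompt, seq)
print(seq, scores)
```

The teacher returns one score per generated token, which matches the statement that the teacher signal is computed for every token of the generated sequence.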
OPD requires running inference on the student (rollout) and the teacher (KL supervision) at the same time; the teacher is typically a large model (100B+) hosted on a separate GPU node from the student.
For each prompt x: (1) the student π_S generates a sequence ŷ ~ π_S(·|x) (on-policy rollout); (2) the teacher π_T computes per-token log-probabilities at every position t of ŷ; (3) the student minimizes a KL loss summed over tokens, written here in the forward direction: L = Σ_t KL(π_T(·|x,ŷ_{<t}) || π_S(·|x,ŷ_{<t})). The KL direction can be forward, KL(π_T || π_S), which pulls the student toward the teacher distribution, or reverse, KL(π_S || π_T), which is an expectation over the student's own distribution and therefore gives a policy-gradient effect. OPD provides a dense signal at every token, in contrast to the sparse outcome rewards used in RL.
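The three steps can be sketched in plain Python. This is a toy illustration under stated assumptions, not a training implementation: `student_logits` and `teacher_logits` are hypothetical hand-written functions standing in for neural networks, the vocabulary has 4 tokens, and the code only evaluates the loss (no gradient step).

```python
import math
import random

VOCAB_SIZE = 4  # assumed toy vocabulary size

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def student_logits(prefix):
    # Hypothetical student: next-token logits as a toy function of the prefix.
    return [0.5 * ((len(prefix) + v) % 3) for v in range(VOCAB_SIZE)]

def teacher_logits(prefix):
    # Hypothetical teacher: a different toy function of the same prefix.
    return [1.0 * ((sum(prefix) + v) % 4) for v in range(VOCAB_SIZE)]

def opd_loss(prompt, max_new_tokens, direction="reverse", seed=0):
    """Sum of per-token KL terms along a sequence sampled from the student."""
    rng = random.Random(seed)
    prefix = list(prompt)
    total = 0.0
    for _ in range(max_new_tokens):
        p_s = softmax(student_logits(prefix))
        # (2) the teacher scores the same student-generated prefix
        p_t = softmax(teacher_logits(prefix))
        # (3) per-token KL in the chosen direction
        total += kl(p_t, p_s) if direction == "forward" else kl(p_s, p_t)
        # (1) on-policy step: the next token is sampled from the student
        prefix.append(rng.choices(range(VOCAB_SIZE), weights=p_s)[0])
    return total

forward_loss = opd_loss([0, 1], 5, direction="forward")
reverse_loss = opd_loss([0, 1], 5, direction="reverse")
print(forward_loss, reverse_loss)
```

Because the prefix is always extended with student samples, both models are evaluated on exactly the states the student visits, which is what makes the signal on-policy.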
Classical (off-policy) knowledge distillation trains the student on teacher-generated or static sequences, which causes exposure bias: at inference time the student must continue from its own mistakes, which it never encountered during training. OPD resolves this by training the student on its own trajectories, removing the distribution mismatch between training and inference.
The student generates a token sequence for a given prompt — training occurs on these self-generated sequences.
A stronger model computes per-token log-probabilities for the student-generated tokens, providing a dense per-token signal.
Forward KL: KL(π_T || π_S), matches the teacher distribution (mass-covering); Reverse KL: KL(π_S || π_T), gives a policy-gradient effect on the student's own samples (mode-seeking).
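A small numeric comparison can make the two directions concrete. The distributions below are made-up values chosen for illustration: a bimodal teacher over 4 tokens, a broad student that spreads mass evenly, and a peaked student that commits to one teacher mode.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions with strictly positive q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

# Assumed toy distributions (not taken from any experiment).
teacher = [0.49, 0.01, 0.01, 0.49]         # two modes
broad_student = [0.25, 0.25, 0.25, 0.25]   # covers both modes
peaked_student = [0.94, 0.02, 0.02, 0.02]  # commits to one mode

for name, student in [("broad", broad_student), ("peaked", peaked_student)]:
    forward = kl(teacher, student)  # KL(pi_T || pi_S)
    reverse = kl(student, teacher)  # KL(pi_S || pi_T)
    print(f"{name}: forward={forward:.3f} reverse={reverse:.3f} "
          f"entropy={entropy(student):.3f}")
```

In this toy case the forward KL is smaller for the broad student, while the reverse KL is smaller for the peaked student, so the two objectives rank the same pair of students in opposite order.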
The GKD paper (ICLR 2024) formalizes OPD as distillation on the student's own generated sequences and demonstrates its effectiveness on summarization, translation, and arithmetic reasoning.
Wider adoption of OPD combined with RLHF and GRPO for mathematical reasoning models.
A series of works (SCOPE, BRTS, AOPD, SOD, dGRPO) establishes OPD as a standard component of hybrid LLM post-training frameworks alongside GRPO/PPO.
HyperEyes (multimodal agents), DiffusionOPD (text-to-image), SOD (small LLM agents) apply OPD to new domains.
If the teacher is not substantially stronger than the student, its signals may teach incorrect behaviors.
Research (Jiang et al. 2026) shows that ~18% of tokens keep a persistently high loss despite OPD training (so-called Rock Tokens), which wastes optimization bandwidth.
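One way to look for such tokens in a training log is sketched below. This is not the criterion used by Jiang et al.; the function `persistently_high_loss`, the fixed threshold, and the `min_fraction` parameter are all assumptions made for illustration, and the loss values are invented.

```python
def persistently_high_loss(loss_history, threshold, min_fraction=1.0):
    """Return token positions whose loss exceeds `threshold` in at least
    `min_fraction` of the recorded training steps.

    loss_history: list of per-step lists, one loss value per token position.
    """
    num_steps = len(loss_history)
    num_tokens = len(loss_history[0])
    flagged = []
    for pos in range(num_tokens):
        high_steps = sum(1 for step in loss_history if step[pos] > threshold)
        if high_steps / num_steps >= min_fraction:
            flagged.append(pos)
    return flagged

# Invented per-token losses at three checkpoints (rows) for four tokens (columns).
history = [
    [2.1, 0.9, 1.8, 0.4],
    [2.0, 0.5, 1.1, 0.3],
    [1.9, 0.2, 0.6, 0.2],
]
flagged = persistently_high_loss(history, threshold=1.5)
print(flagged)
```

Here only the first position stays above the threshold at every checkpoint; the third starts high but decreases, so it is not flagged under this assumed rule.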
Incorrect tool calls cascade through subsequent reasoning steps, progressively increasing student-teacher divergence (SOD paper, 2026).