Training

OPD (On-Policy Distillation)

2023 · Active · Published

Key innovation

Trains a student model on its own self-generated sequences using dense token-level feedback from a teacher model, eliminating the distribution mismatch inherent in off-policy knowledge distillation.

How it works

For each prompt x: (1) the student π_S samples a sequence ŷ ~ π_S(·|x) (on-policy rollout); (2) the teacher π_T scores ŷ, computing its next-token distribution π_T(·|x,ŷ_{<t}) at every position t; (3) the student minimizes the per-token divergence summed over positions, L = Σ_t KL(π_T(·|x,ŷ_{<t}) || π_S(·|x,ŷ_{<t})). The divergence can run in either direction: forward KL, KL(π_T || π_S), as written here, or reverse KL, KL(π_S || π_T), whose gradient on student samples resembles a policy gradient. Either way, OPD provides a training signal at every token, in contrast to the sparse outcome rewards typical of RL.
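The three steps can be sketched on a toy problem. This is a minimal illustration, not the GKD implementation: `toy_policy`, `rollout`, and `opd_loss` are hypothetical names, the "models" are fixed categorical distributions over a three-token vocabulary, and no gradient step is taken; only the loss is computed.

```python
import math
import random

VOCAB = ["a", "b", "<eos>"]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_policy(bias):
    """Hypothetical stand-in for a language model: returns next-token probabilities."""
    def next_token_probs(prompt, prefix):
        # The <eos> logit grows with prefix length so rollouts tend to terminate.
        return softmax([bias[0], bias[1], bias[2] + 0.5 * len(prefix)])
    return next_token_probs


def kl(p, q):
    """KL(p || q) for two categorical distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def rollout(policy, prompt, rng, max_len=8):
    """Step 1: sample a sequence from the policy itself (on-policy)."""
    prefix = []
    for _ in range(max_len):
        token = rng.choices(VOCAB, weights=policy(prompt, prefix), k=1)[0]
        prefix.append(token)
        if token == "<eos>":
            break
    return prefix


def opd_loss(student, teacher, prompt, y_hat, direction="forward"):
    """Steps 2 and 3: sum the per-token KL along the student's own rollout."""
    total = 0.0
    for t in range(len(y_hat)):
        prefix = y_hat[:t]
        p_teacher = teacher(prompt, prefix)
        p_student = student(prompt, prefix)
        if direction == "forward":
            total += kl(p_teacher, p_student)  # KL(pi_T || pi_S)
        else:
            total += kl(p_student, p_teacher)  # KL(pi_S || pi_T)
    return total


rng = random.Random(0)
student = toy_policy([1.0, 0.0, -1.0])
teacher = toy_policy([0.0, 2.0, -1.0])
y_hat = rollout(student, "x", rng)
forward_loss = opd_loss(student, teacher, "x", y_hat, "forward")
reverse_loss = opd_loss(student, teacher, "x", y_hat, "reverse")
print(y_hat, forward_loss, reverse_loss)
```

In a real setup the two callables would be neural language models and the loss would be backpropagated through the student only; the teacher stays frozen.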

Problem solved

Classical (off-policy) knowledge distillation trains the student on teacher-generated or static sequences, which causes exposure bias: at inference time the student conditions on its own outputs, including errors that never appeared in its training data. OPD addresses this by training the student on its own trajectories, removing the distribution mismatch between training and inference.

Components

On-Policy Rollout (student trajectory generation) · Source of training trajectories without exposure bias

The student generates a token sequence for a given prompt — training occurs on these self-generated sequences.

Teacher Language Model · Source of dense corrective signals

A stronger model that computes per-token log-probabilities for the student-generated tokens, providing a corrective signal at every position.

Official

KL Loss (Forward or Reverse) · Optimization objective: student-teacher divergence measure

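Forward KL, KL(π_T || π_S), is mode-covering: the student is pushed to place probability wherever the teacher does. Reverse KL, KL(π_S || π_T), is mode-seeking: the student is penalized for placing probability where the teacher does not, and on student samples its gradient resembles a policy gradient.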
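Written out per position t, the two directions look as follows. The vocabulary symbol V is an assumption introduced here; x, ŷ, π_S, and π_T are as used in the description of the method.

```latex
\begin{aligned}
\mathrm{KL}_t(\pi_T \,\|\, \pi_S) &= \sum_{v \in V} \pi_T(v \mid x, \hat{y}_{<t}) \, \log \frac{\pi_T(v \mid x, \hat{y}_{<t})}{\pi_S(v \mid x, \hat{y}_{<t})} \quad \text{(forward)} \\
\mathrm{KL}_t(\pi_S \,\|\, \pi_T) &= \sum_{v \in V} \pi_S(v \mid x, \hat{y}_{<t}) \, \log \frac{\pi_S(v \mid x, \hat{y}_{<t})}{\pi_T(v \mid x, \hat{y}_{<t})} \quad \text{(reverse)} \\
\mathrm{KL}_t(\pi_S \,\|\, \pi_T) &= \mathbb{E}_{v \sim \pi_S(\cdot \mid x, \hat{y}_{<t})} \big[ \log \pi_S(v \mid x, \hat{y}_{<t}) - \log \pi_T(v \mid x, \hat{y}_{<t}) \big]
\end{aligned}
```

The last line is why the reverse direction pairs naturally with on-policy rollouts: the sampled token ŷ_t gives a one-sample estimate of the expectation, so only the teacher's log-probability of that token is needed.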

Official

Implementation

Reference implementations

GKD (Generalized Knowledge Distillation) — original implementation

Python · Google DeepMind

Official

Implementation pitfalls

Weak teacher: incorrect corrective signals · High

If the teacher is not substantially stronger than the student, its signals may teach incorrect behaviors.

Fix: Use a teacher roughly 2–10× larger than the student, or filter trajectories by a quality threshold.
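The filtering option can be sketched as below. This is only an illustration under stated assumptions: `Trajectory`, `filter_by_quality`, and the precomputed `scores` table are hypothetical, and how a quality score is obtained (a verifier, a reference check, teacher confidence) is left open because the source does not specify it.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    prompt: str
    completion: str


def filter_by_quality(
    trajectories: List[Trajectory],
    score_fn: Callable[[Trajectory], float],
    threshold: float,
) -> List[Trajectory]:
    """Keep only trajectories whose quality score is at or above the threshold."""
    return [traj for traj in trajectories if score_fn(traj) >= threshold]


# Hypothetical precomputed quality scores, keyed by prompt (made-up values).
scores = {"p1": 0.9, "p2": 0.4, "p3": 0.75}
batch = [
    Trajectory("p1", "completion 1"),
    Trajectory("p2", "completion 2"),
    Trajectory("p3", "completion 3"),
]

kept = filter_by_quality(batch, lambda traj: scores[traj.prompt], threshold=0.7)
print([traj.prompt for traj in kept])
```

Only the kept trajectories would then be passed to the distillation loss; the threshold trades training-set size against signal reliability.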

Rock Tokens: stagnation of high-loss tokens · Medium

Research (Jiang et al. 2026) shows that ~18% of tokens keep a persistently high loss despite OPD training (so-called Rock Tokens), which wastes optimization effort.

Fix: Apply selective token weighting (skipping Rock Tokens) or SCOPE dual-path weighting.
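One possible shape for selective token weighting is sketched below. It is not the method of Jiang et al. or SCOPE: the detection rule (loss above `threshold` for `patience` consecutive updates), the class `RockTokenTracker`, and all numbers are assumptions chosen for illustration.

```python
class RockTokenTracker:
    """Flags tokens whose loss stays above `threshold` for `patience` consecutive updates."""

    def __init__(self, threshold, patience):
        self.threshold = threshold
        self.patience = patience
        self.high_streak = {}  # token key -> consecutive high-loss observations

    def update(self, key, loss):
        if loss > self.threshold:
            self.high_streak[key] = self.high_streak.get(key, 0) + 1
        else:
            self.high_streak[key] = 0

    def weight(self, key):
        # Flagged tokens are skipped (weight 0); all others keep full weight.
        return 0.0 if self.high_streak.get(key, 0) >= self.patience else 1.0


def weighted_loss(tracker, keyed_losses):
    """Average the per-token losses, ignoring tokens the tracker has flagged."""
    num = sum(tracker.weight(key) * loss for key, loss in keyed_losses)
    den = sum(tracker.weight(key) for key, _ in keyed_losses)
    return num / den if den > 0 else 0.0


tracker = RockTokenTracker(threshold=2.0, patience=2)
epochs = [
    [("tok_a", 1.0), ("tok_b", 5.0)],
    [("tok_a", 0.5), ("tok_b", 5.0)],
    [("tok_a", 0.2), ("tok_b", 5.0)],
]
epoch_losses = []
for keyed_losses in epochs:
    epoch_losses.append(weighted_loss(tracker, keyed_losses))
    for key, loss in keyed_losses:
        tracker.update(key, loss)
print(epoch_losses)
```

A softer variant would down-weight flagged tokens instead of dropping them, which keeps some signal in case the token later becomes learnable.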

Cascading tool errors in agents · High

Incorrect tool calls cascade through subsequent reasoning steps, progressively increasing student-teacher divergence (SOD paper, 2026).

Fix: Apply step-wise reweighting (SOD) or on-policy rollout restarts after erroneous steps.
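A rough sketch of step-wise reweighting follows. It does not reproduce the SOD formulation: the geometric `decay` rule, the function names, and the example losses are assumptions. The idea illustrated is that steps conditioned on an earlier failed tool call contribute less to the distillation loss.

```python
def stepwise_weights(step_ok, decay=0.5):
    """Weight each step by decay ** (number of failed steps strictly before it)."""
    weights = []
    failures_so_far = 0
    for ok in step_ok:
        weights.append(decay ** failures_so_far)
        if not ok:
            failures_so_far += 1
    return weights


def reweighted_loss(step_losses, step_ok, decay=0.5):
    """Weighted average of per-step distillation losses."""
    weights = stepwise_weights(step_ok, decay)
    return sum(w * l for w, l in zip(weights, step_losses)) / sum(weights)


# Hypothetical agent trajectory: the third tool call fails.
step_ok = [True, True, False, True, True]
step_losses = [0.2, 0.3, 1.0, 2.0, 3.0]  # made-up per-step divergences

weights = stepwise_weights(step_ok)
loss = reweighted_loss(step_losses, step_ok)
print(weights, loss)
```

The failed step itself keeps full weight in this sketch, since the student made that error from a clean context and the teacher's correction there is still informative; only the steps after it are discounted.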

Evolution

Original paper · 2023 · ICLR 2024 · Rishabh Agarwal

On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes

Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, Olivier Bachem

2023

GKD / On-Policy Distillation introduced (Agarwal et al., Google DeepMind)

Inflection point

The GKD paper (ICLR 2024) formalizes OPD as distillation on the student's own generated sequences and demonstrates its effectiveness on summarization, translation, and arithmetic reasoning.

2024

OPD popularized in reasoning model post-training

Wider adoption of OPD combined with RLHF and GRPO for mathematical reasoning models.

2025

OPD dominates as supplement to sparse RL rewards

Inflection point

Series of works (SCOPE, BRTS, AOPD, SOD, dGRPO) establish OPD as a standard component of hybrid LLM post-training frameworks alongside GRPO/PPO.

2026

OPD extended to multimodal agents and diffusion models

HyperEyes (multimodal agents), DiffusionOPD (text-to-image), SOD (small LLM agents) apply OPD to new domains.