The GDPO pipeline has four stages.
Stage 1: Rollout. Policy πθ generates a group of K candidates {y1, …, yK} for input x.
Stage 2: Per-reward standardisation. For each reward R_j and each candidate y_i, a normalised scalar r'_ji = (R_j(y_i) − μ_j) / (σ_j + ε) is computed, where μ_j and σ_j are computed independently for each reward channel within the group.
Stage 3: Sum and batch-normalise. Normalised rewards are summed per candidate (s_i = Σ_j r'_ji), then a second group-relative normalisation is applied to the summed signal: A_i = (s_i − μ_s) / (σ_s + ε).
Stage 4: PPO-style policy update. The model is updated with the clipped surrogate objective L = −E[min(ratio_i × A_i, clip(ratio_i, 1 − ε_clip, 1 + ε_clip) × A_i)] + β × KL(πθ || πref), where ratio_i = πθ(y_i|x) / πθ_old(y_i|x) is the importance sampling ratio against the policy that generated the rollouts, ε_clip is the clip range, and πref is the frozen reference policy used for KL regularisation.
Standard GRPO (and PPO-based RLHF before it) was designed around a single scalar reward. When a modern RL pipeline has to optimise several heterogeneous rewards at once (e.g. correctness + format + length), summing them and then normalising the sum collapses distinct reward combinations onto identical advantage values. This lowers the resolution of the training signal and can cause suboptimal convergence and, in some cases, early training failure. GDPO addresses this with decoupled normalisation: each reward is standardised separately before the rewards are summed.
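The collapse described above can be reproduced on a toy group. The sketch below is illustrative only: the reward values, the two reward names and the stabiliser EPS are assumptions, and it compares the GRPO-style "sum, then normalise" signal with the decoupled "standardise each reward, then sum" signal before any second normalisation.

```python
import numpy as np

EPS = 1e-6  # assumed numerical stabiliser, not a value from the source

def standardise(x, axis=0):
    """Zero-mean, unit-variance scaling along the given axis."""
    mu = x.mean(axis=axis, keepdims=True)
    sigma = x.std(axis=axis, keepdims=True)
    return (x - mu) / (sigma + EPS)

# Rows are K = 4 candidates, columns are two hypothetical binary rewards
# (say correctness and format).
rewards = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [0.0, 1.0],
                    [0.0, 0.0]])

# GRPO-style: sum the rewards first, then normalise the sum within the group.
grpo_adv = standardise(rewards.sum(axis=1))

# Decoupled: standardise each reward column within the group, then sum.
gdpo_sum = standardise(rewards, axis=0).sum(axis=1)

print("sum-then-normalise:", np.round(grpo_adv, 3))
print("normalise-then-sum:", np.round(gdpo_sum, 3))
```

Candidates 0 and 1 earn the same total reward through different channels, so the summed signal cannot tell them apart, while the decoupled signal can because the two rewards have different within-group statistics.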
For each reward R_j, μ_j and σ_j are computed independently within a group of K candidates. Each raw reward is then standardised: r'_ji = (R_j(y_i) − μ_j) / (σ_j + ε). This is the key difference from GRPO, where all rewards are summed before any normalisation.
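A minimal sketch of this step, assuming the rewards for one group are stored as a (K, J) array with one column per reward channel. The function name, the example values and the default eps are illustrative assumptions, and the population standard deviation is used because the text does not specify which estimator applies.

```python
import numpy as np

def per_reward_standardise(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Compute r'_ji = (R_j(y_i) - mu_j) / (sigma_j + eps) per reward column.

    rewards has shape (K, J): K candidates in one group, J reward channels.
    """
    mu = rewards.mean(axis=0, keepdims=True)     # mu_j, shape (1, J)
    sigma = rewards.std(axis=0, keepdims=True)   # sigma_j, shape (1, J)
    return (rewards - mu) / (sigma + eps)

# Hypothetical group of 4 candidates with rewards on very different scales:
# a binary correctness flag, a format score in [0, 1], a length in tokens.
raw = np.array([[1.0, 0.2, 120.0],
                [0.0, 0.9, 480.0],
                [1.0, 0.5, 300.0],
                [0.0, 0.4, 260.0]])

normalised = per_reward_standardise(raw)
print(np.round(normalised, 3))
```

After this step every column has roughly zero mean and unit variance within the group, regardless of the original scale of the reward.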
After per-reward normalisation, the normalised values are summed per candidate: s_i = Σ_j r'_ji. A second normalisation is then applied — group-relative batch normalisation on the summed signal: A_i = (s_i − μ_s) / (σ_s + ε). The resulting A_i is the advantage used in the policy update.
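The summation and second normalisation can be sketched as follows. The input is assumed to be an already standardised (K, J) array; the function name, the example values and eps are illustrative, and the statistics μ_s and σ_s are taken over the group as the text describes.

```python
import numpy as np

def gdpo_advantages(normalised: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Sum standardised rewards per candidate, then renormalise the sums.

    normalised has shape (K, J); the result A has shape (K,).
    """
    s = normalised.sum(axis=1)                # s_i = sum_j r'_ji
    return (s - s.mean()) / (s.std() + eps)   # A_i = (s_i - mu_s) / (sigma_s + eps)

# Hypothetical standardised rewards for K = 4 candidates and J = 2 channels.
r_prime = np.array([[ 1.0, -1.2],
                    [-1.0,  0.3],
                    [ 1.0,  1.4],
                    [-1.0, -0.5]])

adv = gdpo_advantages(r_prime)
print(np.round(adv, 3))
```

The second normalisation is an affine map of s_i, so it preserves the ranking of candidates and only rescales the signal to a stable range.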
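Standard PPO/GRPO policy update with importance sampling clipping and KL regularisation against a frozen reference policy πref: L = −E[min(ratio_i × A_i, clip(ratio_i, 1 − ε_clip, 1 + ε_clip) × A_i)] + β × KL(πθ || πref), where ratio_i = πθ(y_i|x) / πθ_old(y_i|x) is taken against the policy that sampled the rollouts rather than against πref, and ε_clip is the clip range (distinct from the numerical-stability ε used in the normalisation steps). The advantage multiplies both terms inside the min so that clipping also behaves correctly when A_i is negative. This layer is unchanged from GRPO; the difference lies solely in how A_i is computed.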
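A forward-only sketch of a clipped loss of this kind is shown below. It works on sequence-level log-probabilities for simplicity, the function name and default values of clip_eps and beta are assumptions, and the KL term uses the non-negative estimator exp(d) − d − 1 with d = log πref − log πθ, which is one common choice rather than something the text prescribes.

```python
import numpy as np

def clipped_policy_loss(logp_new, logp_old, logp_ref, adv,
                        clip_eps=0.2, beta=0.04):
    """PPO-style clipped surrogate plus a KL penalty (no gradients computed).

    logp_new: log pi_theta(y_i | x) under the current policy
    logp_old: log-probabilities under the policy that sampled the rollouts
    logp_ref: log-probabilities under the frozen reference policy
    adv:      advantages A_i
    """
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    surrogate = np.minimum(unclipped, clipped)

    d = logp_ref - logp_new
    kl = np.exp(d) - d - 1.0   # assumed estimator, always >= 0

    return float(-surrogate.mean() + beta * kl.mean())

zero = np.array([0.0])
double = np.array([np.log(2.0)])   # ratio = 2

# Positive advantage: the gain is capped at (1 + clip_eps) * A.
loss_pos = clipped_policy_loss(double, zero, zero, np.array([1.0]), beta=0.0)
# Negative advantage: the min keeps the unclipped, more pessimistic term.
loss_neg = clipped_policy_loss(double, zero, zero, np.array([-1.0]), beta=0.0)
print(loss_pos, loss_neg)
```

In a real training loop the same expression would be written in an autodiff framework so that gradients flow through logp_new.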
Optional extension for the constrained formulation in which some rewards are treated as constraints with target thresholds τ_c. The composite reward then becomes: R(y_i) = R_primary(y_i) − Σ_c λ_c(t) × ReLU(τ_c − R_c(y_i)), where λ_c(t) are PID-controlled time-varying Lagrange multipliers. This pattern is used in Kuaishou's RaG/SCRL.
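The composite reward can be sketched directly from the formula. The array shapes, the function name and the example numbers are assumptions for illustration; the multipliers are passed in as plain values because how λ_c(t) is updated is a separate concern.

```python
import numpy as np

def composite_reward(primary, constraint_rewards, thresholds, lambdas):
    """R(y_i) = R_primary(y_i) - sum_c lambda_c * ReLU(tau_c - R_c(y_i)).

    primary:            shape (K,)
    constraint_rewards: shape (K, C)
    thresholds:         shape (C,), the targets tau_c
    lambdas:            shape (C,), the current multipliers lambda_c(t)
    """
    shortfall = np.maximum(thresholds - constraint_rewards, 0.0)  # ReLU
    return primary - shortfall @ lambdas

# Hypothetical values: 2 candidates, 2 constraint rewards.
primary = np.array([1.0, 0.8])
constraints = np.array([[0.9, 0.2],
                        [0.4, 0.6]])
tau = np.array([0.5, 0.5])
lam = np.array([2.0, 1.0])

print(composite_reward(primary, constraints, tau, lam))
```

Only the shortfall below a threshold is penalised, so a candidate that exceeds every τ_c keeps its primary reward unchanged.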
Per-reward standardisation requires reasonable μ_j, σ_j statistics within a group. For small K (e.g. K=2, 4) the statistics are unstable — within-group variance can be zero or very small, leading to exploding advantage values or NaN.
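The zero-variance case is easy to reproduce. The sketch below assumes a group of K = 2 candidates that received the same reward and contrasts standardisation with and without the ε term; the function name and the ε value are illustrative.

```python
import numpy as np

def standardise(x, eps):
    """(x - mean) / (std + eps) over a single reward channel."""
    return (x - x.mean()) / (x.std() + eps)

tied = np.array([1.0, 1.0])   # K = 2, both candidates scored identically

# Without eps the expression is 0 / 0, which evaluates to NaN.
with np.errstate(invalid="ignore", divide="ignore"):
    no_eps = standardise(tied, eps=0.0)

# With eps the channel simply contributes nothing for this group.
with_eps = standardise(tied, eps=1e-6)

print(no_eps, with_eps)
```

A training loop may also want to log how often a reward channel has zero within-group variance, since such groups carry no learning signal for that channel.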
Per-reward standardisation removes information about relative magnitudes between different rewards. If one reward is an order of magnitude more important than another, GDPO has no way to express this — all are scaled to a similar range.
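The loss of magnitude information can be checked numerically. In this sketch (function name, reward values and eps are assumptions) one reward column is multiplied by 10 and the resulting advantages are compared with the unscaled case.

```python
import numpy as np

def gdpo_advantages(rewards, eps=1e-8):
    """Per-reward standardisation, summation, then a second normalisation."""
    r = (rewards - rewards.mean(axis=0)) / (rewards.std(axis=0) + eps)
    s = r.sum(axis=1)
    return (s - s.mean()) / (s.std() + eps)

# Hypothetical (K = 4, J = 2) reward matrix.
base = np.array([[1.0, 0.2],
                 [0.0, 0.9],
                 [1.0, 0.5],
                 [0.0, 0.4]])

# Make the first reward "ten times more important" by rescaling it.
scaled = base * np.array([10.0, 1.0])

adv_base = gdpo_advantages(base)
adv_scaled = gdpo_advantages(scaled)
print(np.round(adv_base, 4))
print(np.round(adv_scaled, 4))
```

Up to the tiny effect of eps, both calls return the same advantages, which is the scale invariance the paragraph above describes.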
For constrained problems (constraint thresholds + Lagrangian multipliers), naively applying GDPO without PID-controlled Lagrangian updates can lead to oscillation and overshoot — λ_c grows too aggressively after a constraint violation, then overcorrects.
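A generic PID-style multiplier update is sketched below to show how the proportional, integral and derivative terms interact. This is not the controller used in RaG/SCRL: the class name, the gains and the clamping at zero are assumptions chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class PIDMultiplier:
    """Hypothetical PID controller for one Lagrange multiplier lambda_c."""
    kp: float
    ki: float
    kd: float
    integral: float = 0.0
    prev_error: float = 0.0

    def update(self, threshold: float, observed: float) -> float:
        # Positive error means the constraint reward is below its target.
        error = threshold - observed
        # The accumulated violation is kept non-negative.
        self.integral = max(0.0, self.integral + error)
        # The derivative term reacts to the change in violation and damps
        # the response once the constraint starts to recover.
        derivative = error - self.prev_error
        self.prev_error = error
        lam = self.kp * error + self.ki * self.integral + self.kd * derivative
        return max(0.0, lam)   # multipliers stay non-negative

ctrl = PIDMultiplier(kp=1.0, ki=0.1, kd=0.5)
lam_violated = ctrl.update(threshold=0.5, observed=0.3)
lam_recovered = ctrl.update(threshold=0.5, observed=0.6)
print(lam_violated, lam_recovered)
```

With an integral-only update the multiplier would keep growing until the accumulated error is worked off, which is the overshoot pattern the paragraph above warns about.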
Schulman et al. introduce Proximal Policy Optimization, the basis for the on-policy RL methods later applied to LLMs, including GRPO and GDPO.
OpenAI publishes InstructGPT using RLHF (reward model + PPO) to align LLMs with human annotator preferences — the standard post-training pipeline.
DeepSeek introduces Group Relative Policy Optimization — eliminates the value model by using group-relative advantage normalisation. A critical precursor to GDPO.
Liu et al. (NVIDIA Tech Report, arXiv 2601.05242) introduce GDPO — a direct extension of GRPO with decoupled per-reward normalisation. Solves the collapse of different reward combinations to identical advantage values in multi-reward settings.
GDPO lands in ms-swift (modelscope) as an official rlhf_type option alongside GRPO, giving the community an open-source implementation for the Megatron training framework.
Kuaishou Technology uses GDPO as the core of Synergistic Cross-Domain Reward Learning (SCRL) in the production Recommendation-as-Generation paradigm. The first large-scale production deployment of GDPO (400M+ DAU).
Number of candidates K in a group rollout. Larger K = better per-reward statistics, but higher compute cost.
Weight of KL regularisation against πref. Retained from PPO/GRPO. Larger β = slower drift from the reference policy.
PPO-style clip epsilon for the importance sampling ratio. Retained from PPO.
Optional (only for the constrained formulation). Target thresholds for each constraint reward. RaG calibrates them against the SFT baseline distribution as τ_c = μ_c + k_c × σ_c.
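The calibration rule τ_c = μ_c + k_c × σ_c can be sketched as below, assuming the constraint rewards of the SFT baseline are available as an (N, C) array of samples. The function name, the sample values and the choice of k_c are illustrative assumptions.

```python
import numpy as np

def calibrate_thresholds(baseline_rewards: np.ndarray, k: np.ndarray) -> np.ndarray:
    """tau_c = mu_c + k_c * sigma_c, one threshold per constraint reward.

    baseline_rewards: shape (N, C), constraint rewards of baseline samples
    k:                shape (C,), how many standard deviations above the mean
    """
    mu = baseline_rewards.mean(axis=0)
    sigma = baseline_rewards.std(axis=0)
    return mu + k * sigma

# Hypothetical baseline: 3 samples, 2 constraint rewards.
baseline = np.array([[0.2, 0.5],
                     [0.4, 0.7],
                     [0.6, 0.9]])
k = np.array([1.0, 0.0])   # assumed values, not taken from RaG

tau = calibrate_thresholds(baseline, k)
print(np.round(tau, 4))
```

Setting k_c = 0 asks the policy to match the baseline mean on that constraint, while larger k_c demands an improvement over it.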
GDPO is an optimisation algorithm, not a model architecture modification. All reward channels are always active and normalised independently — there is no conditional routing or sparse activation.
Per-reward normalisation and summation are lightweight statistical operations, fully parallelised within a group of candidates (on GPU) and across multiple groups (across devices). The bottleneck remains the rollout phase (forward passes of the policy for K candidates), not the GDPO optimisation itself.
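To illustrate why this step is cheap, the sketch below computes advantages for several groups at once with a handful of array reductions. The (G, K, J) layout, the function name and the random inputs are assumptions; the same pattern carries over to GPU array libraries.

```python
import numpy as np

def gdpo_advantages_batched(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Vectorised advantages for G groups of K candidates with J rewards.

    rewards has shape (G, K, J); the result has shape (G, K).
    """
    # Per-reward statistics are taken over the candidate axis of each group.
    mu = rewards.mean(axis=1, keepdims=True)
    sigma = rewards.std(axis=1, keepdims=True)
    s = ((rewards - mu) / (sigma + eps)).sum(axis=2)   # (G, K)

    # Second normalisation of the summed signal, again within each group.
    s_mu = s.mean(axis=1, keepdims=True)
    s_sigma = s.std(axis=1, keepdims=True)
    return (s - s_mu) / (s_sigma + eps)

rng = np.random.default_rng(0)
rewards = rng.random((3, 8, 2))   # 3 groups, 8 candidates, 2 rewards
adv = gdpo_advantages_batched(rewards)
print(adv.shape)
```

The whole computation is a few element-wise operations and reductions over small arrays, which is negligible next to generating K rollouts per prompt.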
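GDPO is a drop-in replacement for the advantage computation in GRPO: the added work consists of element-wise arithmetic and reduce operations (means and standard deviations) on GPU, with no new matrix multiplications. NVIDIA developed the method with its own training stack (Megatron) in mind.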
The algorithm is hardware-agnostic — it can be implemented on any hardware supporting standard policy optimisation operations (TPU, Habana, AMD MI300 etc.).