Alignment

GDPO

2026 · Active · Published: 25 June 2026 · Updated: 25 June 2026

Key innovation

Decoupled per-reward normalization in multi-reward RL: each reward is standardised separately before summation, and only then is group-relative batch normalization applied. Eliminates the collapse of different reward combinations to identical advantage values that affects GRPO in multi-reward settings.

How it works

The GDPO pipeline has four stages.

Stage 1: Rollout. The policy πθ generates a group of K candidates {y1, …, yK} for input x.

Stage 2: Per-reward standardisation. For each reward R_j and each candidate y_i, a normalised scalar r'_ji = (R_j(y_i) − μ_j) / (σ_j + ε) is computed, where μ_j and σ_j are computed independently for each reward channel within the group.

Stage 3: Sum and batch-normalise. Normalised rewards are summed per candidate (s_i = Σ_j r'_ji), then a second group-relative normalisation is applied to the summed signal: A_i = (s_i − μ_s) / (σ_s + ε).

Stage 4: PPO-style policy update. The model is updated with the standard clipped objective L = −E[min(ρ_i × A_i, clip(ρ_i, 1 − ε_clip, 1 + ε_clip) × A_i)] + β × KL(πθ || πref), where ρ_i = πθ(y_i|x) / πθ_old(y_i|x) is the importance ratio against the policy that generated the rollouts, ε_clip is the clipping range (distinct from the stability ε above), and πref is the frozen reference policy used only in the KL term.
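Stages 2 and 3 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the reference implementation: the function name `gdpo_advantages`, the `(K, J)` array layout, population standard deviation, and the sample reward values are all choices made here for illustration.

```python
import numpy as np

def gdpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Compute advantages for one prompt.

    rewards: shape (K, J), K candidates scored by J reward channels
             (layout is an assumption of this sketch).
    Returns: shape (K,), one advantage per candidate.
    """
    # Stage 2: standardise every reward channel separately across the group.
    mu_j = rewards.mean(axis=0, keepdims=True)
    sigma_j = rewards.std(axis=0, keepdims=True)
    normalised = (rewards - mu_j) / (sigma_j + eps)

    # Stage 3a: sum the normalised channels per candidate.
    s = normalised.sum(axis=1)

    # Stage 3b: second normalisation of the summed signal.
    return (s - s.mean()) / (s.std() + eps)

# Hypothetical scores: columns are (correctness, format, length).
rewards = np.array([
    [1.0, 0.2, 120.0],
    [0.0, 0.9,  80.0],
    [1.0, 0.5, 200.0],
    [0.0, 0.1, 150.0],
])
adv = gdpo_advantages(rewards)
```

Because each channel is standardised before the sum, the length column (values in the hundreds) does not dominate the two columns that live in [0, 1].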

Problem solved

Standard GRPO (and before it PPO/RLHF) was designed for a single scalar reward. When modern RL pipelines need to optimise multiple heterogeneous rewards simultaneously (e.g. correctness + format + length), naively summing the rewards and then normalising the sum collapses different reward combinations to identical advantage values. This lowers the resolution of the training signal, which leads to suboptimal convergence and sometimes to early training failure. GDPO solves this with decoupled normalisation: each reward is standardised separately before being summed.
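The collapse described above can be reproduced on a toy group. The sketch below assumes two binary rewards and four candidates (values invented for illustration) and compares sum-then-normalise with normalise-then-sum; the helper `standardise` is a name introduced here.

```python
import numpy as np

EPS = 1e-6

def standardise(x: np.ndarray, axis: int = 0) -> np.ndarray:
    """Subtract the mean and divide by (std + EPS) along the given axis."""
    mu = x.mean(axis=axis, keepdims=True)
    sigma = x.std(axis=axis, keepdims=True)
    return (x - mu) / (sigma + EPS)

# Rows are candidates, columns are (correctness, format); hypothetical values.
rewards = np.array([
    [1.0, 0.0],   # correct, badly formatted
    [0.0, 1.0],   # wrong, well formatted
    [0.0, 1.0],   # wrong, well formatted
    [0.0, 0.0],   # wrong, badly formatted
])

# GRPO-style: sum the raw rewards first, then normalise once.
grpo_adv = standardise(rewards.sum(axis=1))

# GDPO-style: normalise each channel, sum, then normalise again.
gdpo_adv = standardise(standardise(rewards, axis=0).sum(axis=1))
```

In this group the first three candidates all have a raw sum of 1, so the sum-then-normalise path gives them the same advantage. With per-reward standardisation the only correct candidate receives a larger advantage than the two that are merely well formatted, because correctness is the rarer signal in this group.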

Key mechanisms

Per-reward standardisation — independent computation of μ_j and σ_j for each reward within a group of candidates

Decoupled sum — rewards are summed ONLY after per-reward normalisation (the key difference from GRPO)

Group-relative batch normalisation — a second normalisation of the summed advantage signal against group statistics

Preservation of fine-grained reward combination differences — different reward combinations yield different advantages (vs collapse in GRPO)

PPO-style policy update with importance sampling clipping and KL regularisation (retained from PPO/GRPO)

Optional Lagrangian multipliers with PID-controlled updates for constraint-based formulation (as in RaG)

Per-channel ε (numerical stability) — a small epsilon added to σ to avoid division by zero when a reward is constant within a group

Strengths & limitations

Strengths

✓Consistently outperforms GRPO across all tested tasks (tool calling, math, coding) and all metrics (correctness + constraint adherence)

✓Substantially improved training stability — critical for long multi-hour RL training runs on large models

✓Eliminates the collapse of different reward combinations to identical advantage values — preserves full training signal resolution

✓Drop-in replacement for GRPO — minimal code changes (confirmed by the ms-swift PR author: 'minor modifications to the GRPO codebase')

✓Naturally scales to any number of rewards (not just 2-3) without additional tuning

✓Open-source implementation in ms-swift for the Megatron framework — available to the community

✓Developed by NVIDIA, with direct implementations in NVIDIA's training stack

Limitations

✗Requires computing statistics (μ, σ) per channel — a small computational overhead over GRPO, though still negligible compared to rollout cost

✗No benefit for single-reward scenarios: with a single reward, GDPO effectively reduces to GRPO

✗Recent (January 2026) — community adoption is only emerging, with a smaller base of practical tuning guidance

✗Potential loss of information about relative magnitudes between different rewards (per-reward standardisation flattens channel scales) — in some scenarios explicit reward weighting may be needed

✗Requires a good reward distribution within the group of candidates — if the group is too small or homogeneous, normalisation can be numerically unstable

✗No empirical validation beyond the tasks tested by NVIDIA — generalisation to other domains (e.g. robotics, world models) requires validation
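The single-reward limitation listed above can be checked with a short derivation. It assumes one reward channel (J = 1), a non-constant reward within the group (σ > 0), and population statistics; these assumptions are made here and are not stated in the source.

```latex
\begin{aligned}
r'_i &= \frac{R(y_i) - \mu}{\sigma + \varepsilon}, \qquad s_i = r'_i, \\
\mu_s &= \frac{1}{K}\sum_{i=1}^{K} r'_i = 0, \qquad
\sigma_s = \frac{\sigma}{\sigma + \varepsilon}, \\
A_i &= \frac{s_i - \mu_s}{\sigma_s + \varepsilon}
     = \frac{R(y_i) - \mu}{(\sigma + \varepsilon)\left(\frac{\sigma}{\sigma + \varepsilon} + \varepsilon\right)}
     = \frac{R(y_i) - \mu}{\sigma + \varepsilon(\sigma + \varepsilon)}
     \;\xrightarrow{\;\varepsilon \to 0\;}\; \frac{R(y_i) - \mu}{\sigma}.
\end{aligned}
```

The limit is the usual group-relative advantage, so the second normalisation only rescales an already standardised signal by a factor close to one.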

Components

Per-reward Standardisation

Preserving training signal resolution between different reward combinations

For each reward R_j, μ_j and σ_j are computed independently within a group of K candidates. Each raw reward is then standardised: r'_ji = (R_j(y_i) − μ_j) / (σ_j + ε). This is the key difference from GRPO, where all rewards are summed before any normalisation.

Decoupled Sum + Group-Relative Batch Normalisation

Stabilising optimisation through group-relative scaling of the advantage signal

After per-reward normalisation, the normalised values are summed per candidate: s_i = Σ_j r'_ji. A second normalisation is then applied — group-relative batch normalisation on the summed signal: A_i = (s_i − μ_s) / (σ_s + ε). The resulting A_i is the advantage used in the policy update.

PPO-style Policy Update

Standard policy optimisation, retained unchanged from GRPO/PPO

Standard PPO/GRPO policy update with importance sampling clipping and KL regularisation against a frozen reference policy πref: L = −E[min(ρ_i × A_i, clip(ρ_i, 1 − ε_clip, 1 + ε_clip) × A_i)] + β × KL(πθ || πref), where ρ_i = πθ(y_i|x) / πθ_old(y_i|x) is the importance ratio against the rollout policy πθ_old. The advantage sits inside the min so that negative advantages are clipped on the correct side. This layer is unchanged from GRPO; the difference lies solely in how A_i is computed.
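A minimal NumPy sketch of the clipped objective follows. It works on sequence-level log-probabilities for simplicity, whereas real trainers usually operate per token. The function name `grpo_style_loss`, the defaults, and the choice of the non-negative KL estimator exp(d) − d − 1 with d = log πref − log πθ are assumptions of this sketch, not details taken from the GDPO paper.

```python
import numpy as np

def grpo_style_loss(logp_new, logp_old, logp_ref, advantages,
                    clip_eps: float = 0.2, beta: float = 0.0) -> float:
    """Clipped surrogate loss plus an optional KL penalty.

    logp_new: log pi_theta(y_i | x) under the current policy
    logp_old: log-probs under the policy that generated the rollouts
    logp_ref: log-probs under the frozen reference policy
    advantages: A_i, for example from per-reward standardisation
    """
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Taking the min of the two products keeps the pessimistic bound
    # for both positive and negative advantages.
    surrogate = np.minimum(unclipped, clipped)

    # Assumed KL estimator: exp(d) - d - 1 with d = logp_ref - logp_new.
    d = logp_ref - logp_new
    kl = np.exp(d) - d - 1.0

    return float(-surrogate.mean() + beta * kl.mean())
```

With a ratio of 2 and clip_eps = 0.2, a positive advantage is capped at 1.2 × A_i, while a negative advantage keeps the full 2 × A_i penalty. Multiplying the advantage by min(ratio, clip(ratio)) instead would not reproduce the second case.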

Official

Optional Lagrangian Multipliers

Converting a multi-reward problem to constrained optimisation (primary objective + inequality constraints)

Optional extension for the constrained formulation in which some rewards are treated as constraints with target thresholds τ_c. The composite reward then becomes: R(y_i) = R_primary(y_i) − Σ_c λ_c(t) × ReLU(τ_c − R_c(y_i)), where λ_c(t) are PID-controlled time-varying Lagrange multipliers. This pattern is used in Kuaishou's RaG/SCRL.
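The composite reward above is straightforward to vectorise. The sketch below assumes the multipliers λ_c are already given for the current step; the function name `composite_reward`, the array shapes, and the numbers in the usage lines are illustrative only.

```python
import numpy as np

def composite_reward(primary, constraint_rewards, thresholds, lambdas):
    """R(y_i) = R_primary(y_i) - sum_c lambda_c * ReLU(tau_c - R_c(y_i)).

    primary:            shape (K,)   primary reward per candidate
    constraint_rewards: shape (K, C) one column per constraint reward R_c
    thresholds:         shape (C,)   target thresholds tau_c
    lambdas:            shape (C,)   current multipliers lambda_c(t)
    """
    primary = np.asarray(primary, dtype=float)
    constraint_rewards = np.asarray(constraint_rewards, dtype=float)
    # ReLU(tau_c - R_c): only shortfalls below the threshold are penalised.
    shortfall = np.maximum(np.asarray(thresholds) - constraint_rewards, 0.0)
    return primary - shortfall @ np.asarray(lambdas, dtype=float)

# Two candidates, one constraint with threshold 0.8 and multiplier 2.0.
combined = composite_reward(
    primary=[1.0, 1.0],
    constraint_rewards=[[0.9], [0.5]],
    thresholds=[0.8],
    lambdas=[2.0],
)
```

The first candidate meets the threshold and keeps its primary reward; the second falls short by 0.3 and is penalised by 2.0 × 0.3.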

Official

Implementation

Reference implementations

ms-swift (modelscope) — Megatron GDPO trainer

Python · Auraithm (PR author) / modelscope (project maintainers)

Implementation pitfalls

Group size K too small · High

Per-reward standardisation requires reasonable μ_j, σ_j statistics within a group. For small K (e.g. K=2, 4) the statistics are unstable — within-group variance can be zero or very small, leading to exploding advantage values or NaN.

Fix: Use K ≥ 8 (recommended 16-32), add ε in the denominator (~1e-6), monitor σ_j during training and log warnings when below a threshold.
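The monitoring part of this fix might look like the following sketch. The logger name, the `sigma_floor` threshold, and the function name `standardise_with_checks` are hypothetical; a real trainer would hook this into its own metrics pipeline.

```python
import logging
import numpy as np

logger = logging.getLogger("gdpo")  # hypothetical logger name

def standardise_with_checks(rewards, eps: float = 1e-6,
                            sigma_floor: float = 1e-3):
    """Per-reward standardisation that flags near-constant channels.

    rewards: shape (K, J). Returns (normalised, degenerate), where
    degenerate is a boolean mask of shape (J,).
    """
    rewards = np.asarray(rewards, dtype=float)
    mu = rewards.mean(axis=0)
    sigma = rewards.std(axis=0)
    degenerate = sigma < sigma_floor
    for j in np.flatnonzero(degenerate):
        logger.warning(
            "reward channel %d: sigma=%.2e below floor %.0e (K=%d)",
            j, sigma[j], sigma_floor, rewards.shape[0],
        )
    # eps keeps the division finite; a constant channel maps to all zeros.
    return (rewards - mu) / (sigma + eps), degenerate

# K=2 group where the first reward is identical for both candidates.
normalised, degenerate = standardise_with_checks([[1.0, 0.3], [1.0, 0.7]])
```

A constant channel contributes zero to every candidate's sum, so it is harmless numerically but carries no signal. Logging how often this happens gives a direct indication of whether K is too small for a given reward.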

Loss of scale between different rewards after normalisation · Medium

Per-reward standardisation removes information about relative magnitudes between different rewards. If one reward is an order of magnitude more important than another, GDPO has no way to express this — all are scaled to a similar range.

Fix: Explicit reward weighting after normalisation: s_i = Σ_j w_j × r'_ji with learned or manually set w_j. RaG uses Lagrangian multipliers as dynamic weighting per constraint.
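A sketch of the weighted sum, assuming manually set weights; `weighted_gdpo_advantages` and the example values are names and numbers introduced here for illustration.

```python
import numpy as np

def weighted_gdpo_advantages(rewards, weights, eps: float = 1e-6):
    """Per-reward standardisation, weighted sum, then a second normalisation.

    rewards: shape (K, J); weights: shape (J,), the w_j in s_i = sum_j w_j * r'_ji.
    """
    rewards = np.asarray(rewards, dtype=float)
    weights = np.asarray(weights, dtype=float)
    normalised = (rewards - rewards.mean(axis=0)) / (rewards.std(axis=0) + eps)
    s = normalised @ weights
    return (s - s.mean()) / (s.std() + eps)

# Columns are (correctness, format); hypothetical values.
rewards = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 0.0]]
equal = weighted_gdpo_advantages(rewards, [1.0, 1.0])
correctness_heavy = weighted_gdpo_advantages(rewards, [3.0, 1.0])
```

Because the summed signal is normalised again, only the ratios between weights matter: multiplying every w_j by the same positive constant leaves the advantages essentially unchanged.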

Naive combination with constraint-based formulation · Medium

For constrained problems (constraint thresholds + Lagrangian multipliers), naively applying GDPO without PID-controlled Lagrangian updates can lead to oscillation and overshoot — λ_c grows too aggressively after a constraint violation, then overcorrects.

Fix: Use the PID-controlled Lagrangian update rule (Stooke et al. 2020) instead of a simple primal-dual update. This is the pattern used in Kuaishou's RaG.
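A simplified PID-style multiplier update is sketched below. It is written in the spirit of the PID Lagrangian idea rather than as a faithful copy of any published rule: the class name `PIDMultiplier`, the gains, and the choice to take the derivative on the constraint error are assumptions of this sketch.

```python
from dataclasses import dataclass

@dataclass
class PIDMultiplier:
    """One Lagrange multiplier lambda_c driven by a PID controller."""
    kp: float = 1.0   # proportional gain (hypothetical default)
    ki: float = 0.1   # integral gain (hypothetical default)
    kd: float = 0.0   # derivative gain (hypothetical default)
    integral: float = 0.0
    prev_error: float = 0.0

    def update(self, threshold: float, measured: float) -> float:
        """Return lambda_c for this step.

        threshold: target tau_c
        measured:  batch-average constraint reward R_c
        """
        error = threshold - measured  # positive while the constraint is violated
        # The integral term accumulates violations but is never negative.
        self.integral = max(0.0, self.integral + error)
        # The derivative term reacts only to a worsening violation.
        derivative = max(0.0, error - self.prev_error)
        self.prev_error = error
        raw = self.kp * error + self.ki * self.integral + self.kd * derivative
        return max(0.0, raw)  # multipliers stay non-negative

pid = PIDMultiplier(kp=1.0, ki=0.5, kd=0.0)
lam_violated = pid.update(threshold=0.8, measured=0.6)
lam_satisfied = pid.update(threshold=0.8, measured=0.9)
```

The proportional term lets λ_c fall as soon as the constraint is met again, which is what damps the grow-then-overcorrect behaviour of a purely integral (plain primal-dual) update.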

Evolution

Original paper · 2026 · NVIDIA Tech Report (arXiv 2601.05242), 8 January 2026 · Shih-Yang Liu

GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

Shih-Yang Liu, Xin Dong, Ximing Lu, Shizhe Diao, Peter Belcak, Mingjie Liu, Min-Hung Chen, Hongxu Yin, Yu-Chiang Frank Wang, Kwang-Ting Cheng, Yejin Choi, Jan Kautz, Pavlo Molchanov

2017

PPO (OpenAI) — policy gradient with importance sampling foundation

Schulman et al. introduce Proximal Policy Optimization — the foundation of all later on-policy RL methods for LLMs, including GRPO and GDPO.

PPO (concept)

2022

RLHF (InstructGPT) — aligning LLMs with preferences

OpenAI publishes InstructGPT using RLHF (reward model + PPO) to align LLMs with human annotator preferences — the standard post-training pipeline.

RLHF (concept)

2024

GRPO (DeepSeek) — group-relative advantage without a value model

Inflection point

DeepSeek introduces Group Relative Policy Optimization — eliminates the value model by using group-relative advantage normalisation. A critical precursor to GDPO.

GRPO (concept)

2026

GDPO (NVIDIA, January 2026) — fixing multi-reward GRPO

Inflection point

Liu et al. (NVIDIA Tech Report, arXiv 2601.05242) introduce GDPO — a direct extension of GRPO with decoupled per-reward normalisation. Solves the collapse of different reward combinations to identical advantage values in multi-reward settings.

2026

ms-swift integration (PR #7348, January 2026)

GDPO lands in ms-swift (modelscope) as an official rlhf_type option alongside GRPO, giving the community an open-source implementation for the Megatron training framework.

2026

RaG (Kuaishou, June 2026) — GDPO in the Recommendation-as-Generation paradigm

Kuaishou Technology uses GDPO as the core of Synergistic Cross-Domain Reward Learning (SCRL) in the production Recommendation-as-Generation paradigm. The first large-scale production deployment of GDPO (400M+ DAU).

RaG (Recommendation-as-Generation) (concept)