Alignment

SCRL

2026 · Active · Published: 25 June 2026 · Updated: 25 June 2026

Key innovation

A closed-loop multi-reward RL optimisation that combines three heterogeneous reward domains — video quality, interest alignment and user feedback — through constrained policy optimisation, where user feedback is the primary objective and alignment and quality are constraints with PID-controlled Lagrange multipliers.

How it works

The SCRL pipeline has four stages.

Stage 1, reward model setup: each component of the three reward domains gets a separate Transformer-based reward model trained on task-specific data (visual quality, audio sync, effect quality, instruction alignment, representation alignment).

Stage 2, threshold calibration: for each constraint reward R_c, the mean μ_c^base and standard deviation σ_c^base are computed on the SFT baseline distribution, and the threshold is set to τ_c = μ_c^base + k_c × σ_c^base with a module-specific k_c (1.1 for VGAs, 0.8 for IM, 0.3 for GRM).

Stage 3, constrained reward construction: the composite reward is R(y_i) = R_feedback(y_i) − Σ_c λ_c(t) × ReLU(τ_c − R_c(y_i)), with PID-controlled λ_c(t). A sample is penalised only when a constraint reward falls below its threshold.

Stage 4, GDPO optimisation: per-reward standardisation, group-relative batch normalisation, and a PPO-style policy update with importance-sampling clipping and KL regularisation.
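The Stage 3 composite reward can be sketched in a few lines. This is a minimal illustration of the formula as stated, not the authors' implementation; the function name, the dictionary keys and the numbers are hypothetical.

```python
from typing import Dict


def composite_reward(
    r_feedback: float,
    constraint_rewards: Dict[str, float],
    thresholds: Dict[str, float],
    multipliers: Dict[str, float],
) -> float:
    """R(y_i) = R_feedback(y_i) - sum_c lambda_c * ReLU(tau_c - R_c(y_i))."""
    penalty = 0.0
    for name, r_c in constraint_rewards.items():
        # ReLU(tau_c - R_c): positive only when the constraint reward
        # falls below its threshold, zero otherwise.
        shortfall = max(0.0, thresholds[name] - r_c)
        penalty += multipliers[name] * shortfall
    return r_feedback - penalty


# Hypothetical values for a single sample y_i (not taken from the paper).
reward = composite_reward(
    r_feedback=0.9,
    constraint_rewards={"alignment": 0.5, "quality": 0.8},
    thresholds={"alignment": 0.75, "quality": 0.5},
    multipliers={"alignment": 2.0, "quality": 1.0},
)
```

In this example only the alignment constraint is violated (0.5 is below 0.75), so it alone contributes a penalty; the quality reward is above its threshold and costs nothing.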

Problem solved

Naively combining heterogeneous rewards in multi-reward RL (e.g. quality + alignment + feedback) runs into three practical problems: (1) scale mismatch, where rewards differ by orders of magnitude and one dominates the others; (2) unequal roles, where some rewards are hard business objectives and others are quality/compliance constraints; (3) fragile hand-tuned magic numbers (weights, thresholds) that are hard to generalise across modules. SCRL addresses these through a constrained formulation (user feedback as primary, alignment and quality as constraints), PID-controlled Lagrange multipliers and threshold calibration against the SFT baseline distribution.

Key mechanisms

Asymmetric objective formulation — primary objective (user feedback) vs constraints (alignment, quality) instead of a naive sum

Three synergistic reward domains — quality (visual+audio+effect), alignment (instr+rep), feedback (real+pred) with dedicated reward models per component

PID-controlled Lagrangian multipliers — λ_c(t) updated by a PID rule from constraint violations instead of a naive primal-dual update (Stooke et al. 2020)

Calibrated thresholds — τ_c = μ_c^base + k_c × σ_c^base relative to the SFT baseline distribution on a held-out validation set

Module-specific strictness factors — different k_c for VGAs (1.1), IM (0.8), GRM (0.3) reflecting each module's role in the pipeline

GDPO as the optimiser: per-reward standardisation keeps distinct reward combinations from collapsing to the same advantage value

Reward augmentation — combining sparse real signals (R_real) with dense predicted signals (R_pred) for sample efficiency
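The collapse that per-reward standardisation avoids can be shown with a toy example. The sketch below is an assumption-laden illustration of the idea (standardise each reward within its group, sum, then normalise across the batch), not the GDPO reference code; all function names and reward values are hypothetical.

```python
from statistics import mean, pstdev
from typing import List

EPS = 1e-8  # guards against division by zero when a reward is constant


def standardise(values: List[float]) -> List[float]:
    """Zero-mean, unit-variance scaling; a constant list maps to zeros."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + EPS) for v in values]


def summed_then_normalised(group: List[List[float]]) -> List[float]:
    """Coupled baseline: sum each rollout's rewards, then standardise."""
    return standardise([sum(rewards) for rewards in group])


def decoupled_advantages(group: List[List[float]]) -> List[float]:
    """Standardise each reward within the group, then sum per rollout."""
    n_rewards = len(group[0])
    columns = [standardise([rollout[j] for rollout in group])
               for j in range(n_rewards)]
    return [sum(col[i] for col in columns) for i in range(len(group))]


def batch_normalise(groups: List[List[float]]) -> List[List[float]]:
    """Standardise the summed advantages across all groups in the batch."""
    flat = standardise([a for g in groups for a in g])
    out, start = [], 0
    for g in groups:
        out.append(flat[start:start + len(g)])
        start += len(g)
    return out


# Two groups of two rollouts, each rollout scored by two rewards.
group_a = [[0.0, 0.0], [1.0, 0.0]]  # better rollout wins on one reward
group_b = [[0.0, 0.0], [1.0, 1.0]]  # better rollout wins on both rewards

coupled = [summed_then_normalised(g) for g in (group_a, group_b)]
decoupled = [decoupled_advantages(g) for g in (group_a, group_b)]
final = batch_normalise(decoupled)
```

With summed rewards both groups end up with approximately the same advantages (about -1 and +1), so winning on one reward is indistinguishable from winning on two. With per-reward standardisation the better rollout in group_b receives roughly twice the advantage of the one in group_a, and the batch-level step preserves that ordering.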

Strengths & limitations

Strengths

✓ Validated in Kuaishou production (400M+ DAU): +5.46% ad revenue vs DLRM, +1.87% vs GRM baseline

✓ Reduces hand-tuned magic numbers via threshold calibration against the baseline distribution

✓ Asymmetric objective formulation reflects business reality: feedback is a hard KPI, alignment and quality are guarantees

✓ PID-controlled Lagrange multipliers damp the oscillation typical of primal-dual updates, stabilising optimisation

✓ Module-specific strictness factors allow precise tuning to each component's role in the pipeline

✓ Builds on the proven GDPO: leverages its per-reward normalisation, which addresses advantage value collapse

✓ Combining sparse + dense rewards addresses the practical reward sparsity problem in real-world RL

Limitations

✗ Operational complexity: requires maintaining 7+ independent reward signals and models (visual, audio, effect, instr-align, rep-align, real feedback, pred feedback)

✗ Requires an SFT baseline distribution for threshold calibration, so initial bootstrapping needs a separate baseline

✗ Strictness factors k_c are domain-specific and still require designer decisions (though less arbitrary than raw thresholds)

✗ Focused on video recommendation: generalisation to other domains (e.g. text generation, robotics) requires empirical validation

✗ The PID controller for the Lagrange multipliers introduces additional hyperparameters (P, I, D coefficients) that require tuning

✗ Direct replication outside Kuaishou is difficult: the full stack requires access to production user feedback data and dedicated reward models

Components

Video Quality Rewards

Constraint reward: perceptual quality guarantee of the generated video

Reward component measuring generated video quality across three aspects: R_visual (aesthetics, spatio-temporal consistency), R_audio (TTS synchronisation, BGM coherence), R_effect (quality of subtitles, highlights, action bars). All aspects have dedicated Transformer-based reward models trained on task-specific data.

Official

Interest Alignment Rewards

Constraint reward: semantic alignment guarantee with user intent

Reward component measuring how well the generated content aligns with the user's interest D-SIDs from the GRM: R_instr-align (semantic consistency D-SIDs ↔ IM-generated instructions), R_rep-align (semantic similarity D-SIDs ↔ the final generated video). Anchors personalisation on the user's structured intent.

Official

User Feedback Rewards

Primary objective: the most commercially important optimisation signal

Reward component measuring actual user reaction: R_real (sparse but high-fidelity real interactions — click, like, collect, purchase), R_pred (dense engagement predictions from deployed ranking models, capturing preference strength beyond explicit interactions). The sparse + dense combination solves the reward sparsity problem.

PID-controlled Lagrangian Multipliers

Adaptive weighting of constraint rewards based on their current violation

Time-varying λ_c(t) ≥ 0 for each constraint reward, updated by a PID-controlled rule (proportional + integral + derivative terms on the constraint violation) instead of a naive primal-dual update. This damps the oscillation and overshoot typical of constrained policy optimisation (Stooke et al. 2020).
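A PID-style multiplier update might look like the sketch below. It is patterned on the general form of PID Lagrangian methods, with the constraint "reward at least τ_c" standing in for a cost limit; the class name, the gains kp, ki, kd and the reward sequence are hypothetical, and the exact rule used in SCRL is not given in this text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PIDMultiplier:
    """PID-style update for one Lagrange multiplier lambda_c >= 0."""

    threshold: float                     # tau_c
    kp: float = 1.0                      # proportional gain (hypothetical)
    ki: float = 0.1                      # integral gain (hypothetical)
    kd: float = 0.5                      # derivative gain (hypothetical)
    integral: float = 0.0                # accumulated violation, kept >= 0
    prev_reward: Optional[float] = None  # batch mean of R_c at the last step

    def update(self, mean_reward: float) -> float:
        # Positive when the batch-mean constraint reward is below tau_c.
        error = self.threshold - mean_reward
        # The integral term remembers persistent violations; clamped at zero.
        self.integral = max(0.0, self.integral + error)
        # The derivative term reacts only when the reward is getting worse.
        if self.prev_reward is None:
            derivative = 0.0
        else:
            derivative = max(0.0, self.prev_reward - mean_reward)
        self.prev_reward = mean_reward
        # Project onto lambda >= 0.
        return max(0.0, self.kp * error
                   + self.ki * self.integral
                   + self.kd * derivative)


# Hypothetical trajectory: the constraint reward recovers over three steps.
controller = PIDMultiplier(threshold=0.5)
lambdas = [controller.update(r) for r in (0.25, 0.5, 0.75)]
```

As the batch-mean reward rises from below the threshold to above it, the multiplier shrinks and reaches zero once the constraint is comfortably satisfied, so the penalty fades without a hand-set schedule.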

Official

Calibrated Thresholds with Module-Specific Strictness

Automatic calibration of constraint thresholds against baseline statistics instead of manual tuning

Thresholds τ_c = μ_c^base + k_c × σ_c^base calibrated against the SFT baseline distribution on a held-out validation set, with module-specific strictness factor k_c: VGAs (1.1 for both τ_a and τ_q — strictest), IM (0.8 for τ_a), GRM (0.3 for τ_a, τ_q omitted). Eliminates hand-tuned magic numbers.
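The calibration rule is simple enough to show directly. In this sketch the k_c values are the ones quoted above, while the baseline scores are made up and the use of the population standard deviation (rather than the sample one) is an assumption.

```python
from statistics import mean, pstdev
from typing import Sequence

# Strictness factors k_c for the alignment threshold, as quoted in the text.
STRICTNESS = {"VGAs": 1.1, "IM": 0.8, "GRM": 0.3}


def calibrate_threshold(baseline_scores: Sequence[float], k_c: float) -> float:
    """tau_c = mu_c^base + k_c * sigma_c^base over baseline reward scores."""
    return mean(baseline_scores) + k_c * pstdev(baseline_scores)


# Hypothetical alignment scores of the SFT baseline on a held-out set.
baseline_alignment = [0.2, 0.4, 0.4, 0.6]

tau_a = {module: calibrate_threshold(baseline_alignment, k)
         for module, k in STRICTNESS.items()}
```

Because every threshold is expressed in units of the baseline's own spread, a larger k_c asks the policy to beat the SFT baseline by a wider margin, which is why the VGAs threshold comes out highest and the GRM threshold lowest.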

Implementation

Implementation pitfalls

Naive primal-dual updates of Lagrange multipliers lead to oscillation · High

Standard primal-dual updates of λ_c after constraint violations tend to overshoot and oscillate — λ_c grows too aggressively, then falls too sharply, destabilising training.

Fix: Use a PID-controlled update rule (Stooke et al. 2020) with proportional + integral + derivative components instead of a simple primal-dual update.

Hand-tuned thresholds instead of calibrated ones · Medium

Static, hand-tuned τ_c are fragile — they do not generalise across modules (VGAs vs IM vs GRM) and require re-tuning after every model change. The lack of a relationship to the baseline distribution leaves no intuition about constraint difficulty.

Fix: Calibrate τ_c = μ_c^base + k_c × σ_c^base against the SFT baseline distribution on a held-out validation set, with module-specific k_c.

Reward sparsity for R_real · Medium

Real user feedback (R_real) is sparse and delayed — clicks/conversions happen rarely per sample. Naive use of R_real alone leads to an unstable, weak training signal.

Fix: Augment with R_pred (dense engagement predictions from existing ranking models). R_feedback = R_real + R_pred improves sample efficiency without losing the high fidelity of R_real.

Failure to distinguish primary objective vs constraints · High

Treating all rewards as equal (naive sum or weighted sum) ignores that some are hard business KPIs (feedback) while others are quality guarantees — leading to suboptimal trade-offs when rewards conflict.

Fix: Asymmetric formulation: primary objective + inequality constraints instead of a symmetric sum. SCRL uses user feedback as primary, alignment and quality as constraints.
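Written out, the asymmetric formulation has roughly the following shape. The expectation-level constraint in the first line is an assumption made for illustration; the second line is the per-sample penalised reward given in the pipeline description.

```latex
\max_{\pi}\ \mathbb{E}_{y \sim \pi}\big[R_{\mathrm{feedback}}(y)\big]
\quad \text{s.t.} \quad
\mathbb{E}_{y \sim \pi}\big[R_c(y)\big] \ge \tau_c
\quad \text{for each constraint } c

R(y_i) = R_{\mathrm{feedback}}(y_i)
  - \sum_{c} \lambda_c(t)\,\mathrm{ReLU}\big(\tau_c - R_c(y_i)\big),
\qquad \lambda_c(t) \ge 0
```

A symmetric weighted sum would instead fix every weight in advance; here each λ_c(t) rises only while its constraint is violated and the ReLU removes the penalty once R_c(y_i) reaches τ_c.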

Evolution

Original paper · 2026 · arXiv 2606.25496 (Kuaishou Technology + Beihang University, June 2026), section 2.5 · Yanhua Cheng

Recommendation as Generation: Unifying Personalized Video Generation and Recommendation at Industrial Scale

Yanhua Cheng, Bo Wang, Haotian Zhang, Xinyuan Gao, Peng Jiang, Kun Gai

2017

PPO (OpenAI) — policy gradient foundation

Schulman et al. introduce Proximal Policy Optimization, the basis for many later on-policy RL methods, including the PPO-style clipped update used in SCRL.

PPO (concept)

2020

PID Lagrangian Methods (Stooke et al.) — stable constrained RL

Inflection point

Stooke et al. publish 'Responsive Safety in Reinforcement Learning by PID Lagrangian Methods' — the direct technological foundation of constrained policy optimisation in SCRL.

2022

RLHF (InstructGPT) — popularisation of reward models

OpenAI publishes InstructGPT — the standard RLHF pipeline with a reward model + PPO. An inspiration for the multi-aspect reward models in SCRL.

RLHF (concept)

2024

GRPO (DeepSeek) — group-relative advantage

DeepSeek introduces Group Relative Policy Optimization — value-free policy optimisation via group-relative normalisation.

GRPO (concept)

2026

GDPO (NVIDIA, January 2026) — fixing multi-reward GRPO

Inflection point

Liu et al. (NVIDIA) introduce GDPO with per-reward decoupled normalisation — the direct optimisation building block in SCRL.

GDPO (concept)

2026

SCRL in RaG (Kuaishou, June 2026)

Inflection point

Kuaishou Technology + Beihang University combine GDPO + PID Lagrangians + multi-domain reward models into SCRL — the framework closing the end-to-end loop in the Recommendation-as-Generation paradigm. Production deployment on 400M+ DAU.

RaG (Recommendation-as-Generation) (concept)