The SCRL pipeline has four stages. Stage 1: Reward model setup. For each of the three reward domains, dedicated Transformer-based reward models are trained on task-specific data (visual quality, audio sync, effect alignment, instruction alignment, representation alignment). Stage 2: Threshold calibration. For each constraint reward R_c, μ_c^base and σ_c^base are computed on the SFT baseline distribution, and τ_c = μ_c^base + k_c × σ_c^base is set with a module-specific k_c (1.1 for VGAs, 0.8 for IM, 0.3 for GRM). Stage 3: Constrained reward construction. The composite reward is R(y_i) = R_feedback(y_i) − Σ_c λ_c(t) × ReLU(τ_c − R_c(y_i)), with PID-controlled λ_c(t). Stage 4: GDPO optimisation. Per-reward standardisation, group-relative batch normalisation, and a PPO-style policy update with importance sampling clipping and KL regularisation.
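The Stage 3 formula can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function and argument names are hypothetical, the rewards are plain floats standing in for reward-model outputs, and all numbers are made up.

```python
from typing import Dict


def relu(x: float) -> float:
    """ReLU on a scalar: only a shortfall below the threshold is penalised."""
    return x if x > 0.0 else 0.0


def composite_reward(
    r_feedback: float,
    constraint_rewards: Dict[str, float],
    thresholds: Dict[str, float],
    multipliers: Dict[str, float],
) -> float:
    """R(y) = R_feedback(y) - sum_c lambda_c * ReLU(tau_c - R_c(y))."""
    penalty = 0.0
    for name, r_c in constraint_rewards.items():
        shortfall = relu(thresholds[name] - r_c)
        penalty += multipliers[name] * shortfall
    return r_feedback - penalty


# Illustrative values only (not taken from the source).
example = composite_reward(
    r_feedback=1.0,
    constraint_rewards={"alignment": 0.5, "quality": 0.9},
    thresholds={"alignment": 0.75, "quality": 0.5},
    multipliers={"alignment": 2.0, "quality": 1.0},
)
```

Here the alignment reward falls 0.25 short of its threshold and is penalised by 2.0 × 0.25, while the quality reward already exceeds its threshold and contributes no penalty.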
Naively combining heterogeneous rewards in multi-reward RL (e.g. quality + alignment + feedback) suffers from three practical problems: (1) scale mismatch: different rewards have different orders of magnitude, so one dominates the others; (2) unequal importance: some rewards are hard business objectives, while others are quality/compliance constraints; (3) hand-tuned magic numbers (weights, thresholds) are fragile and do not generalise across modules. SCRL addresses these through a constrained formulation (user feedback as the primary objective, alignment and quality as constraints), PID-controlled Lagrange multipliers, and threshold calibration against the SFT baseline distribution.
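The scale-mismatch problem, and why standardising each reward separately helps, can be seen in a toy example. The sketch below only illustrates per-reward standardisation within one group of samples; it is not the full GDPO procedure, and the helper names, the use of the population standard deviation, and the reward values are all assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import Dict, List


def standardise(values: List[float], eps: float = 1e-8) -> List[float]:
    """Z-score one reward across the samples of a group."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def naive_sum(rewards: Dict[str, List[float]]) -> List[float]:
    """Add raw rewards per sample; the largest-scale reward dominates."""
    return [sum(vals) for vals in zip(*rewards.values())]


def standardised_sum(rewards: Dict[str, List[float]]) -> List[float]:
    """Standardise each reward separately, then add per sample."""
    z = {name: standardise(vals) for name, vals in rewards.items()}
    return [sum(vals) for vals in zip(*z.values())]


# Three samples scored by two rewards on very different scales (made-up numbers).
group = {
    "feedback": [100.0, 300.0, 200.0],
    "alignment": [0.9, 0.1, 0.5],
}
naive = naive_sum(group)
balanced = standardised_sum(group)
```

In the naive sum the second sample wins purely because of its feedback score. After per-reward standardisation the two rewards carry equal weight, and in this symmetric toy case they cancel almost exactly.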
Reward component measuring generated video quality across three aspects: R_visual (aesthetics, spatio-temporal consistency), R_audio (TTS synchronisation, BGM coherence), R_effect (quality of subtitles, highlights, action bars). All aspects have dedicated Transformer-based reward models trained on task-specific data.
Reward component measuring the alignment of generated content with the user's interest D-SIDs from the GRM: R_instr-align (semantic consistency between the D-SIDs and the IM-generated instructions), R_rep-align (semantic similarity between the D-SIDs and the final generated video). Anchors personalisation on the user's structured intent.
Reward component measuring actual user reaction: R_real (sparse but high-fidelity real interactions: click, like, collect, purchase), R_pred (dense engagement predictions from deployed ranking models, capturing preference strength beyond explicit interactions). The sparse + dense combination addresses the reward sparsity problem.
Time-varying λ_c(t) ≥ 0 for each constraint reward, updated by a PID-controlled rule (proportional + integral + derivative terms on constraint violations) instead of a naive primal-dual update. Dampens the oscillation and overshoot typical of constrained policy optimisation (Stooke et al. 2020).
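A PID-style multiplier update can be sketched as follows. This follows the general shape of PID Lagrangian methods (non-negative integral and derivative terms, multiplier clipped at zero) and is only an assumption about how such a rule might look here. The class name, the gains, and the use of the mean constraint reward as the measured signal are hypothetical, and the smoothing used in practice is left out.

```python
from dataclasses import dataclass


@dataclass
class PIDLagrangian:
    """Keeps lambda_c >= 0 and raises it while R_c stays below tau_c."""

    kp: float  # proportional gain (hypothetical value chosen by the user)
    ki: float  # integral gain
    kd: float  # derivative gain
    integral: float = 0.0
    prev_violation: float = 0.0

    def update(self, threshold: float, mean_reward: float) -> float:
        # Positive when the constraint R_c >= tau_c is violated.
        violation = threshold - mean_reward
        # Accumulated violation, projected so it never goes negative.
        self.integral = max(0.0, self.integral + violation)
        # Reacts only when the violation is getting worse.
        derivative = max(0.0, violation - self.prev_violation)
        self.prev_violation = violation
        return max(
            0.0,
            self.kp * violation + self.ki * self.integral + self.kd * derivative,
        )


controller = PIDLagrangian(kp=1.0, ki=0.5, kd=0.25)
lam_violated = controller.update(threshold=1.0, mean_reward=0.5)
lam_satisfied = controller.update(threshold=1.0, mean_reward=1.5)
```

With these toy gains, the first call sees a violation of 0.5 and returns a positive multiplier. The second call sees the constraint satisfied with margin, so the multiplier drops back to zero.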
Thresholds τ_c = μ_c^base + k_c × σ_c^base calibrated against the SFT baseline distribution on a held-out validation set, with a module-specific strictness factor k_c: VGAs (1.1 for both τ_a and τ_q, the strictest), IM (0.8 for τ_a), GRM (0.3 for τ_a; τ_q omitted). Eliminates hand-tuned magic numbers.
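The calibration rule is simple enough to show directly. The sketch below assumes the baseline scores are available as a plain list and uses the population standard deviation; the source does not say which estimator is used. The function name and the baseline scores are made up for illustration, while the k_c values are the ones quoted above.

```python
from statistics import mean, pstdev
from typing import Sequence


def calibrate_threshold(baseline_scores: Sequence[float], k_c: float) -> float:
    """Return tau_c = mu_c^base + k_c * sigma_c^base."""
    mu_base = mean(baseline_scores)
    sigma_base = pstdev(baseline_scores)  # assumption: population std
    return mu_base + k_c * sigma_base


# Made-up alignment scores of SFT baseline outputs on a validation set.
baseline_alignment = [0.2, 0.4, 0.6, 0.8]

# Strictness factors for the alignment threshold tau_a, per module.
tau_a = {
    module: calibrate_threshold(baseline_alignment, k_c)
    for module, k_c in {"VGAs": 1.1, "IM": 0.8, "GRM": 0.3}.items()
}
```

Because all three modules share the same baseline statistics in this toy example, the thresholds are ordered by k_c alone: GRM is the most relaxed and VGAs the strictest.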
Standard primal-dual updates of λ_c after constraint violations tend to overshoot and oscillate: λ_c grows too aggressively, then falls too sharply, destabilising training.
Static, hand-tuned τ_c are fragile: they do not generalise across modules (VGAs vs IM vs GRM) and require re-tuning after every model change. Without a link to the baseline distribution, they also give no intuition about how hard a constraint is to satisfy.
Real user feedback (R_real) is sparse and delayed: clicks and conversions occur for only a small fraction of samples. Using R_real alone yields a weak, unstable training signal.
Treating all rewards as equal (naive sum or weighted sum) ignores that some are hard business KPIs (feedback) while others are quality guarantees, which leads to suboptimal trade-offs when rewards conflict.
Schulman et al. introduce Proximal Policy Optimization (PPO), the basis for many later policy-gradient RL methods, including the PPO-style update used in SCRL.
Stooke et al. publish 'Responsive Safety in Reinforcement Learning by PID Lagrangian Methods', the direct technical foundation of constrained policy optimisation in SCRL.
OpenAI publishes InstructGPT, the standard RLHF pipeline with a reward model + PPO. An inspiration for the multi-aspect reward models in SCRL.
DeepSeek introduces Group Relative Policy Optimization (GRPO): value-free policy optimisation via group-relative normalisation.
Liu et al. (NVIDIA) introduce GDPO with per-reward decoupled normalisation, the direct optimisation building block in SCRL.
Kuaishou Technology + Beihang University combine GDPO + PID Lagrangians + multi-domain reward models into SCRL, the framework closing the end-to-end loop in the Recommendation-as-Generation paradigm. Production deployment on 400M+ DAU.
Factor defining how restrictive the constraint threshold is relative to baseline std. Larger k_c = harder to meet the constraint = higher pressure on alignment/quality.
How to decompose the overall optimisation goal into primary objective vs constraints. RaG chooses: user feedback = primary, alignment + quality = constraints.
Proportional, Integral, Derivative coefficients of the PID controller for λ_c(t) updates. Affect the speed of reaction to constraint violations and stability.
Number and specialisation of reward models in each domain. RaG uses 7+ independent Transformer-based reward models trained on task-specific data.
Stage-dependent in two dimensions: (1) reward routing per RaG module, (2) λ_c(t) changes dynamically during training.
Each module in RaG (GRM, IM, VGAs) is trained separately with a different subset of rewards and different strictness k_c. VGAs see all rewards (quality + alignment + feedback), IM sees alignment + feedback (without quality), GRM sees only alignment + feedback with the most relaxed k_a.
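The per-module routing can be written down as a small configuration table. The sketch below is only one way to encode it: the dictionary layout and the helper name are hypothetical, the reward groups are the ones named in this entry, and the k_c values are the strictness factors given in the threshold-calibration entry.

```python
from typing import Dict, Tuple

PRIMARY = "feedback"  # the primary objective; everything else is a constraint

# Which reward groups each separately trained module sees.
REWARD_ROUTING: Dict[str, Tuple[str, ...]] = {
    "VGAs": ("quality", "alignment", "feedback"),
    "IM": ("alignment", "feedback"),
    "GRM": ("alignment", "feedback"),
}

# Strictness factor k_c per module and constraint.
STRICTNESS: Dict[str, Dict[str, float]] = {
    "VGAs": {"quality": 1.1, "alignment": 1.1},
    "IM": {"alignment": 0.8},
    "GRM": {"alignment": 0.3},
}


def constraints_for(module: str) -> Dict[str, float]:
    """Map each constraint reward routed to `module` to its k_c."""
    return {
        name: STRICTNESS[module][name]
        for name in REWARD_ROUTING[module]
        if name != PRIMARY
    }


vgas_constraints = constraints_for("VGAs")
grm_constraints = constraints_for("GRM")
```

Keeping routing and strictness in data rather than code makes it easy to check that every module still optimises the feedback objective and that only VGAs carry a quality constraint.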
Reward model computations are independent and can be parallelised (one reward model per device). Per-reward standardisation and summation are lightweight. The bottleneck remains the rollout phase and the policy update itself, not the composite reward construction.
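Because the reward models do not depend on each other, scoring can be fanned out and gathered by name. The sketch below uses a thread pool and toy scoring functions as stand-ins. In a real deployment each scorer would wrap a reward model on its own device, which is not shown, and all names here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# A scorer maps a batch of generated samples to one score per sample.
Scorer = Callable[[List[str]], List[float]]


def score_in_parallel(
    scorers: Dict[str, Scorer], samples: List[str]
) -> Dict[str, List[float]]:
    """Run every scorer on the same batch concurrently and collect by name."""
    with ThreadPoolExecutor(max_workers=len(scorers)) as pool:
        futures = {name: pool.submit(fn, samples) for name, fn in scorers.items()}
        return {name: fut.result() for name, fut in futures.items()}


# Toy stand-ins for reward models (not real quality measures).
toy_scorers: Dict[str, Scorer] = {
    "visual": lambda ys: [float(len(y)) for y in ys],
    "audio": lambda ys: [float(y.count("a")) for y in ys],
}
scores = score_in_parallel(toy_scorers, ["banana", "kiwi"])
```

The result is a dictionary of per-reward score lists, which is the shape that per-reward standardisation and summation would consume.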
All SCRL components (reward models, GDPO optimiser, policy update) are standard Transformer/MLP workloads on tensor cores. Kuaishou deploys it in production on NVIDIA GPU clusters.
The framework itself is hardware-agnostic: it only requires hardware supporting standard policy optimisation operations and Transformer reward models. Can be deployed on TPU, AMD MI300, etc.