Architecture

RaG (Recommendation-as-Generation)

2026 · Research · Published: 25 June 2026 · Updated: 25 June 2026

Key innovation

Paradigm shift in recommendation: instead of retrieving and ranking clips from a pre-produced pool, the system generates personalized videos on demand from inferred user interest, unifying the recommendation model and the video generator through shared Disentangled Semantic IDs (D-SIDs).

How it works

The RaG pipeline operates in three stages: (1) the Generative Recommendation Model predicts a D-SID sequence representing the user's future interest based on their profile and interaction history; (2) the Instruction Model (Qwen3-8B) translates these D-SIDs (optionally enriched with metadata such as advertised product info) into structured shot-level instructions (scene composition, camera motion, pacing, cinematic style); (3) Video Generation Agents execute these instructions in a hierarchical pipeline of three specialized agents (visual, audio, effects) on a shared Qwen2.5-32B backbone, with a bounded reflection loop (capped at 2 iterations) and KV-cache reuse. The entire pipeline is optimised by SCRL with GDPO constrained policy optimization.

Problem solved

Traditional video recommendation systems, both classical DLRMs and the newer Generative Recommendation Models, are fundamentally constrained by a static pool of pre-produced clips. When user interest falls outside the existing content (especially for dynamic, long-tail and diverse preferences), the system can only select the nearest available clip. The result is suboptimal matches on modern short-video platforms.

Key mechanisms

Disentangled Semantic IDs (D-SIDs) — factorising videos into content semantics + creative attributes via RQ-K-means quantization with separate codebooks (8,192 entries/layer, 4 hierarchical layers)

Generative interest prediction — autoregressive modelling of p(D-SIDs | user_context) instead of candidate scoring

Instruction Model as a semantic bridge between discrete D-SIDs and controllable generation (Qwen3-8B + projector trained in three stages)

Hierarchical multi-agent video generation with three sub-agents (visual, audio, effects) on a shared Qwen2.5-32B backbone with KV-cache reuse

Bounded reflection loop (capped at 2 Observe→Think→Act iterations) for cross-modal consistency while preserving latency

Synergistic Cross-Domain Reward Learning (SCRL) — constrained policy optimization with GDPO, user feedback as the primary objective, interest alignment + video quality as constraints with PID-controlled Lagrange multipliers

Decoupled deployment architecture: real-time GRM + nearline IM/VGAs + latency-aware serving with hierarchical SID-indexed cache

Strengths & limitations

Strengths

✓ Goes beyond the limit of a finite video pool by generating content for arbitrary interest D-SIDs

✓ Empirically validated in production on a 400M+ DAU platform with +5.46% revenue lift over DLRM and +1.87% over a strong GRM baseline

✓ The disentangled content/creative representation is more structured and less prone to interference during autoregressive generation than monolithic SIDs

✓ Hierarchical agentic structure enables specialization (visual/audio/effects) while sharing a backbone (parameter savings + KV-cache reuse)

✓ SCRL with GDPO solves the practical problem of combining heterogeneous rewards (quality, alignment, feedback) without hand-tuned reward weights

✓ Decoupled deployment (real-time + nearline) enables practical integration despite slow video generation

Limitations

✗ Requires heavy compute: video generation is orders of magnitude slower than classical recommendation inference (hence the need for a nearline pipeline)

✗ Direct applicability outside Kuaishou is not validated; public experiments are limited to the advertising scenario

✗ Quality of the final video is bounded by the current state of generative models; it still requires a 2-iteration reflection loop for cross-modal consistency

✗ The need to train separate components (encoder, GRM, IM, VGAs) by distillation from a very strong teacher (Gemini 2.5 Pro in supervision construction) complicates replication

✗ Personalising creative aspects of video for hundreds of millions of users requires aggressive caching (SID-indexed), which may reduce true personalisation for rare interest combinations

Components

Disentangled Semantic Video Encoders (D-SIDs)

Unified latent interface: the bridge between the recommendation model and the video generator

Multimodal encoder based on Qwen2.5-VL-7B-Instruct that produces two factorised video representations: content (entities, topics) and creative (style, rhythm, atmosphere). Each is independently quantised by RQ-K-means into a 4-layer code, 8,192 entries/layer. The resulting D-SIDs sequence = [content SIDs ; creative SIDs] forms the shared interface between recommendation and generation.
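The quantisation step described above can be sketched in a few lines of NumPy. This is a minimal illustration of residual quantisation with k-means codebooks, not the authors' implementation: the embeddings are random stand-ins for the encoder outputs, the codebook size is shrunk to 16 entries (the text specifies 8,192), and all function names are hypothetical.

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns a (k, d) array of centroids."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared distance from every point to every centroid.
        d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(1)
        for j in range(k):
            members = x[assign == j]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(0)
    return centroids

def nearest(residual, codebook):
    """Index of the closest codebook entry for every row of `residual`."""
    return ((residual[:, None, :] - codebook[None, :, :]) ** 2).sum(-1).argmin(1)

def fit_rq_kmeans(x, num_layers=4, codebook_size=16):
    """Fit one codebook per layer on the residual left by the previous layers."""
    codebooks, residual = [], x.copy()
    for layer in range(num_layers):
        cb = kmeans(residual, codebook_size, seed=layer)
        residual = residual - cb[nearest(residual, cb)]
        codebooks.append(cb)
    return codebooks

def encode(x, codebooks):
    """Map each vector to one code per layer (its semantic ID)."""
    residual, codes = x.copy(), []
    for cb in codebooks:
        idx = nearest(residual, cb)
        codes.append(idx)
        residual = residual - cb[idx]
    return np.stack(codes, axis=1)

def decode(codes, codebooks):
    """Reverse step: sum the selected centroid of every layer."""
    return sum(cb[codes[:, i]] for i, cb in enumerate(codebooks))

# Toy stand-ins for the two factorised representations (assumption).
rng = np.random.default_rng(0)
content_emb = rng.normal(size=(256, 8))
creative_emb = rng.normal(size=(256, 8))

# Separate codebooks per factor, then concatenate: [content SIDs ; creative SIDs].
content_cb = fit_rq_kmeans(content_emb)
creative_cb = fit_rq_kmeans(creative_emb)
d_sids = np.concatenate(
    [encode(content_emb, content_cb), encode(creative_emb, creative_cb)], axis=1
)
```

Each row of `d_sids` holds eight integers: four content codes followed by four creative codes. `decode` is the kind of reverse lookup the Instruction Model would need to recover a continuous vector from the discrete codes.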

Generative Recommendation Model (GRM)

Real-time interest modeling: low-latency generation of the user's interest D-SIDs

Autoregressive model predicting a D-SID sequence representing the user's future interest from their profile and interaction history: p(D-SIDs | user_context) = ∏ p(s_t | s_<t, user_context). Trained in streaming mode on interaction logs (impression, click, watch time, conversion) with periodic GDPO optimization.
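To make the chain-rule factorisation concrete, the sketch below scores and decodes a D-SID sequence with a toy conditional model. The model itself (two random weight matrices that look only at the user context and the previous code) is a hypothetical stand-in for the GRM, and the vocabulary is reduced to 16 codes; only the factorisation follows the text.

```python
import numpy as np

VOCAB = 16     # toy codebook size (the text specifies 8,192 per layer)
SEQ_LEN = 8    # 4 content codes + 4 creative codes
CTX_DIM = 5    # hypothetical size of the user-context vector

rng = np.random.default_rng(0)
W_ctx = rng.normal(size=(VOCAB, CTX_DIM))   # context -> logits
W_prev = rng.normal(size=(VOCAB, VOCAB))    # previous code -> logit bias

def next_token_logits(prefix, user_context):
    """Toy stand-in for the GRM: logits for s_t given s_<t and the context."""
    logits = W_ctx @ user_context
    if prefix:
        logits = logits + W_prev[:, prefix[-1]]
    return logits

def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def sequence_log_prob(sids, user_context):
    """log p(D-SIDs | ctx) = sum_t log p(s_t | s_<t, ctx)."""
    total = 0.0
    for t, s in enumerate(sids):
        total += log_softmax(next_token_logits(sids[:t], user_context))[s]
    return float(total)

def greedy_decode(user_context, length=SEQ_LEN):
    """Pick the most likely code at each step, feeding it back as the prefix."""
    prefix = []
    for _ in range(length):
        prefix.append(int(next_token_logits(prefix, user_context).argmax()))
    return prefix

user_context = rng.normal(size=CTX_DIM)
predicted = greedy_decode(user_context)
score = sequence_log_prob(predicted, user_context)
```

Unlike candidate scoring, nothing here enumerates a clip pool: the model emits codes one at a time, so it can produce a D-SID sequence that no existing video carries.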

Official

Instruction Model (IM)

Semantic bridge between recommendation and controllable generation

Language model based on Qwen3-8B that translates discrete D-SIDs (reconstructed via reverse RQ-K-means and projected through a learnable projector) into structured shot-level instructions: scene composition, camera motion, pacing, style. Trained in three stages: (1) projector training, (2) joint fine-tuning, (3) reward optimization. Supervision distilled from Gemini 2.5 Pro.

Official

Video Generation Agents (VGAs)

Hierarchical multi-agent video production: visual planning, audio alignment, artistic effects

Three role-specialized sub-agents — Visual Planning Agent (VPA), Audio Alignment Agent (AAA), Artistic Effect Enhancement Agent (AEEA) — operating sequentially over an evolving generation state. All share a single Qwen2.5-32B backbone, differentiated only by prompt and attention mask over the tool set. A bounded reflection loop, capped at 2 Observe→Think→Act iterations, ensures cross-modal consistency. KV-cache reuse across sub-agents dramatically reduces latency.
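The control flow of the sequential agents and the bounded reflection loop can be sketched as follows. The three agents, the consistency check and the repair step are all hypothetical toys (a duration mismatch between audio and video); only the sequential pass and the cap of two Observe→Think→Act iterations come from the text.

```python
MAX_REFLECTIONS = 2  # cap stated in the text

# Hypothetical sub-agents: each reads and extends a shared generation state.
def visual_agent(state):
    return {**state, "video_seconds": 12}

def audio_agent(state):
    return {**state, "audio_seconds": 10}  # deliberately mismatched for the demo

def effects_agent(state):
    return {**state, "effects": ["fade_in"]}

def find_issues(state):
    """Observe: report cross-modal inconsistencies (toy check on durations)."""
    if state["audio_seconds"] != state["video_seconds"]:
        return ["audio_length_mismatch"]
    return []

def repair(state, issues):
    """Act: apply a fix for each reported issue."""
    if "audio_length_mismatch" in issues:
        state = {**state, "audio_seconds": state["video_seconds"]}
    return state

def generate(instructions, agents, observe=find_issues, act=repair,
             max_iters=MAX_REFLECTIONS):
    state = {"instructions": instructions}
    for agent in agents:             # sequential pass over the evolving state
        state = agent(state)
    used = 0
    for _ in range(max_iters):       # bounded reflection loop
        issues = observe(state)      # Observe
        if not issues:               # Think: consistent, so stop early
            break
        state = act(state, issues)   # Act
        used += 1
    return state, used

state, used = generate("shot list", [visual_agent, audio_agent, effects_agent])
```

The cap trades completeness for predictable latency: if the check never passes, the loop still returns after two repair attempts instead of blocking the nearline pipeline.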

Synergistic Cross-Domain Reward Learning (SCRL)

End-to-end optimization of the closed-loop generation-recommendation system

Reinforcement learning mechanism combining three heterogeneous signals: video quality (visual + audio + effect), interest alignment (instr-align + rep-align) and user feedback (real + predicted). Formulated as constrained policy optimization: user feedback is the primary objective, alignment and quality are constraints. Solved via GDPO (Group-decoupled normalization) with PID-controlled Lagrangian multipliers.
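A rough sketch of why per-channel normalisation matters when combining these signals is given below. The exact GDPO update is not spelled out here, so this assumes one plausible reading: each reward channel is standardised within a group of sampled generations, the primary channel (user feedback) gets weight 1, and the constraint channels are weighted by their Lagrange multipliers. The reward values and function names are invented for illustration.

```python
import numpy as np

def standardize_per_channel(rewards, eps=1e-8):
    """rewards: (group_size, num_channels). Z-score each channel within the group."""
    return (rewards - rewards.mean(0)) / (rewards.std(0) + eps)

def combined_advantage(rewards, lambdas, primary=0):
    """Primary channel has weight 1; constraint channels use their multipliers."""
    z = standardize_per_channel(rewards)
    weights = np.array(lambdas, dtype=float)
    weights[primary] = 1.0
    return z @ weights

# Columns: [user feedback, interest alignment, video quality] on very different scales.
rewards = np.array([
    [0.02, 0.6, 70.0],
    [0.01, 0.5, 90.0],
    [0.05, 0.7, 80.0],
    [0.04, 0.4, 60.0],
])

naive = rewards.sum(1)                                      # dominated by the quality scale
feedback_only = combined_advantage(rewards, [1.0, 0.0, 0.0])  # constraints inactive
balanced = combined_advantage(rewards, [1.0, 0.5, 0.5])       # constraints partly active
```

With the raw sum, the sample with the highest quality score wins regardless of feedback. After standardisation, the ranking follows user feedback whenever the multipliers are zero, and the constraints only shift it as far as their multipliers allow.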

Official

Implementation

Reference implementations

Project page (Kuaishou)

Yanhua Cheng et al. (Kuaishou Technology)

Official

Implementation pitfalls

Video generation latency bottleneck (severity: Critical)

Video generation is orders of magnitude slower than classical recommendation inference. Placing generation directly in the real-time pipeline is infeasible — it requires a decoupled architecture (nearline + cache).

Fix: Decoupled deployment: real-time GRM, nearline IM+VGAs, hierarchical SID-indexed cache, asynchronous queueing of missing creative variations.
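One way to picture the cache-plus-queue arrangement is the sketch below. It assumes, without confirmation from the source, that "hierarchical" means falling back to the cached video sharing the longest D-SID prefix (so the content codes match even when the creative variant differs) while the exact variant is queued for nearline generation. The class and its methods are hypothetical, and the linear scan stands in for whatever index a production system would use.

```python
from collections import deque

class SidCache:
    """Toy SID-indexed cache with prefix fallback and a nearline request queue."""

    def __init__(self):
        self.videos = {}        # full D-SID tuple -> video id
        self.pending = deque()  # keys waiting for nearline generation

    def put(self, sids, video_id):
        self.videos[tuple(sids)] = video_id

    def lookup(self, sids, min_prefix=4):
        key = tuple(sids)
        if key in self.videos:
            return self.videos[key], "exact"
        if key not in self.pending:
            self.pending.append(key)  # ask the nearline pipeline for this variant
        # Fall back to the cached entry sharing the longest prefix with the request.
        best, best_len = None, 0
        for cached, video_id in self.videos.items():
            n = 0
            for a, b in zip(cached, key):
                if a != b:
                    break
                n += 1
            if n > best_len:
                best, best_len = video_id, n
        if best_len >= min_prefix:
            return best, "prefix"
        return None, "miss"

cache = SidCache()
cache.put([1, 2, 3, 4, 5, 6, 7, 8], "video_a")

exact = cache.lookup([1, 2, 3, 4, 5, 6, 7, 8])  # same content and creative codes
near = cache.lookup([1, 2, 3, 4, 9, 9, 9, 9])   # same content, new creative variant
miss = cache.lookup([7, 2, 3, 4, 5, 6, 7, 8])   # different content
```

The real-time path never waits on generation: it serves the closest cached variant immediately, and the queued key lets the nearline pipeline fill the gap for later requests.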

Scale mismatch between heterogeneous rewards (severity: High)

Naively aggregating heterogeneous rewards (quality, alignment, feedback) causes one scale to dominate the others and destabilises training.

Fix: Constrained policy optimization (GDPO) with per-channel standardization and PID-controlled Lagrangian multipliers; thresholds calibrated from the baseline distribution (τ = μ_base + k·σ_base) with different k per component.
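The threshold rule and a PID-style multiplier update can be sketched as below. The threshold formula is taken from the text; the controller is a generic PID on the constraint violation, and its gains, the non-negative clamp on the integral term and the class name are assumptions rather than details from the source.

```python
import numpy as np

def calibrate_threshold(baseline_scores, k):
    """tau = mu_base + k * sigma_base, from the baseline policy's score distribution."""
    return float(np.mean(baseline_scores) + k * np.std(baseline_scores))

class PIDMultiplier:
    """Lagrange multiplier driven by a PID controller on the constraint violation."""

    def __init__(self, kp=1.0, ki=0.1, kd=0.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, constraint_value, threshold):
        error = threshold - constraint_value  # > 0 when the constraint is violated
        self.integral = max(0.0, self.integral + error)  # keep the integral non-negative
        derivative = error - self.prev_error
        self.prev_error = error
        # The multiplier itself is clamped at zero: a satisfied constraint adds no penalty.
        return max(0.0, self.kp * error + self.ki * self.integral + self.kd * derivative)

# Hypothetical baseline quality scores and a per-component k.
baseline_quality = [0.61, 0.58, 0.66, 0.63, 0.60]
tau_quality = calibrate_threshold(baseline_quality, k=0.5)

pid = PIDMultiplier()
lam_violated = pid.update(constraint_value=tau_quality - 0.1, threshold=tau_quality)
lam_satisfied = pid.update(constraint_value=tau_quality + 0.5, threshold=tau_quality)
```

The multiplier rises while a constraint sits below its threshold and relaxes to zero once it is met, which replaces a hand-picked fixed weight per reward with a feedback loop.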

Cross-factor leakage between content and creative SIDs (severity: Medium)

Without explicitly enforced orthogonality, content and creative representations can leak into each other, destroying the disentanglement.

Fix: Add an orthogonality penalty, L_orth = ||z_content^T · z_creative||_2^2, to the contrastive loss for each modality.
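A small numerical check shows how this penalty behaves. The sketch assumes `z_content` and `z_creative` are batch matrices of shape (batch, dim) and reads the norm as the squared Frobenius norm of their cross-product; the source formula does not state the shapes, so this interpretation is an assumption.

```python
import numpy as np

def orthogonality_loss(z_content, z_creative):
    """Squared Frobenius norm of z_content^T @ z_creative over a batch."""
    cross = z_content.T @ z_creative  # (dim_content, dim_creative)
    return float((cross ** 2).sum())

# Build two blocks of mutually orthonormal columns via a QR decomposition.
rng = np.random.default_rng(0)
q, _ = np.linalg.qr(rng.normal(size=(32, 8)))
z_content, z_creative = q[:, :4], q[:, 4:]

# Simulate leakage: the creative representation picks up half of the content signal.
leaked = z_creative + 0.5 * z_content

clean_loss = orthogonality_loss(z_content, z_creative)
leaked_loss = orthogonality_loss(z_content, leaked)
```

The penalty is numerically zero for the disentangled pair and grows with the amount of content signal mixed into the creative representation, which is what makes it usable as an auxiliary loss term.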

Evolution

Original paper · 2026 · arXiv preprint (cs.IR), 24 June 2026 · Yanhua Cheng

Recommendation as Generation: Unifying Personalized Video Generation and Recommendation at Industrial Scale

Yanhua Cheng, Bo Wang, Haotian Zhang, Xinyuan Gao, Peng Jiang, Kun Gai

2016

DLRMs as the recommendation standard

Deep Learning Recommendation Models (Covington 2016, Wide & Deep, DIN etc.) — the retrieve-and-rank paradigm over a static candidate pool.

2023

Semantic IDs and Generative Recommendation Models

Introduction of Semantic IDs (Rajput et al. 2023) and the first GRMs modelling recommendation as autoregressive SID generation — yet still retrieving from a clip pool.

2025

Industrial-scale GRM (e.g. Xue et al. 2026)

Efficient architectures for scalable GRMs in production environments — Deng et al. 2025, Xue et al. 2026 — pave the way for generation as a native paradigm.

2026

RaG — Recommendation-as-Generation (Kuaishou)

Inflection point

Kuaishou Technology and Beihang University publish RaG (arXiv 2606.25496, June 2026), reported as the first production deployment of a system unifying recommendation and personalized video generation. It is deployed on a platform with 400M+ DAU and shows a +5.46% ad revenue lift over DLRM.