The RaG pipeline operates in three stages: (1) the Generative Recommendation Model predicts a D-SID sequence representing the user's future interest based on their profile and interaction history; (2) the Instruction Model (Qwen3-8B) translates these D-SIDs (optionally enriched with metadata such as advertised product info) into structured shot-level instructions (scene composition, camera motion, pacing, cinematic style); (3) the Video Generation Agents execute these instructions in a hierarchical pipeline of three specialized agents (visual, audio, effects) on a shared Qwen2.5-32B backbone, with a bounded reflection loop (capped at 2 iterations) and KV-cache reuse. The entire pipeline is optimised by SCRL, which is formulated as constrained policy optimization and solved with GDPO.
Traditional video recommendation systems, both classical DLRMs and the newer Generative Recommendation Models, are fundamentally constrained by a static pool of pre-produced clips. Even when user interest falls outside the existing content — especially for dynamic, long-tail and diverse preferences — the system can only select the nearest available clip. This leads to suboptimal matches in modern short-video platforms.
Multimodal encoder based on Qwen2.5-VL-7B-Instruct that produces two factorised video representations: content (entities, topics) and creative (style, rhythm, atmosphere). Each representation is independently quantised by RQ-K-means into a 4-layer code with 8,192 codebook entries per layer. The resulting D-SID sequence, [content SIDs ; creative SIDs], forms the shared interface between recommendation and generation.
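A minimal sketch of the residual quantisation step described above, assuming the codebooks have already been fitted (in RQ-K-means they would come from k-means run on the residuals of each layer; here they are random stand-ins). The function names, the toy dimensions, and the use of squared Euclidean distance are illustrative assumptions, not details taken from the source.

```python
import numpy as np

def rq_encode(x, codebooks):
    """Return one code index per layer by quantising the running residual."""
    residual = np.asarray(x, dtype=np.float64).copy()
    codes = []
    for codebook in codebooks:  # each codebook has shape (entries, dim)
        dists = np.sum((codebook - residual) ** 2, axis=1)
        idx = int(np.argmin(dists))
        codes.append(idx)
        residual = residual - codebook[idx]  # pass what is left to the next layer
    return codes

def build_d_sid(content_vec, creative_vec, content_books, creative_books):
    """Concatenate content SIDs and creative SIDs into one D-SID sequence."""
    return rq_encode(content_vec, content_books) + rq_encode(creative_vec, creative_books)

# Toy sizes for illustration; the text specifies 4 layers with 8,192 entries each.
DIM, LAYERS, ENTRIES = 16, 4, 64
rng = np.random.default_rng(0)
content_books = [rng.normal(size=(ENTRIES, DIM)) for _ in range(LAYERS)]
creative_books = [rng.normal(size=(ENTRIES, DIM)) for _ in range(LAYERS)]

d_sid = build_d_sid(rng.normal(size=DIM), rng.normal(size=DIM),
                    content_books, creative_books)
```

With 4 layers per representation, `d_sid` holds 8 integers: the first four index the content codebooks and the last four the creative codebooks.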
Autoregressive model predicting a D-SID sequence representing the user's future interest from their profile and interaction history: p(D-SIDs | user_context) = ∏ p(s_t | s_<t, user_context). Trained in streaming mode on interaction logs (impression, click, watch time, conversion) with periodic GDPO optimization.
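The factorisation above can be illustrated with a small sketch. It assumes a hypothetical `next_sid_probs(prefix, user_context)` callable standing in for the GRM, which returns a dictionary of next-code probabilities; the toy distribution and the greedy decoding strategy are illustrative assumptions only.

```python
import math

def sequence_log_prob(sids, user_context, next_sid_probs):
    """log p(D-SIDs | user_context) as a sum of per-step conditional log-probabilities."""
    total = 0.0
    for t, sid in enumerate(sids):
        probs = next_sid_probs(sids[:t], user_context)
        total += math.log(probs[sid])
    return total

def greedy_decode(length, user_context, next_sid_probs):
    """Pick the most likely code at every step, conditioning on the prefix so far."""
    prefix = []
    for _ in range(length):
        probs = next_sid_probs(prefix, user_context)
        prefix.append(max(probs, key=probs.get))
    return prefix

def toy_next_sid_probs(prefix, user_context):
    # Hypothetical stand-in for the GRM over a 3-code vocabulary:
    # the favoured code depends on the position and on a per-user offset.
    favoured = (len(prefix) + user_context["offset"]) % 3
    return {code: (0.6 if code == favoured else 0.2) for code in range(3)}

user_context = {"offset": 1}
predicted = greedy_decode(4, user_context, toy_next_sid_probs)
log_p = sequence_log_prob(predicted, user_context, toy_next_sid_probs)
```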
Language model based on Qwen3-8B that translates discrete D-SIDs (reconstructed via reverse RQ-K-means and projected through a learnable projector) into structured shot-level instructions: scene composition, camera motion, pacing, style. Trained in three stages: (1) projector training, (2) joint fine-tuning, (3) reward optimization. Supervision distilled from Gemini 2.5 Pro.
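A rough sketch of the input path described above: codes are mapped back to a continuous embedding by summing the selected codebook entries (the usual inverse of residual quantisation), then projected into the language model's embedding space. The single linear layer, the random codebooks, and all dimensions are assumptions for illustration; the source only states that the projector is learnable.

```python
import numpy as np

def rq_decode(codes, codebooks):
    """Approximate the original embedding as the sum of the chosen entry from each layer."""
    return sum(codebook[idx] for codebook, idx in zip(codebooks, codes))

class LinearProjector:
    """Hypothetical projector: a single affine map into the LM embedding space."""
    def __init__(self, in_dim, out_dim, rng):
        self.weight = rng.normal(scale=in_dim ** -0.5, size=(in_dim, out_dim))
        self.bias = np.zeros(out_dim)

    def __call__(self, x):
        return x @ self.weight + self.bias

DIM, LM_DIM, LAYERS, ENTRIES = 16, 32, 4, 64  # toy sizes, not the real model dimensions
rng = np.random.default_rng(0)
content_books = [rng.normal(size=(ENTRIES, DIM)) for _ in range(LAYERS)]
creative_books = [rng.normal(size=(ENTRIES, DIM)) for _ in range(LAYERS)]

content_codes, creative_codes = [3, 1, 4, 1], [5, 9, 2, 6]
embeddings = np.stack([
    rq_decode(content_codes, content_books),
    rq_decode(creative_codes, creative_books),
])
projector = LinearProjector(DIM, LM_DIM, rng)
soft_tokens = projector(embeddings)  # one soft token per representation
```

In this sketch the two rows of `soft_tokens` would be prepended to the instruction prompt as continuous inputs; how the actual model injects them is not specified here.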
Three role-specialized sub-agents, the Visual Planning Agent (VPA), the Audio Alignment Agent (AAA), and the Artistic Effect Enhancement Agent (AEEA), operate sequentially over an evolving generation state. All share a single Qwen2.5-32B backbone and are differentiated only by prompt and by an attention mask over the tool set. A bounded reflection loop, capped at 2 Observe→Think→Act iterations, ensures cross-modal consistency. KV-cache reuse across sub-agents reduces latency by avoiding recomputation of the shared context.
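The control flow of the sequential pass and the bounded reflection loop can be sketched as follows. The `GenerationState` class, the `find_issues` checker, and the rule that only flagged agents revise their output are hypothetical; the source specifies only the agent order and the cap of 2 iterations.

```python
from dataclasses import dataclass, field

MAX_REFLECTIONS = 2  # cap stated in the text

@dataclass
class GenerationState:
    instructions: str
    artifacts: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

def run_pipeline(state, agents, find_issues, max_reflections=MAX_REFLECTIONS):
    # First pass: every agent acts once, in order (VPA, then AAA, then AEEA).
    for name, act in agents:
        act(state, None)
        state.log.append(("act", name))
    for _ in range(max_reflections):
        issues = find_issues(state)      # Observe
        if not issues:                   # Think: nothing to fix, stop early
            break
        for name, act in agents:         # Act: only flagged agents revise
            if name in issues:
                act(state, issues[name])
                state.log.append(("revise", name))
    return state

def make_agent(name):
    def act(state, feedback):
        # Toy behaviour: bump a version counter instead of calling real tools.
        state.artifacts[name] = state.artifacts.get(name, 0) + 1
    return act

agents = [(n, make_agent(n)) for n in ("VPA", "AAA", "AEEA")]

def always_unhappy(state):
    # Worst case: the checker never accepts the audio track.
    return {"AAA": "audio does not match the cut points"}

final = run_pipeline(GenerationState("demo instructions"), agents, always_unhappy)
```

Even with a checker that never passes, the audio agent is revised at most twice, so the worst-case cost of a generation stays bounded.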
Reinforcement learning mechanism combining three heterogeneous signals: video quality (visual + audio + effect), interest alignment (instr-align + rep-align) and user feedback (real + predicted). Formulated as constrained policy optimization: user feedback is the primary objective, alignment and quality are constraints. Solved via GDPO (Group-decoupled normalization) with PID-controlled Lagrangian multipliers.
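A minimal sketch of a PID-controlled Lagrangian multiplier, written in the spirit of PID Lagrangian methods from constrained RL. It assumes each constraint has the form "signal must stay at or above a threshold"; the gains, thresholds, and the `shaped_reward` helper are hypothetical and not taken from the source.

```python
class PIDLagrangeMultiplier:
    """Multiplier that grows while a constraint is violated and relaxes once it is met."""
    def __init__(self, kp, ki, kd, threshold):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.threshold = threshold
        self.integral = 0.0
        self.prev_violation = 0.0
        self.value = 0.0

    def update(self, measured):
        violation = self.threshold - measured           # > 0 means the constraint is violated
        self.integral = max(0.0, self.integral + violation)
        derivative = violation - self.prev_violation
        self.prev_violation = violation
        raw = self.kp * violation + self.ki * self.integral + self.kd * derivative
        self.value = max(0.0, raw)                      # multipliers must stay non-negative
        return self.value

def shaped_reward(feedback, constraint_values, multipliers):
    """Primary objective plus multiplier-weighted constraint slack."""
    slack = sum(m.value * (v - m.threshold)
                for v, m in zip(constraint_values, multipliers))
    return feedback + slack

alignment = PIDLagrangeMultiplier(kp=1.0, ki=0.5, kd=0.1, threshold=0.8)
lam_violated = alignment.update(0.6)   # alignment below threshold: multiplier rises
lam_satisfied = alignment.update(0.9)  # alignment above threshold: multiplier drops to 0
```

The integral term lets the multiplier remember a persistent violation, while the proportional and derivative terms react to the current violation and its trend, which tends to damp the oscillations of a plain gradient-ascent multiplier update.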
Video generation is orders of magnitude slower than classical recommendation inference. Placing generation directly in the real-time pipeline is infeasible — it requires a decoupled architecture (nearline + cache).
Naively aggregating heterogeneous rewards (quality, alignment, feedback) causes one scale to dominate the others and destabilises training.
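The scale problem can be shown with a toy group of four candidates. The sketch assumes that "group-decoupled normalization" means standardising each reward signal within the candidate group before combining them; this is one plausible reading, and the exact GDPO formulation is not given here. The reward values and equal weights are made up for illustration.

```python
import numpy as np

def group_normalise(rewards, eps=1e-8):
    """Standardise one reward signal across the candidates of a group."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def naive_advantages(reward_groups, weights):
    """Sum raw rewards first, normalise afterwards: the largest scale dominates."""
    total = sum(weights[k] * np.asarray(v, dtype=np.float64)
                for k, v in reward_groups.items())
    return group_normalise(total)

def decoupled_advantages(reward_groups, weights):
    """Normalise each signal separately, then combine on a common scale."""
    return sum(weights[k] * group_normalise(v) for k, v in reward_groups.items())

reward_groups = {
    "quality": [0.9, 0.3, 0.6, 0.2],          # score in [0, 1]
    "watch_time": [12.0, 30.0, 25.0, 8.0],    # seconds, a much larger scale
}
weights = {"quality": 1.0, "watch_time": 1.0}

naive = naive_advantages(reward_groups, weights)
decoupled = decoupled_advantages(reward_groups, weights)
best_naive = int(np.argmax(naive))          # follows watch time almost exclusively
best_decoupled = int(np.argmax(decoupled))  # favours the candidate strong on both signals
```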
Without explicitly enforced orthogonality, content and creative representations can leak into each other, destroying the disentanglement.
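The source does not say how orthogonality is enforced. One common choice, shown here purely as an assumption, is to penalise the cross-correlation between the two embedding batches so that no content dimension is linearly predictable from a creative dimension.

```python
import numpy as np

def orthogonality_penalty(content, creative, eps=1e-8):
    """Squared Frobenius norm of the cross-correlation matrix between two embedding batches."""
    c = content - content.mean(axis=0)
    s = creative - creative.mean(axis=0)
    c = c / (np.linalg.norm(c, axis=0) + eps)   # unit-norm columns
    s = s / (np.linalg.norm(s, axis=0) + eps)
    cross = c.T @ s                             # (content_dim, creative_dim) correlations
    return float(np.sum(cross ** 2))

rng = np.random.default_rng(0)
content = rng.normal(size=(256, 8))
independent_creative = rng.normal(size=(256, 8))

disentangled = orthogonality_penalty(content, independent_creative)  # close to zero
leaked = orthogonality_penalty(content, content.copy())              # creative copies content
```

Added to the encoder's training loss with a small weight, a term like this would push the two representations apart; a fully leaked representation scores roughly one unit per shared dimension.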
Deep Learning Recommendation Models (Covington 2016, Wide & Deep, DIN etc.) — the retrieve-and-rank paradigm over a static candidate pool.
Introduction of Semantic IDs (Rajput et al. 2023) and the first GRMs modelling recommendation as autoregressive SID generation — yet still retrieving from a clip pool.
Efficient architectures for scalable GRMs in production environments — Deng et al. 2025, Xue et al. 2026 — pave the way for generation as a native paradigm.
Kuaishou Technology and Beihang University report (arXiv 2606.25496, June 2026) the first production deployment of a system unifying recommendation and personalized video generation. It is deployed to 400M+ DAU, with a +5.46% ad revenue lift over DLRM.
Stage-dependent activation: different sub-agents (visual/audio/effects) activate sequentially depending on the generation state.
Hierarchical serving strategy: in Case 1 (content-SIDs hit), return the cached video or generate missing creative variations asynchronously; in Case 2 (content-SIDs miss), serve the video with the nearest-neighbor content SIDs and enqueue a new generation with priority.
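The two serving cases can be sketched with an in-memory cache and a priority queue. Several details are assumptions made for illustration: the `NearlineServer` class, serving an existing variant while a missing creative variation is generated, and measuring "nearest" by the length of the shared code prefix.

```python
import heapq
from itertools import count

HIGH, NORMAL = 0, 1  # lower number is served first by the generation workers

def shared_prefix(a, b):
    """Number of leading codes two SID tuples have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

class NearlineServer:
    def __init__(self):
        self.cache = {}        # content SIDs -> {creative SIDs: video id}
        self._queue = []       # pending generation jobs
        self._tie = count()    # keeps FIFO order among equal priorities

    def add_video(self, content, creative, video_id):
        self.cache.setdefault(content, {})[creative] = video_id

    def _enqueue(self, priority, content, creative):
        heapq.heappush(self._queue, (priority, next(self._tie), content, creative))

    def serve(self, content, creative):
        variants = self.cache.get(content)
        if variants:                                  # Case 1: content-SIDs hit
            if creative in variants:
                return variants[creative]
            self._enqueue(NORMAL, content, creative)  # missing creative variation
            return next(iter(variants.values()))
        self._enqueue(HIGH, content, creative)        # Case 2: content-SIDs miss
        if not self.cache:
            return None
        nearest = max(self.cache, key=lambda c: shared_prefix(c, content))
        return next(iter(self.cache[nearest].values()))

    def next_job(self):
        if not self._queue:
            return None
        _, _, content, creative = heapq.heappop(self._queue)
        return content, creative

server = NearlineServer()
server.add_video((1, 2, 3, 4), (9, 9, 9, 9), "vid_a")
server.add_video((1, 2, 7, 7), (5, 5, 5, 5), "vid_b")
exact = server.serve((1, 2, 3, 4), (9, 9, 9, 9))    # full hit
variant = server.serve((1, 2, 3, 4), (0, 0, 0, 0))  # content hit, creative miss
fallback = server.serve((1, 2, 7, 8), (0, 0, 0, 0)) # content miss
first_job = server.next_job()
```

Because the miss is enqueued with higher priority, `first_job` is the content-miss request even though the creative variation was queued earlier.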
Real-time GRM (generating interest D-SIDs) is sequential per request but scalable across requests. Nearline video generation (IM + VGAs) runs in parallel for many D-SIDs, but within a single generation the VGA sub-agents run sequentially (VPA → AAA → AEEA), with a bounded reflection loop. KV-cache reuse between sub-agents partially amortises the sequentiality.
All key components (Qwen2.5-VL, Qwen3-8B, Qwen2.5-32B, GRM) are typical autoregressive/multimodal models whose training and inference are natively targeted at GPUs with tensor cores and frameworks like vLLM/SGLang.
The RaG paradigm itself is independent of any particular hardware family — it can be implemented on TPU, AWS Inferentia or other LLM/diffusion accelerators as long as scalable generative inference is available.