Kuaishou Turns Its Recommendation Engine into a Video Generator

Kuaishou has published research describing RaG (Recommendation-as-Generation), a system that rethinks video recommendation from the ground up: instead of searching existing content, it generates personalized ad videos directly from a user's interest profile. The system is already deployed in production on a platform with more than 400 million daily active users, and in A/B tests it delivered a +1.870% ad revenue lift over the previous GRM baseline.

Key takeaways

RaG replaces the "retrieve-and-rank" model with a new approach: "predict interests, then generate video"
The system consists of 5 modules: D-SIDs, GRM, Instruction Model, Video Generation Agents, SCRL
Deployed in Kuaishou's advertising system serving 400M+ daily active users
Full system: +5.462% ad revenue vs. DLRM baseline, +1.870% vs. GRM baseline
Paper available on arXiv (2606.25496), project page: recommendation-as-generation.github.io

From "retrieve-and-rank" to "generate-and-serve"

For the past decade, every major video recommendation system has operated on a single blueprint: a user arrives, the system estimates their interests, then searches a content library for the best-matching videos. The "retrieve-and-rank" model powers TikTok, YouTube and Kuaishou itself — but it carries a fundamental constraint: it can only recommend what already exists.

The paper "Recommendation as Generation: Unifying Personalized Video Generation and Recommendation at Industrial Scale," published on arXiv (2606.25496), describes a system that inverts this logic entirely. Rather than asking "which video best fits this user?", researchers from Kuaishou and Beihang University ask: "what should a video look like if it were perfectly tailored to this user?" — and then generate it on the spot.

This is not a lab prototype. RaG runs today inside Kuaishou's advertising system. In A/B tests, the full system delivered a +5.462% revenue lift over the traditional DLRM baseline and +1.870% over the stronger GRM model — a difference that translates to concrete revenue at a scale of hundreds of millions of users.

Five modules, one loop

D-SIDs: dual-channel video identity

D-SIDs (Disentangled Semantic IDs) solve the problem of building a shared representation for both recommendation and video generation. A single ad video does not carry a single semantic meaning: the same product can be filmed as a soft lifestyle short or an aggressive sales ad. The authors split video representation into two independent channels: Content SIDs (what the video shows: product, people, actions) and Creative SIDs (how the video looks: style, pacing, atmosphere, camera work). The representation is built on Qwen2.5-VL-7B-Instruct embeddings, quantized via RQ-KMeans into a 2-layer codebook with 8,192 codes per layer.

The results are measurable: the collision rate in semantic space dropped from 18.24% (QARM) to 2.62%, and semantic retrieval precision (R@1) improved by 16.5 percentage points over prior methods. A cleaner latent space makes both the recommendation model easier to train and the generator easier to control.
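To make the quantization step concrete, here is a minimal sketch of residual k-means quantization and one plausible way to measure collisions. It is not Kuaishou's implementation: the function names (`kmeans`, `rq_kmeans`, `collision_rate`), the plain Lloyd's k-means, and the collision definition (the share of items whose full code tuple is also held by another item) are all assumptions made for illustration, and the paper may define the metric differently.

```python
import numpy as np

def kmeans(x, k, iters=10, seed=0):
    """Plain Lloyd's k-means; returns (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(1)
        for j in range(k):
            members = x[assign == j]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(0)
    dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return centroids, dists.argmin(1)

def rq_kmeans(x, k, layers=2):
    """Residual quantization: each layer clusters what the previous
    layer failed to explain. Returns (codebooks, codes of shape [n, layers])."""
    residual = np.asarray(x, dtype=float).copy()
    codebooks, codes = [], []
    for layer in range(layers):
        centroids, assign = kmeans(residual, k, seed=layer)
        codebooks.append(centroids)
        codes.append(assign)
        residual = residual - centroids[assign]
    return codebooks, np.stack(codes, axis=1)

def collision_rate(codes):
    """Assumed definition: fraction of items whose full code tuple
    is shared with at least one other item."""
    _, inverse, counts = np.unique(
        codes, axis=0, return_inverse=True, return_counts=True
    )
    return float((counts[inverse.ravel()] > 1).mean())
```

In the article's setting `k` would be 8,192 and `layers` would be 2, applied separately to the content and creative channels; the toy k-means above is far too slow for that scale and only shows the structure of the computation.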

GRM and Instruction Model: from interests to instructions

The GRM (Generative Recommendation Model) is the core of the recommendation side. Traditional models predict whether a given video will appeal to a user. GRM instead autoregressively predicts the sequence of D-SIDs corresponding to a user's future interests — based on their static profile and multi-granularity behavior history. Its output is not a video ID but an "intent map" ready to be consumed by the generation system. GRM operates online at ~100 ms latency, fitting within the recommendation system's response window.

The Instruction Model (IM) translates abstract D-SIDs into concrete shot-level production instructions: what each shot should show, when to cut, what voiceover to include, when a CTA should appear. The model is based on Qwen3-8B and trained on supervision data generated by Gemini 2.5 Pro. Training proceeds in three phases: first training only the projectors with the LLM frozen, then joint fine-tuning, and finally reward optimization in a loop with SCRL.

VGAs: a multi-agent video production line

The Video Generation Agents consist of three specialized agents: the Visual Planning Agent lays out scenes and timing, the Audio Alignment Agent synchronizes narration and music with the visual rhythm, and the Artistic Effect Enhancement Agent adds captions, transitions, stickers and CTAs. Each agent operates as a sequential decision process — selecting actions (text-to-video, image-to-video, TTS, BGM, effects) and observing the production state before the next step.

The system includes a reflection mechanism: an agent can observe intermediate outputs and revise its plan. Revisions are capped at two iterations to keep end-to-end latency under control (~180 s). Compared with a traditional, template-driven pipeline, VGAs achieve a 41.4 percentage point higher automated win rate and an 18.5 pp higher win rate in human user studies.
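The bounded reflection loop can be sketched as follows. This is an illustrative skeleton only: `run_agent` and the `execute`, `critique` and `revise` callables are hypothetical stand-ins for whatever tools and checks a real agent would use, and the only detail taken from the article is the cap of two revisions.

```python
from typing import Callable, List, Tuple

MAX_REFLECTIONS = 2  # cap stated in the article; everything else is assumed

def run_agent(
    plan: List[str],
    execute: Callable[[List[str]], str],
    critique: Callable[[str], List[str]],
    revise: Callable[[List[str], List[str]], List[str]],
    max_reflections: int = MAX_REFLECTIONS,
) -> Tuple[str, int]:
    """Execute a plan, inspect the result, and revise at most
    `max_reflections` times. Returns (final output, revisions used)."""
    current = list(plan)
    for revisions_used in range(max_reflections + 1):
        output = execute(current)
        issues = critique(output)
        # Stop when the output passes, or when the revision budget is spent.
        if not issues or revisions_used == max_reflections:
            return output, revisions_used
        current = revise(current, issues)
    raise AssertionError("unreachable: the loop always returns")
```

Capping the loop bounds the worst case at three executions per agent, which is what keeps the latency budget predictable even when the critique never passes.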

SCRL: one optimization loop

SCRL (Synergistic Cross-Domain Reward Learning) closes the optimization loop. Rather than summing weighted rewards, the system treats user feedback (clicks, likes, purchases) as the primary objective and treats interest alignment and video quality as hard constraints with thresholds. When quality or alignment falls below its threshold, the system incurs a penalty. GDPO normalizes rewards that live on different scales, while PID-controlled Lagrangian multipliers dynamically update the constraint weights, removing the need to hand-tune them.
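A generic PID-controlled multiplier shows how a constraint weight can adjust itself. The sketch below is written under stated assumptions and is not the paper's formulation: the class name `PIDMultiplier`, the gains `kp`, `ki`, `kd`, the clamping choices, and the `shaped_reward` helper are all hypothetical, and GDPO normalization is deliberately left out.

```python
from typing import Dict

class PIDMultiplier:
    """Lagrange multiplier for a constraint `measured >= threshold`,
    driven by a PID controller on the violation (hypothetical gains)."""

    def __init__(self, threshold: float, kp: float = 1.0,
                 ki: float = 0.1, kd: float = 0.0):
        self.threshold = threshold
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0
        self.value = 0.0

    def update(self, measured: float) -> float:
        error = self.threshold - measured  # positive when violated
        # The integral term remembers persistent violations; clamp at zero.
        self.integral = max(0.0, self.integral + error)
        # The derivative term reacts only to worsening violations.
        derivative = max(0.0, error - self.prev_error)
        self.prev_error = error
        self.value = max(
            0.0,
            self.kp * error + self.ki * self.integral + self.kd * derivative,
        )
        return self.value

def shaped_reward(feedback: float, measured: Dict[str, float],
                  multipliers: Dict[str, PIDMultiplier]) -> float:
    """User-feedback reward minus multiplier-weighted constraint shortfalls."""
    penalty = sum(
        m.value * max(0.0, m.threshold - measured[name])
        for name, m in multipliers.items()
    )
    return feedback - penalty
```

The multiplier grows while a constraint stays violated and decays back to zero once the measurement clears the threshold, so no fixed penalty weight has to be chosen up front.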

Each reward category contributes meaningfully: R_visual alone lifts Automated Win Rate by 21.4 pp, while adding the Interest Alignment Reward boosts alignment scores from 0.707 to 0.828 (a 17.1% relative gain).

The engineering challenge: milliseconds vs. minutes

RaG also solves a fundamental systems-engineering problem: recommendation systems need millisecond responses, but video generation takes minutes. The authors decouple these two worlds: GRM runs online (~100 ms), while IM and VGAs run near-line (seconds to minutes), with finished videos stored in a cache. When a user request arrives, the system checks whether the D-SIDs predicted by GRM hit the cache. On a full hit, it returns the video immediately. If only the Creative SIDs are missing, it returns a content-matched video and generates the creative variant asynchronously. If the Content SIDs are absent as well, it falls back to a nearest-neighbor video and queues generation for the uncovered SIDs.
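The three-tier lookup described above can be sketched as a small cache class. This is a simplified illustration, not the production design: `VideoCache`, its nested-dictionary layout, the in-memory `generation_queue`, and the `nearest_neighbor` callable are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

SID = Tuple[int, int]  # one code per codebook layer (2 layers in the article)

@dataclass
class VideoCache:
    # content SID -> creative SID -> cached video id
    videos: Dict[SID, Dict[SID, str]] = field(default_factory=dict)
    # (content, creative) pairs waiting for near-line generation
    generation_queue: List[Tuple[SID, SID]] = field(default_factory=list)

    def serve(self, content: SID, creative: SID,
              nearest_neighbor: Callable[[SID], str]) -> Tuple[str, str]:
        """Return (video id, outcome) without ever blocking on generation."""
        by_creative = self.videos.get(content)
        if by_creative and creative in by_creative:
            return by_creative[creative], "full_hit"
        # Any miss queues the exact variant for asynchronous generation.
        self.generation_queue.append((content, creative))
        if by_creative:
            # Content matches: serve any cached creative variant for now.
            return next(iter(by_creative.values())), "content_hit"
        # Nothing for this content: fall back to the closest cached video.
        return nearest_neighbor(content), "fallback"
```

The key property is that every branch returns immediately, so the online path stays within the recommendation latency budget while generation catches up in the background.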

Why this matters

For a decade, the scale of recommendation systems imposed a tradeoff: the more users, the harder it became to serve truly individual content, because the cost of creating each video was too high. RaG demonstrates that this tradeoff can be broken. Personalized video generation is no longer limited to startups with small user bases: it is running on a platform with more than 400 million daily users and delivering measurable ad revenue.

The architectural precedent may matter even more than the revenue numbers. RaG is not an AIGC layer wrapped around an existing recommender. It is a fundamental redesign: the recommendation model stops selecting from a list and starts forecasting what content should look like. If this pattern takes hold across the industry, the line between "recommendation engine" and "content production platform" will start to dissolve. Advertising video is the first market where such a change is commercially viable, and on-demand video generation could fundamentally reshape the economics of social platforms.

What's next?

The project website (recommendation-as-generation.github.io) publishes example generated ads — model code and weights have not been released
The paper is an arXiv preprint (2606.25496) — peer review is pending, so the reported production numbers should be treated as internal Kuaishou data without external validation
The authors signal potential expansion beyond video advertising — possibly to organic content recommendations in other formats