Action-Conditioned Video Generation

Generating future video frames conditioned on a specific robot action, enabling simulation of action consequences before execution.

BUILT ON

World Models

World Models is an architectural paradigm in model-based reinforcement learning (MBRL) in which an agent learns a compact, generative internal model of its environment's dynamics. The agent can then imagine or 'dream' future states and train its policy controller inside these internally simulated trajectories, rather than relying exclusively on costly real-environment interactions. The concept was formally demonstrated and synthesized by Ha and Schmidhuber (2018), who traced its conceptual roots to Schmidhuber's 1990 series of papers on RNN-based world models and controllers. The 2018 formalization introduced a three-component architecture: (1) V, a Vision model (Variational Autoencoder) that compresses high-dimensional observations (pixel images) into low-dimensional latent vectors z; (2) M, a Memory model (Mixture Density Network RNN, MDN-RNN) that models the temporal dynamics of the environment by predicting future latent states given the current latent state and agent action; and (3) C, a Controller (compact linear model) that maps the concatenated latent state and RNN hidden state to actions, trained with an evolution strategy (CMA-ES) to maximize reward. The critical result of Ha & Schmidhuber (2018) was demonstrating that the controller can be trained entirely within hallucinated dream sequences generated by the world model, and that the resulting policy can be transferred to the real environment. This decoupling of perception, prediction, and control enables training on synthetic data and can substantially improve sample efficiency relative to model-free RL.
Subsequent work extended the paradigm: PlaNet (Hafner et al., 2019) introduced planning in latent space with RSSM; the Dreamer series (Hafner et al., 2019–2023) combined world model learning with actor-critic training entirely in imagination, achieving state-of-the-art results across many environments with DreamerV3; MuZero (Schrittwieser et al., 2020) showed that a world model capturing only decision-relevant dynamics (reward, value, policy) is sufficient for planning. More recently, the paradigm has been extended to generative video models (Genie, Sora) used as interactive environment simulators.
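
The V, M, C split described above can be illustrated with a toy rollout. This is a minimal sketch under loud assumptions, not the Ha and Schmidhuber implementation: `encode` is a chunk-averaging stand-in for the VAE, `Memory` is a deterministic tanh RNN rather than an MDN-RNN, all weights are random instead of trained, and every name and size here is hypothetical. It only shows how a controller can be stepped against the memory model without touching a real environment.

```python
import math
import random

LATENT, HIDDEN = 4, 3  # hypothetical sizes, chosen only for illustration


def encode(obs):
    """V stand-in: compress a flat observation into LATENT values by chunk-averaging."""
    chunk = len(obs) // LATENT
    return [sum(obs[i * chunk:(i + 1) * chunk]) / chunk for i in range(LATENT)]


def make_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


class Memory:
    """M stand-in: deterministic tanh RNN (the original predicts a mixture distribution)."""

    def __init__(self, rng):
        self.w_h = make_matrix(HIDDEN, HIDDEN + LATENT + 1, rng)
        self.w_z = make_matrix(LATENT, HIDDEN, rng)

    def step(self, z, action, h):
        # New hidden state from previous hidden state, latent, and action.
        h_next = [math.tanh(x) for x in matvec(self.w_h, h + z + [action])]
        z_next = matvec(self.w_z, h_next)  # predicted next latent
        return z_next, h_next


class Controller:
    """C: linear map from the concatenation [z, h] to one action in [-1, 1]."""

    def __init__(self, rng):
        self.w = [rng.uniform(-1.0, 1.0) for _ in range(LATENT + HIDDEN)]
        self.b = 0.0

    def act(self, z, h):
        return math.tanh(sum(w * x for w, x in zip(self.w, z + h)) + self.b)


def dream_rollout(obs, memory, controller, steps):
    """Encode one real observation, then roll forward purely inside the model."""
    z, h = encode(obs), [0.0] * HIDDEN
    trajectory = []
    for _ in range(steps):
        a = controller.act(z, h)
        z, h = memory.step(z, a, h)
        trajectory.append((a, z))
    return trajectory


rng = random.Random(0)
traj = dream_rollout([rng.random() for _ in range(16)], Memory(rng), Controller(rng), steps=5)
```

In a trained system, a reward signal computed over such imagined trajectories would drive the search over the controller's few parameters, which is what makes a gradient-free optimizer practical for C.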

Diffusion Model

Diffusion models are probabilistic generative models built on two Markov chain processes: a fixed forward process that gradually adds Gaussian noise to data over T timesteps until the distribution converges to an isotropic Gaussian, and a learned reverse process that denoises step-by-step to reconstruct samples from the data distribution. The forward process is analytically tractable and requires no training; the reverse process is parameterized by a neural network (commonly a U-Net) conditioned on the current timestep, trained to predict the noise added at each step via a simplified variational objective (noise prediction loss). At inference, new samples are generated by starting from pure Gaussian noise and applying T learned denoising steps sequentially. The concept was introduced by Sohl-Dickstein et al. (2015) and made practically competitive by Ho et al. (2020) through the DDPM framework. Later work extended the paradigm to latent spaces (Latent Diffusion Models), continuous-time SDE formulations (Song et al., 2021), accelerated sampling (DDIM), and conditional generation via classifier-free guidance.
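
The fixed forward process and the noise-prediction loss described above can be sketched in a few lines. This is an illustrative sketch, assuming a linear beta schedule from 1e-4 to 0.02 over T = 1000 steps (the values commonly associated with DDPM) and treating data as a flat list of floats; the function names are hypothetical and no denoising network is included.

```python
import math
import random


def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances beta_1..beta_T (assumed linear schedule)."""
    return [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]


def cumulative_alpha(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


def q_sample(x0, t, alpha_bar, rng):
    """Sample x_t from x_0 in one shot: sqrt(alpha_bar_t)*x0 + sqrt(1 - alpha_bar_t)*eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    signal, noise = math.sqrt(alpha_bar[t]), math.sqrt(1.0 - alpha_bar[t])
    return [signal * x + noise * e for x, e in zip(x0, eps)], eps


def noise_prediction_loss(eps_pred, eps):
    """Mean squared error between predicted and true noise (simplified objective)."""
    return sum((p - e) ** 2 for p, e in zip(eps_pred, eps)) / len(eps)


T = 1000
alpha_bar = cumulative_alpha(linear_beta_schedule(T))
rng = random.Random(0)
x0 = [0.5, -0.25, 1.0, 0.0]
x_t, eps = q_sample(x0, t=T - 1, alpha_bar=alpha_bar, rng=rng)
```

Because `alpha_bar` shrinks toward zero as t grows, `x_t` at the last step is almost entirely noise, which is why sampling can start from a standard Gaussian. A training step would draw a random t, call `q_sample`, and minimize `noise_prediction_loss` between the network's output and `eps`.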

Commonly used with

VLA

Vision-Language-Action (VLA) is an architectural paradigm for robotic control introduced formally by Google DeepMind's RT-2 (Zitkovich et al., 2023). A VLA model is constructed by adapting a pretrained vision-language model (VLM) to additionally output robot action tokens, enabling a single end-to-end model to perceive the scene, understand language instructions, and generate executable robot actions. The core insight is that robot actions can be represented as discrete tokens within the existing vocabulary of a language model. RT-2 discretized the 7-dimensional end-effector action space (XYZ position, XYZ rotation, gripper extension) into 256 bins each, encoded as text tokens, and co-fine-tuned large VLMs (PaLI-X 5B/55B, PaLM-E 12B) on both internet-scale vision-language tasks and robot trajectory data. This joint training transfers semantic and reasoning capabilities from web-scale pretraining to physical robot control. VLA architectures consist of three conceptual components: (1) a vision encoder (e.g., ViT, CLIP, DINOv2, SigLIP) that produces visual token embeddings from RGB camera observations; (2) a language backbone (e.g., PaLM, LLaMA, Gemma) that processes both visual and text tokens; and (3) an action decoder that generates robot action tokens or continuous action vectors. The action output can be discrete (tokenized, as in RT-2 and OpenVLA) or continuous (diffusion/flow-based, as in π0). Subsequent work distinguished single-model VLAs (RT-2, OpenVLA, π0) from dual-system designs (Helix, GR00T N1) where a slower VLM planner is coupled with a faster action execution module. OpenVLA (Kim et al., Stanford, 2024) open-sourced a 7B-parameter VLA trained on 970k trajectories from the Open X-Embodiment dataset.
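
The 256-bin action discretization mentioned above can be sketched as uniform binning. This is a minimal illustration, not RT-2's actual tokenizer: the [-1, 1] bounds per dimension, the space-separated text rendering, and all function names are assumptions made for the example.

```python
NUM_BINS = 256  # bins per action dimension, as stated in the text


def discretize(action, low=-1.0, high=1.0, num_bins=NUM_BINS):
    """Map each continuous action value to an integer bin id in [0, num_bins - 1]."""
    tokens = []
    for value in action:
        clipped = min(max(value, low), high)  # out-of-range values saturate
        idx = int((clipped - low) / (high - low) * num_bins)
        tokens.append(min(idx, num_bins - 1))  # value == high lands in the last bin
    return tokens


def undiscretize(tokens, low=-1.0, high=1.0, num_bins=NUM_BINS):
    """Recover an approximate continuous action from bin ids using bin centers."""
    width = (high - low) / num_bins
    return [low + (t + 0.5) * width for t in tokens]


def action_to_text(tokens):
    """Render bin ids as a string a language model could emit (assumed format)."""
    return " ".join(str(t) for t in tokens)


# Hypothetical 7-D action: 3 position deltas, 3 rotation deltas, gripper extension.
action = [0.10, -0.25, 0.0, 0.5, -1.0, 1.0, 0.75]
tokens = discretize(action)
text = action_to_text(tokens)
recovered = undiscretize(tokens)
```

The round trip loses at most half a bin width per dimension, which is the precision cost of treating control as token prediction; continuous decoders such as diffusion or flow-based heads avoid that quantization.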

Related AI models

Other

UnifoLM-WMA-0
