The model takes historical frames (video context) and an encoded action vector. Via cross-attention or concatenation into the token stream, the action vector modulates generation of subsequent frames. The model is trained on (observation, action, next observation) tuples from teleoperation or play data.
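The two conditioning routes named above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the architecture of any particular model: the function names, the projection matrix `w_action`, the token shapes, and the single-head attention without learned query/key/value projections are all assumptions made for the example.

```python
import numpy as np


def make_tuples(frames, actions):
    """Slice one trajectory into (observation, action, next observation) tuples.

    frames: array of T+1 frames; actions: array of T actions (hypothetical layout).
    """
    assert len(frames) == len(actions) + 1
    return [(frames[t], actions[t], frames[t + 1]) for t in range(len(actions))]


def condition_by_concatenation(frame_tokens, action, w_action):
    """Project the action to the token width and append it as one extra token.

    frame_tokens: (N, D), action: (A,), w_action: (A, D) -> returns (N + 1, D).
    """
    action_token = action @ w_action
    return np.concatenate([frame_tokens, action_token[None, :]], axis=0)


def cross_attend(frame_tokens, action_tokens):
    """Simplified single-head cross-attention: frame tokens query action tokens.

    Learned projections are omitted for brevity (an assumption of this sketch).
    """
    d = frame_tokens.shape[-1]
    scores = frame_tokens @ action_tokens.T / np.sqrt(d)    # (N, M)
    scores = scores - scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over action tokens
    return frame_tokens + weights @ action_tokens            # residual update


# Usage with made-up sizes: 16 frame tokens of width 8, a 7-dimensional action.
rng = np.random.default_rng(0)
frame_tokens = rng.normal(size=(16, 8))
action = rng.normal(size=(7,))
w_action = rng.normal(size=(7, 8))

sequence = condition_by_concatenation(frame_tokens, action, w_action)
attended = cross_attend(frame_tokens, (action @ w_action)[None, :])
```

With concatenation, the action becomes one more token that self-attention can read; with cross-attention, the frame tokens stay the same length and are updated by attending to the action tokens. Which route works better is not stated in the source.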
Collecting data in the real world is costly and risky, which creates a need for a photorealistic simulator that does not depend on specialised physics software.
Generative video models often lose object consistency between frames, especially with rapid camera movements or long sequences, which is a critical issue for world models in robotics.
A model trained on action-video pairs from one robot may not generalize to other morphologies or environments due to strong overfitting to the specific action signal.
Generating high-resolution video in a closed loop with incoming observations requires high-throughput GPUs.