A model (typically a video transformer or diffusion network) processes frame sequences and is trained to predict masked or future frames. Gradients flow back through time (BPTT), teaching the model temporal coherence and scene physics. After pretraining the model is fine-tuned for downstream tasks (robot control, scene understanding).
Lack of large-scale labelled visual data; need to teach a model scene physics and motion dynamics without human supervision.
A 10s video at 30fps = 300 frames ร patch embeddings โ token sequences are 10-100ร longer than text. Requires aggressive temporal subsampling or frame patchification with large stride.
Inaccurate synchronization between video recordings and action labels (e.g. HID latency) leads to incorrect video-action pairs and degrades world model quality.
Massive attention matrices over frame sequences require high-throughput GPUs with tensor cores.