Video Pretraining
Learning visual representations by predicting video frame sequences in a self-supervised manner, rather than through supervised image classification.
A model (typically a video transformer or diffusion network) processes frame sequences and is trained to predict masked or future frames. Gradients flow backward through the entire frame sequence, teaching the model temporal coherence and scene physics. After pretraining, the model is fine-tuned for downstream tasks (robot control, scene understanding).
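To make the masked-frame objective concrete, here is a minimal sketch assuming a PyTorch setup. The tiny transformer, frame-embedding dimension, masking ratio, and the name TinyVideoMAE are illustrative placeholders, not the architecture of any model named on this card.

```python
import torch
import torch.nn as nn

class TinyVideoMAE(nn.Module):
    """Toy masked-frame predictor: hide random frames, reconstruct them."""
    def __init__(self, frame_dim=768, num_frames=16, mask_ratio=0.5):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.mask_token = nn.Parameter(torch.zeros(1, 1, frame_dim))
        self.pos_emb = nn.Parameter(torch.zeros(1, num_frames, frame_dim))
        layer = nn.TransformerEncoderLayer(frame_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(frame_dim, frame_dim)  # reconstruct frame embeddings

    def forward(self, frames):                       # frames: (B, T, D) frame embeddings
        B, T, D = frames.shape
        mask = torch.rand(B, T, device=frames.device) < self.mask_ratio  # True = hidden
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand(B, T, D), frames)
        x = self.encoder(x + self.pos_emb)
        pred = self.head(x)
        # Self-supervised loss: only masked positions are scored against the originals.
        return ((pred - frames) ** 2)[mask].mean()

model = TinyVideoMAE()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
frames = torch.randn(2, 16, 768)                     # stand-in for precomputed frame embeddings
loss = model(frames)
loss.backward()
optimizer.step()
```

Predicting future frames instead of masked ones follows the same recipe, with the mask restricted to the tail of the sequence.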
Lack of large-scale labelled visual data; need to teach a model scene physics and motion dynamics without human supervision.
Partially parallel
Dense
All paths active
VideoCLIP and VideoMAE: first scalable video pretraining with masked modelling
Breakthrough: Sora (OpenAI) and Genie (DeepMind) demonstrate generative video pretraining at scale
Breakthrough: UnifoLM-WMA-0 (Unitree) applies video pretraining as the foundation of a world-model-action framework for robotics
Massive attention matrices over frame sequences require high-throughput GPUs with tensor cores.
BUILT ON
Pretraining
Pretraining (self-supervised pretraining) is the first and most expensive stage in building modern foundation models. The model learns to predict missing or next portions of data (next tokens in text, masked words, future video frames, future robot states) without human labels. This unlocks virtually unlimited raw data (web crawls, code, books, YouTube video, robot telemetry). The result is a set of weights encoding "world knowledge": dense statistical representations that can later be fine-tuned, instruction-tuned, or RLHF-aligned for any downstream task. Pretraining underpins GPT, BERT, CLIP, Llama, Gemini, and robotics foundation models (Pi-Zero, Gemini Robotics, Ti0).
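The core of this stage is a simple self-supervised objective. The sketch below assumes a next-token (autoregressive) setup in PyTorch; the vocabulary size, dimensions, and random token tensor are illustrative placeholders, and a deep transformer would sit between the embedding and the output head in practice.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 1000, 64
embed = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size)

tokens = torch.randint(0, vocab_size, (4, 32))   # (batch, sequence) of raw, unlabelled data
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # shift by one: each position predicts the next token
logits = lm_head(embed(inputs))                  # (batch, seq-1, vocab)
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()                                   # no human labels needed: the data itself is the target
```

Masked-word and future-frame objectives swap the shifted targets for masked positions, but the principle is identical.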