
Self-supervised joint-embedding predictive architecture world model, pre-trained on over 1M hours of video, enabling understanding, prediction and planning in the physical world.
Parameters
1.2B parameters
Release date
11 June 2025
Access: Download
Deployment: Local, Cloud
Overview
Access & deployment
Download
Local, Cloud
Weights: Open source
Key parameters
Parameters: 1.2B
Fine-tuning: supported
Input: video, image
Robotics
Spatial prediction, Spatial reasoning, Environment modeling, Embodied task planning, Motion planning, Robot manipulation, Robot control, Scene understanding
Technical specification
Parameters
1.2B parameters
License
MIT (model weights on Hugging Face)
Hardware requirements
Pretraining of the world model is performed on NVIDIA GPU clusters. Inference with the reference ViT-L variant (~0.3B parameters, 64 frames, 256 px resolution) is feasible on a single consumer-grade GPU. The full 1.2B-parameter variant requires a server-grade GPU (e.g., A100 80GB or H100). Weights and reference code are provided in PyTorch.
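As a rough cross-check of the figures above, the memory needed for the weights alone can be estimated from the parameter count. This is a minimal sketch: the helper `weight_memory_gb` and the dtype table are illustrative assumptions, not part of the reference code, and activation memory for 64-frame clips is deliberately left out.

```python
# Rough weight-memory estimate. Activations and framework overhead are NOT included.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}

def weight_memory_gb(num_params: float, dtype: str = "float16") -> float:
    """Return the memory needed to hold the weights alone, in GB (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

# Parameter counts taken from the card: ~0.3B (ViT-L) and 1.2B (full variant).
variants = {"ViT-L (~0.3B)": 0.3e9, "full (1.2B)": 1.2e9}
for name, n in variants.items():
    for dtype in ("float32", "float16"):
        print(f"{name:14s} {dtype:8s} {weight_memory_gb(n, dtype):.1f} GB")
```

The weights themselves are small by this estimate, so the hardware guidance above presumably reflects activation memory for long video clips, which this sketch does not model.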
Features: Fine-tuning
Modalities
Input
video, image
Output
structured_data, motion_trajectories
Capabilities and applications
Native model capabilities
Video understanding
The model's ability to analyse and interpret video content: recognising actions, motion, events and relationships between objects over time.
Category: video
Planning
The model's ability to determine a sequence of actions leading to a goal: predicting the consequences of actions and selecting an optimal path in a given environment.
Category: planning
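The planning loop described above (predict the consequence of each action, then pick the sequence that ends closest to the goal) can be illustrated with a toy example. This is a sketch under strong assumptions: `predict` is a hand-written stand-in for a learned world model, states are grid cells rather than embeddings, and exhaustive search is used only because the toy action space is tiny. None of these names come from the model's code.

```python
from itertools import product

# Hypothetical toy setup: states are (x, y) grid cells, not learned embeddings.
ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def predict(state, action):
    """Toy stand-in for a learned world model: returns the predicted next state."""
    dx, dy = ACTIONS[action]
    return (state[0] + dx, state[1] + dy)

def cost(state, goal):
    """L1 distance between a predicted state and the goal state."""
    return abs(state[0] - goal[0]) + abs(state[1] - goal[1])

def plan(start, goal, horizon):
    """Score every action sequence of length `horizon` and return the cheapest."""
    best_seq, best_cost = None, float("inf")
    for seq in product(ACTIONS, repeat=horizon):
        state = start
        for action in seq:
            state = predict(state, action)  # roll the model forward one step
        c = cost(state, goal)
        if c < best_cost:
            best_seq, best_cost = seq, c
    return best_seq, best_cost

seq, final_cost = plan(start=(0, 0), goal=(2, 1), horizon=3)
print(seq, final_cost)
```

With continuous or high-dimensional actions, exhaustive enumeration is not practical, and sampling-based optimisers are a common replacement for the inner search.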
Vision encoder
The model's ability to encode images and video frames into dense representations (embeddings), used for downstream tasks or as a backbone for vision-language models.
Category: vision
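One simple way to use such embeddings downstream is to pool the per-token vectors into a single clip embedding and compare it against class prototypes. The sketch below is illustrative only: the 3-dimensional vectors and the labels "push" and "lift" are made up, and the nearest-centroid rule is a deliberately simple stand-in, not the evaluation protocol used for the model.

```python
import math

def mean_pool(tokens):
    """Average a list of per-token vectors into one clip-level embedding."""
    dim = len(tokens[0])
    return [sum(t[i] for t in tokens) / len(tokens) for i in range(dim)]

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def nearest_label(embedding, centroids):
    """Return the label whose centroid is most similar to the embedding."""
    return max(centroids, key=lambda label: cosine(embedding, centroids[label]))

# Made-up 3-d token embeddings standing in for frozen encoder output.
clip_tokens = [[0.9, 0.1, 0.0], [1.1, -0.1, 0.2], [1.0, 0.0, 0.1]]
# Hypothetical class prototypes, e.g. mean embeddings of labelled clips.
centroids = {"push": [1.0, 0.0, 0.0], "lift": [0.0, 1.0, 0.0]}

clip_embedding = mean_pool(clip_tokens)
print(nearest_label(clip_embedding, centroids))
```

Because the encoder stays frozen in this setup, only the lightweight classifier on top needs labelled data.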
Robotics
Spatial prediction, Spatial reasoning, Environment modeling, Embodied task planning, Motion planning, Robot manipulation, Robot control, Scene understanding
Benchmark results
4 benchmarks
Something-Something v2
top-1 accuracy · motion understanding
77.3%
Source: V-JEPA 2 paper (arXiv:2506.09985)
Epic-Kitchens-100
recall-at-5 · human action anticipation
39.7%
Source: V-JEPA 2 paper (arXiv:2506.09985)
PerceptionTest
video QA, V-JEPA 2 aligned with 8B LLM
84.0%
Source: V-JEPA 2 paper (arXiv:2506.09985)
TempCompass
video QA, V-JEPA 2 aligned with 8B LLM
76.9%
Source: V-JEPA 2 paper (arXiv:2506.09985)
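For readers unfamiliar with the metric names above, top-1 accuracy and recall-at-5 can both be read as instances of a top-k hit rate. The sketch below uses made-up scores and a hypothetical helper `top_k_hit_rate`. The official benchmark protocols may differ, for example by averaging per class rather than per sample.

```python
def top_k_hit_rate(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for sample_scores, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(sample_scores)),
                        key=lambda c: sample_scores[c], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)

# Made-up scores for 4 samples over 6 classes.
scores = [
    [0.10, 0.50, 0.20, 0.10, 0.05, 0.05],
    [0.30, 0.10, 0.40, 0.10, 0.05, 0.05],
    [0.05, 0.10, 0.15, 0.20, 0.20, 0.30],
    [0.60, 0.10, 0.10, 0.10, 0.07, 0.03],
]
labels = [1, 0, 0, 5]

print("top-1:", top_k_hit_rate(scores, labels, k=1))
print("top-5:", top_k_hit_rate(scores, labels, k=5))
```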