V-JEPA 2

2 · Family: V-JEPA / JEPA
Self-supervised joint-embedding predictive architecture world model, pre-trained on over 1M hours of video, enabling understanding, prediction and planning in the physical world.
Active · Public access · Open source · World Model · V-JEPA / JEPA
Parameters: 1.2B
Release date: 11 June 2025
Access: Download · Deployment: Local, Cloud

Overview

V-JEPA 2 is a world model from FAIR at Meta, built on the Joint-Embedding Predictive Architecture (JEPA) with a Vision Transformer backbone. It is pre-trained in a self-supervised fashion on over one million hours of internet video and images.

The action-conditioned variant, V-JEPA 2-AC, is post-trained on less than 62 hours of unlabeled robot video from the DROID dataset and enables zero-shot robotic planning. The authors deploy V-JEPA 2-AC zero-shot on Franka arms in two different labs, where it performs image-goal pick-and-place without data collection in the target environments and without task-specific training or rewards.
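Image-goal planning with an action-conditioned world model can be illustrated with a toy example. This page does not describe the planning procedure, so the sketch below rests on assumptions: a sampling-based planner (the cross-entropy method), an L1 distance between predicted and goal embeddings as the cost, and a hypothetical `toy_predictor` that stands in for the learned predictor. All function names and parameters are illustrative and are not part of any released API.

```python
import random
import statistics


def toy_predictor(z, action):
    """Hypothetical stand-in for a learned action-conditioned predictor.

    Here an action simply shifts the embedding; a real predictor is a network.
    """
    return [zi + ai for zi, ai in zip(z, action)]


def energy(z_start, actions, z_goal):
    """L1 distance between the rolled-out embedding and the goal embedding."""
    z = z_start
    for a in actions:
        z = toy_predictor(z, a)
    return sum(abs(zi - gi) for zi, gi in zip(z, z_goal))


def cem_plan(z_start, z_goal, horizon=3, dim=2, samples=200, elites=20,
             iters=10, seed=0):
    """Cross-entropy method: sample action sequences, keep the lowest-energy
    ones, refit a diagonal Gaussian to them, and repeat."""
    rng = random.Random(seed)
    mean = [[0.0] * dim for _ in range(horizon)]
    std = [[1.0] * dim for _ in range(horizon)]
    for _ in range(iters):
        candidates = [
            [[rng.gauss(mean[t][d], std[t][d]) for d in range(dim)]
             for t in range(horizon)]
            for _ in range(samples)
        ]
        candidates.sort(key=lambda seq: energy(z_start, seq, z_goal))
        best = candidates[:elites]
        for t in range(horizon):
            for d in range(dim):
                values = [seq[t][d] for seq in best]
                mean[t][d] = statistics.fmean(values)
                std[t][d] = statistics.pstdev(values) + 1e-3  # avoid collapse
    return mean  # the mean action sequence is used as the plan


z_start, z_goal = [0.0, 0.0], [1.5, -0.6]
plan = cem_plan(z_start, z_goal)
final_energy = energy(z_start, plan, z_goal)
print(f"energy before planning: {energy(z_start, [], z_goal):.2f}")
print(f"energy after planning:  {final_energy:.2f}")
```

In a receding-horizon setup, only the first action of the plan would be executed before re-encoding the new observation and planning again. No reward function or task-specific training enters the loop; the goal image alone defines the cost.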

Results

V-JEPA 2 reaches 77.3% top-1 accuracy on Something-Something v2 (motion understanding) and 39.7 recall-at-5 on Epic-Kitchens-100 (human action anticipation). When aligned with an 8B-parameter LLM, it reaches 84.0 on PerceptionTest and 76.9 on TempCompass, two video question-answering benchmarks.
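For readers unfamiliar with the two classification metrics, the following is a minimal sketch of how top-1 accuracy and recall-at-5 can be computed from per-class scores. It assumes recall-at-5 is averaged per class before averaging across classes (class-mean recall); the page does not state which variant is reported. The function names and toy scores are illustrative only.

```python
def top_k_hit(scores, label, k):
    """True if the correct label is among the k highest-scoring classes."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return label in ranked[:k]


def top1_accuracy(all_scores, labels):
    """Percentage of samples whose highest-scoring class is the label."""
    hits = [top_k_hit(s, y, 1) for s, y in zip(all_scores, labels)]
    return 100.0 * sum(hits) / len(hits)


def class_mean_recall_at_k(all_scores, labels, k=5):
    """Recall-at-k computed per class, then averaged over classes (assumed)."""
    per_class = {}
    for s, y in zip(all_scores, labels):
        per_class.setdefault(y, []).append(top_k_hit(s, y, k))
    recalls = [sum(h) / len(h) for h in per_class.values()]
    return 100.0 * sum(recalls) / len(recalls)


# Toy scores for four clips over six classes.
scores = [
    [0.90, 0.05, 0.02, 0.01, 0.01, 0.01],  # label 0: top-1 hit
    [0.30, 0.40, 0.10, 0.10, 0.05, 0.05],  # label 0: top-1 miss, top-5 hit
    [0.30, 0.25, 0.20, 0.15, 0.09, 0.01],  # label 5: missed even in top-5
    [0.10, 0.10, 0.60, 0.10, 0.05, 0.05],  # label 2: top-1 hit
]
labels = [0, 0, 5, 2]

top1 = top1_accuracy(scores, labels)
recall5 = class_mean_recall_at_k(scores, labels, k=5)
print(f"top-1 accuracy: {top1:.1f}%")
print(f"class-mean recall-at-5: {recall5:.1f}")
```

Averaging per class first keeps frequent classes from dominating the score, which matters on long-tailed action datasets.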

Classification: World Model
Access & deployment
Access: Download
Deployment: Local, Cloud
Weights: Open source
Key parameters
Parameters: 1.2B
Fine-tuning: supported
Input: video, image
Robotics: Spatial prediction, Spatial reasoning, Environment modeling, Embodied task planning, Motion planning, Robot manipulation, Robot control, Scene understanding

Technical specification

Parameters: 1.2B
License: MIT (model weights on Hugging Face)
Hardware requirements
The world model is pre-trained on NVIDIA GPU clusters. Inference with the reference ViT-L variant (~0.3B parameters, 64 frames, 256 px resolution) is feasible on a single consumer-grade GPU. The full 1.2B-parameter variant requires a server-grade GPU (e.g., A100 80GB or H100). Weights and reference code are provided in PyTorch.
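A rough back-of-the-envelope check on these figures can be sketched as follows. It covers weight storage only and assumes 2 bytes per parameter for half precision, a 16-pixel patch size, and a temporal tubelet of 2 frames. The patch and tubelet sizes are assumptions and are not stated on this page.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2}


def weight_memory_gib(num_params, dtype="fp16"):
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


def token_count(frames, resolution, patch=16, tubelet=2):
    """Number of spatio-temporal tokens for one clip.

    patch and tubelet are assumed values, not taken from this page.
    """
    return (frames // tubelet) * (resolution // patch) ** 2


vit_l_gib = weight_memory_gib(0.3e9)   # reference ViT-L variant
full_gib = weight_memory_gib(1.2e9)    # full 1.2B-parameter variant
tokens = token_count(frames=64, resolution=256)

print(f"ViT-L weights (fp16): {vit_l_gib:.2f} GiB")
print(f"1.2B weights (fp16):  {full_gib:.2f} GiB")
print(f"tokens per 64-frame, 256 px clip: {tokens}")
```

The weights themselves are small compared with GPU memory. Under these assumptions, most of the footprint comes from activations and attention over several thousand tokens per clip, which grows with frame count, resolution, and batch size.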
Features: Fine-tuning
Modalities
Input: video, image
Output: structured_data, motion_trajectories

Capabilities and applications

Native model capabilities
Video understanding
The model's ability to analyse and interpret video content: recognising actions, motion, events and relationships between objects over time.
Category: video
Planning
The model's ability to determine a sequence of actions leading to a goal: predicting the consequences of actions and selecting an optimal path in a given environment.
Category: planning
Vision encoder
The model's ability to encode images and video frames into dense representations (embeddings), used for downstream tasks or as a backbone for vision-language models.
Category: vision
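As an illustration of how frozen embeddings can feed a downstream task, the sketch below pools per-token embeddings into one clip vector and assigns the nearest class prototype by cosine similarity. This is a deliberately simple stand-in chosen for brevity, not the evaluation protocol of the model's authors. The three-dimensional vectors and class names are made up.

```python
import math


def mean_pool(tokens):
    """Average a list of token embeddings into a single clip embedding."""
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify(clip_tokens, prototypes):
    """Return the prototype name most similar to the pooled clip embedding."""
    z = mean_pool(clip_tokens)
    return max(prototypes, key=lambda name: cosine(z, prototypes[name]))


# Hypothetical encoder output: three tokens with three dimensions each.
clip_tokens = [[1.0, 0.1, 0.0], [0.8, 0.3, 0.1], [0.9, 0.2, 0.2]]
# Hypothetical class prototypes, e.g. mean embeddings of labelled clips.
prototypes = {"push": [1.0, 0.0, 0.0], "lift": [0.0, 1.0, 0.0]}

predicted = classify(clip_tokens, prototypes)
print(predicted)
```

Because the encoder stays frozen, only the prototypes (or a small probe in their place) depend on labelled data.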

Benchmark results

4 benchmarks (source for all results: V-JEPA 2 paper, arXiv:2506.09985)
Something-Something v2 · top-1 accuracy · motion understanding: 77.3%
Epic-Kitchens-100 · recall-at-5 · human action anticipation: 39.7
PerceptionTest · video QA, V-JEPA 2 aligned with 8B LLM: 84.0
TempCompass · video QA, V-JEPA 2 aligned with 8B LLM: 76.9

Technical architecture

Core Architecture
Training Techniques