V-JEPA 2

Self-supervised joint-embedding predictive architecture world model, pre-trained on over 1M hours of video, enabling understanding, prediction and planning in the physical world.

✓ Active · ✓ Public access · ⚖ Open source · World Model · 📁 V-JEPA / JEPA

Parameters

1.2B

Release date

11 June 2025

🏢 Producer: Meta AI

Access: Download · Deployment: 💻 Local, ☁ Cloud

Overview

V-JEPA 2 is a world model from FAIR at Meta, built on the Joint-Embedding Predictive Architecture (JEPA) with a Vision Transformer backbone. It is pre-trained in a self-supervised fashion on over one million hours of internet video and images.

The action-conditioned variant, V-JEPA 2-AC, is post-trained on less than 62 hours of unlabeled robot videos from the DROID dataset and enables zero-shot robotic planning. The authors deploy V-JEPA 2-AC zero-shot on Franka arms in two different labs and perform image-goal pick-and-place without collecting data in the target environments and without task-specific training or reward.
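Image-goal planning of this kind can be pictured as a search over action sequences: roll candidate actions through an action-conditioned predictor and keep those whose predicted latent state ends up closest to the goal's latent state. The sketch below is a toy illustration only, assuming a cross-entropy-method search and an L1 distance as the planning energy. `toy_predictor`, the 2-D latent space, and every hyperparameter are hypothetical stand-ins and do not come from the V-JEPA 2-AC code.

```python
import random

def toy_predictor(state, action):
    # Hypothetical stand-in for an action-conditioned predictor:
    # the next latent state is the current state shifted by the action.
    return [s + a for s, a in zip(state, action)]

def energy(state, goal):
    # L1 distance between a predicted latent state and the goal latent.
    return sum(abs(s - g) for s, g in zip(state, goal))

def rollout(state, actions):
    # Apply a sequence of actions through the predictor.
    for action in actions:
        state = toy_predictor(state, action)
    return state

def cem_plan(start, goal, horizon=3, samples=200, elites=20, iters=10, seed=0):
    rng = random.Random(seed)
    dim = len(start)
    mean = [[0.0] * dim for _ in range(horizon)]
    std = [[1.0] * dim for _ in range(horizon)]
    for _ in range(iters):
        # Sample candidate action sequences from the current Gaussian.
        candidates = [
            [[rng.gauss(mean[t][d], std[t][d]) for d in range(dim)]
             for t in range(horizon)]
            for _ in range(samples)
        ]
        # Rank candidates by the energy of their final predicted state.
        candidates.sort(key=lambda seq: energy(rollout(start, seq), goal))
        top = candidates[:elites]
        # Refit the Gaussian to the elite sequences.
        for t in range(horizon):
            for d in range(dim):
                vals = [seq[t][d] for seq in top]
                m = sum(vals) / elites
                var = sum((v - m) ** 2 for v in vals) / elites
                mean[t][d] = m
                std[t][d] = max(var ** 0.5, 1e-3)
    return mean

start, goal = [0.0, 0.0], [1.5, -0.5]
plan = cem_plan(start, goal)
final_energy = energy(rollout(start, plan), goal)
print(f"energy before: {energy(start, goal):.3f}, after: {final_energy:.3f}")
```

In a real deployment the start and goal states would be encoder embeddings of the current and goal images, and typically only the first planned action would be executed before re-planning.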

Results

V-JEPA 2 achieves 77.3% top-1 accuracy on Something-Something v2 (motion understanding) and 39.7 recall-at-5 on Epic-Kitchens-100 (human action anticipation). When aligned with an 8B-parameter LLM, the model reaches 84.0 on PerceptionTest and 76.9 on TempCompass, both video question-answering benchmarks.

Classification

World Model

Family: V-JEPA / JEPA

Access & deployment

Access: Download

Deployment: Local, Cloud

Weights: Open source

Key parameters

🧩 Parameters: 1.2B

✓ Fine-tuning

📥 Input: video, image

Robotics

Spatial prediction, Spatial reasoning, Environment modeling, Embodied task planning, Motion planning, Robot manipulation, Robot control, Scene understanding

Technical specification

Parameters

1.2B

License

MIT (model weights on Hugging Face)

Hardware requirements

Pretraining of the world model is performed on NVIDIA GPU clusters. Inference with the reference ViT-L variant (~0.3B parameters, 64 frames, 256 px resolution) is feasible on a single consumer-grade GPU. The full 1.2B-parameter variant requires a server-grade GPU (e.g., A100 80GB / H100). Weights and reference code are provided for PyTorch.

Features: ✓ Fine-tuning
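As a rough sanity check on these requirements, the memory taken by the weights alone follows from parameter count times bytes per parameter. The sketch below is back-of-the-envelope arithmetic, assuming 4 bytes per parameter in fp32 and 2 in fp16. It gives only a lower bound, since activations for long, high-resolution clips and any framework overhead are not counted.

```python
def weight_memory_gib(num_params, bytes_per_param):
    # Memory needed to hold the weights alone, in GiB.
    return num_params * bytes_per_param / 2**30

# Parameter counts as listed on this page.
variants = {"ViT-L (~0.3B)": 0.3e9, "full model (1.2B)": 1.2e9}

for name, n in variants.items():
    fp32 = weight_memory_gib(n, 4)
    fp16 = weight_memory_gib(n, 2)
    print(f"{name}: {fp32:.2f} GiB in fp32, {fp16:.2f} GiB in fp16")
```

Actual peak usage depends on clip length, resolution, batch size and precision, so measured numbers on the target GPU should take precedence over this estimate.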

Modalities

⬇ Input

video, image

⬆ Output

structured_data, motion_trajectories

Capabilities and applications

Native model capabilities

Video understanding

The model's ability to analyse and interpret video content — recognising actions, motion, events and relationships between objects over time.

Category: video

Planning

Forming and executing action plans for complex tasks.

Category: planning

Vision encoder

The model's ability to encode images and video frames into dense representations (embeddings), used for downstream tasks or as a backbone for vision-language models.
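To give a sense of the size of such a dense representation, the number of tokens a video Vision Transformer produces for one clip can be estimated from the clip shape. The sketch below uses the 64-frame, 256 px configuration listed under hardware requirements. The patch size of 16 and tubelet size of 2 are assumptions for illustration and are not stated on this page.

```python
def num_tokens(frames, resolution, patch_size=16, tubelet_size=2):
    # patch_size and tubelet_size are assumed values, not taken from this page.
    if frames % tubelet_size or resolution % patch_size:
        raise ValueError("frames and resolution must divide evenly")
    temporal = frames // tubelet_size           # token positions along time
    spatial = (resolution // patch_size) ** 2   # patches per frame group
    return temporal * spatial

tokens = num_tokens(frames=64, resolution=256)
print(tokens)
```

Each token is one embedding vector, so downstream heads or a language model see a sequence of this length per clip unless the tokens are pooled first.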

Category: vision

Robotics

Spatial prediction, Spatial reasoning, Environment modeling, Embodied task planning, Motion planning, Robot manipulation, Robot control, Scene understanding

Benchmark results

4 benchmarks

Something-Something v2

top-1 accuracy · motion understanding

77.3%

📄 V-JEPA 2 paper (arXiv:2506.09985)

Epic-Kitchens-100

recall-at-5 · human action anticipation

39.7

📄 V-JEPA 2 paper (arXiv:2506.09985)

PerceptionTest

video QA, V-JEPA 2 aligned with 8B LLM

84.0

📄 V-JEPA 2 paper (arXiv:2506.09985)

TempCompass

video QA, V-JEPA 2 aligned with 8B LLM

76.9

📄 V-JEPA 2 paper (arXiv:2506.09985)
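For readers unfamiliar with the first two metrics, top-1 accuracy and recall-at-5 both ask whether the correct class appears among the k highest-scoring predictions, with k = 1 and k = 5 respectively. The sketch below shows the plain per-sample version on made-up scores. Benchmark-specific averaging (for example, per-class averaging) is not reproduced here, and the data is purely illustrative.

```python
def top_k_hit(scores, label, k):
    # True if the correct label is among the k highest-scoring classes.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return label in ranked[:k]

def top_k_rate(all_scores, labels, k):
    # Fraction of samples whose label is in the top k predictions.
    hits = sum(top_k_hit(s, y, k) for s, y in zip(all_scores, labels))
    return hits / len(labels)

# Made-up class scores for three samples over six classes.
scores = [
    [0.10, 0.50, 0.20, 0.05, 0.05, 0.10],
    [0.30, 0.10, 0.25, 0.20, 0.10, 0.05],
    [0.30, 0.25, 0.20, 0.15, 0.08, 0.02],
]
labels = [1, 2, 5]

top1 = top_k_rate(scores, labels, k=1)
recall5 = top_k_rate(scores, labels, k=5)
print(f"top-1: {top1:.3f}, recall-at-5: {recall5:.3f}")
```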

Technical architecture

Core Architecture

ViT

Model Form

World Models, WAM

Training Techniques

Pretraining

Sources and related pages

4 sources

Paper: V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning (arxiv.org)

Blog: Our New Model Helps AI Think Before it Acts, Meta Newsroom (about.fb.com)

Repo: facebookresearch/vjepa2, GitHub (github.com)

Docs: V-JEPA 2 model card, Hugging Face, facebook/vjepa2-vitl-fpc64-256 (huggingface.co)
