V-JEPA

1 · Family: V-JEPA / JEPA
Self-supervised video vision model based on the Joint-Embedding Predictive Architecture. It learns representations purely through feature prediction, without pretrained image encoders, text, negative examples, or pixel reconstruction.
Archived · Research only · Open weights · Vision · V-JEPA / JEPA
Parameters: ViT-L (~300M) to ViT-H/16 (~630M)
Release date: 15 February 2024
Access: Download
Deployment: Local

Overview

V-JEPA (Video Joint-Embedding Predictive Architecture) is a self-supervised method for learning visual representations from video, developed at Meta FAIR. The paper "Revisiting Feature Prediction for Learning Visual Representations from Video" (arXiv:2404.08471) was released on 15 February 2024. Authors: Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, Nicolas Ballas.

V-JEPA is trained solely with a feature prediction objective, without pretrained image encoders, text, negative examples, or pixel reconstruction. It extends the I-JEPA (2023) approach to the video domain and is a direct predecessor of V-JEPA 2.
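
As a rough illustration of what a feature prediction objective means, the sketch below computes a loss in representation space rather than pixel space. It is a toy in plain Python, not the authors' implementation: the function names `l1_feature_loss` and `ema_update`, the choice of an L1 distance, and the slowly updated (exponential moving average) target weights are assumptions based on the general JEPA recipe, not details stated on this page.

```python
# Toy sketch of a JEPA-style feature prediction objective (hypothetical names).
# Features are plain lists of floats; a real model would use ViT token embeddings.

def l1_feature_loss(predicted, target, masked_idx):
    """Mean absolute error between predicted and target features,
    computed only over the masked token positions."""
    total, count = 0.0, 0
    for i in masked_idx:
        for p, t in zip(predicted[i], target[i]):
            total += abs(p - t)
            count += 1
    return total / count


def ema_update(target_weights, online_weights, momentum=0.99):
    """Move target-encoder weights a small step toward the online encoder.
    The target side receives no gradient; it only follows this average."""
    return [momentum * t + (1.0 - momentum) * o
            for t, o in zip(target_weights, online_weights)]


# Two tokens with 2-dim features; only token 0 is masked and must be predicted.
predicted = [[1.0, 2.0], [9.0, 9.0]]
target = [[0.0, 0.0], [0.0, 0.0]]
loss = l1_feature_loss(predicted, target, masked_idx=[0])  # (1 + 2) / 2 = 1.5
```

The point of the sketch is that nothing is compared against raw pixels: the error is measured between feature vectors, and only at positions that were hidden from the online encoder.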

Architecture and data

The backbone is a Vision Transformer (ViT-L or ViT-H/16), pre-trained on VideoMix2M, a collection of 2 million videos gathered from public video datasets. Representations are evaluated on downstream image and video tasks with a frozen backbone, that is, without adapting the model's parameters.
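
The frozen-backbone protocol can be sketched in a few lines: the encoder's weights are fixed, and only a small probe on top of its features is trained. Everything below is a hypothetical stand-in (a 2x2 linear map as the "backbone", a logistic-regression probe, made-up data) and only illustrates the evaluation idea; it is not the probe used in the paper.

```python
import math

# Hypothetical frozen "backbone": a fixed linear map that is never updated.
FROZEN_W = [[1.0, -1.0], [0.5, 0.5]]


def frozen_encoder(x):
    """Map an input vector to features using the fixed backbone weights."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in FROZEN_W]


def train_probe(samples, labels, lr=0.5, epochs=50):
    """Train a logistic-regression probe on frozen features with plain SGD.
    Only the probe parameters (w, b) change; FROZEN_W is read-only."""
    dim = len(frozen_encoder(samples[0]))
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            f = frozen_encoder(x)
            z = sum(wi * fi for wi, fi in zip(w, f)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y  # gradient of the log loss with respect to z
            w = [wi - lr * g * fi for wi, fi in zip(w, f)]
            b -= lr * g
    return w, b


def predict(x, w, b):
    """Classify an input by the sign of the probe's logit."""
    f = frozen_encoder(x)
    return 1 if sum(wi * fi for wi, fi in zip(w, f)) + b > 0 else 0


# Made-up, linearly separable toy data.
samples = [[2.0, 0.0], [3.0, 1.0], [0.0, 2.0], [1.0, 3.0]]
labels = [1, 1, 0, 0]
w, b = train_probe(samples, labels)
predictions = [predict(x, w, b) for x in samples]
```

Because the backbone never changes, any accuracy the probe reaches reflects the quality of the pre-trained features rather than task-specific fine-tuning.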

Results

The largest model, a ViT-H/16 trained only on videos, reaches 81.9% top-1 on Kinetics-400, 72.2% on Something-Something v2 and 77.9% on ImageNet-1K, all with a frozen backbone. This shows that learning by predicting video features yields versatile visual representations that perform well on both motion- and appearance-based tasks.

Position in the JEPA family

V-JEPA is the first JEPA-family model trained on video, the successor of I-JEPA (2023) and the direct predecessor of V-JEPA 2 (2025), which scales the approach to more than one million hours of video and adds an action-conditioned variant for robotic planning.

Classification: Vision
Access & deployment: Download, Local
Weights: Open weights
Key parameters
Parameters: ViT-L (~300M) to ViT-H/16 (~630M)
Fine-tuning: supported
Input: video, image

Technical specification

Parameters: ViT-L (~300M) to ViT-H/16 (~630M)
Hardware requirements
Pre-training of the ViT-H/16 backbone is performed on NVIDIA A100 GPU clusters. Inference is feasible on a single consumer- or server-grade GPU.
Features: Fine-tuning
Modalities
Input: video, image
Output: structured_data
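
The parameter counts listed in this specification can be sanity-checked with a common rule of thumb for transformers. The sketch below assumes the standard ViT-L (24 layers, width 1024) and ViT-H (32 layers, width 1280) configurations, which are not stated on this page, and the helper name `approx_vit_params` is hypothetical.

```python
# Rough ViT parameter estimate: each transformer block holds about
# 4*d^2 attention weights plus 8*d^2 MLP weights (MLP ratio 4), i.e. 12*d^2.
# Embeddings, biases and layer norms are ignored, so this slightly undercounts.

def approx_vit_params(depth, width):
    return 12 * depth * width ** 2


# Standard ViT configurations (assumed here, not listed on this page).
vit_l = approx_vit_params(depth=24, width=1024)  # 301_989_888, about 300M
vit_h = approx_vit_params(depth=32, width=1280)  # 629_145_600, about 630M
```

Both estimates land close to the ~300M and ~630M figures quoted above.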

Capabilities and applications

Native model capabilities
Vision encoder
The model's ability to encode images and video frames into dense representations (embeddings), used for downstream tasks or as a backbone for vision-language models.
Category: vision
Video understanding
The model's ability to analyse and interpret video content: recognising actions, motion, events and relationships between objects over time.
Category: video
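
To make the idea of encoding a clip into dense representations more concrete, the following sketch counts how many tokens (one embedding each) a video ViT would produce for a clip. Only the spatial patch size of 16 is implied by the "/16" in ViT-H/16; the clip length of 16 frames, the 224x224 resolution and the temporal tubelet depth of 2 are assumptions for illustration, and `num_tokens` is a hypothetical helper.

```python
def num_tokens(frames, height, width, tubelet=2, patch=16):
    """Number of spatio-temporal patches when a clip is cut into
    tubelet x patch x patch blocks, each mapped to one embedding."""
    return (frames // tubelet) * (height // patch) * (width // patch)


# Assumed clip shape: 16 frames at 224x224 pixels.
tokens = num_tokens(frames=16, height=224, width=224)  # 8 * 14 * 14 = 1568
```

Under these assumptions, a downstream probe would receive 1568 feature vectors per clip.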

Benchmark results

3 benchmarks (top-1 accuracy, frozen backbone, ViT-H/16; source: V-JEPA paper, arXiv:2404.08471)
Kinetics-400: 81.9%
Something-Something v2: 72.2%
ImageNet-1K: 77.9%
