V-JEPA

Self-supervised video vision model based on the Joint-Embedding Predictive Architecture, learning representations purely through feature prediction — without data augmentation, text, or pixel reconstruction.

📦 Archived · 🔬 Research only · ⚖ Open weights · Vision · 📁 V-JEPA / JEPA

Parameters

ViT-L (~300M) – ViT-H/16 (~630M)

Release date

15 February 2024

🏢 Meta AI (Producer)

Access: Download · Deployment: 💻 Local

Overview

V-JEPA (Video Joint-Embedding Predictive Architecture) is a self-supervised method for learning visual representations from video, developed at Meta FAIR. The paper "Revisiting Feature Prediction for Learning Visual Representations from Video" (arXiv:2404.08471) was released on 15 February 2024. Authors: Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, Nicolas Ballas.

V-JEPA is trained solely with a feature prediction objective, without pretrained image encoders, text, negative examples, or pixel reconstruction. It extends the I-JEPA (2023) approach to the video domain and is a direct predecessor of V-JEPA 2.
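
As a rough illustration of what a feature prediction objective computes, the sketch below compares predicted feature vectors for masked patches against target feature vectors and averages the absolute differences. The use of an L1 distance, the function name `feature_prediction_loss`, and the toy vectors are assumptions made for illustration; in the actual method both predictions and targets come from learned networks, which are not modelled here.

```python
from typing import Sequence

Vector = Sequence[float]


def feature_prediction_loss(predicted: Sequence[Vector], target: Sequence[Vector]) -> float:
    """Mean absolute difference between predicted and target features (illustrative)."""
    if len(predicted) != len(target) or not predicted:
        raise ValueError("need the same non-zero number of predicted and target vectors")
    total = 0.0
    count = 0
    for p_vec, t_vec in zip(predicted, target):
        if len(p_vec) != len(t_vec):
            raise ValueError("feature dimensions must match")
        for p, t in zip(p_vec, t_vec):
            total += abs(p - t)
            count += 1
    return total / count


# Toy example: two masked patches with three-dimensional features each.
targets = [[1.0, 0.0, 2.0], [0.5, 0.5, 0.5]]
predictions = [[1.0, 1.0, 2.0], [0.0, 0.5, 1.0]]
loss = feature_prediction_loss(predictions, targets)
```

The point of the sketch is that the error is measured in feature space: no pixels are reconstructed and no negative examples are compared.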

Architecture and data

The backbone is a Vision Transformer (ViT-L and ViT-H/16). Models are pre-trained on VideoMix2M, a collection of 2 million videos gathered from public video datasets. Representations are evaluated on downstream image and video tasks with a frozen backbone, that is, without adapting the model's parameters.
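
To give a sense of the input size a ViT/16 backbone handles, the following sketch counts the tokens produced when a clip is split into spatio-temporal patches. The "/16" denotes a 16×16 pixel patch; the clip length of 16 frames, the 224×224 resolution, and the temporal patch depth of 2 frames are assumed values chosen for illustration, not figures stated on this page.

```python
def count_video_tokens(frames: int, height: int, width: int,
                       patch_size: int = 16, tubelet_frames: int = 2) -> int:
    """Number of non-overlapping spatio-temporal patches in a clip (illustrative)."""
    if frames % tubelet_frames or height % patch_size or width % patch_size:
        raise ValueError("clip dimensions must be divisible by the patch dimensions")
    temporal = frames // tubelet_frames      # patches along the time axis
    vertical = height // patch_size          # patches along the height
    horizontal = width // patch_size         # patches along the width
    return temporal * vertical * horizontal


# Assumed clip: 16 frames at 224x224 pixels.
tokens = count_video_tokens(frames=16, height=224, width=224)
print(tokens)  # 8 * 14 * 14 = 1568
```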

Results

The largest model, a ViT-H/16 trained only on videos, reaches 81.9% top-1 on Kinetics-400, 72.2% on Something-Something v2, and 77.9% on ImageNet-1K, all with a frozen backbone. These results indicate that learning by predicting video features yields versatile visual representations that perform well on both motion- and appearance-based tasks.
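
Top-1 accuracy, the metric quoted above, is the fraction of samples whose highest-scoring class matches the true label. A minimal sketch with made-up scores follows; the function name and the data are illustrative only and unrelated to the reported numbers.

```python
from typing import Sequence


def top1_accuracy(scores: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Fraction of samples whose highest-scoring class index equals the label."""
    if len(scores) != len(labels) or not labels:
        raise ValueError("need one label per score row and at least one sample")
    correct = 0
    for row, label in zip(scores, labels):
        predicted = max(range(len(row)), key=row.__getitem__)  # index of the top score
        if predicted == label:
            correct += 1
    return correct / len(labels)


# Four toy samples over three classes; the last one is misclassified.
scores = [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1], [0.2, 0.2, 0.6], [0.5, 0.4, 0.1]]
labels = [1, 0, 2, 1]
accuracy = top1_accuracy(scores, labels)  # 3 of 4 correct -> 0.75
```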

Position in the JEPA family

V-JEPA is the first JEPA-family model trained on video, the successor of I-JEPA (2023) and the direct predecessor of V-JEPA 2 (2025), which scales the approach to more than one million hours of video and adds an action-conditioned variant for robotic planning.

Classification

Vision

Family: V-JEPA / JEPA

Access & deployment

Access: Download

Deployment: Local

Weights: Open weights

Key parameters

🧩 Parameters: ViT-L (~300M) – ViT-H/16 (~630M)

✓ Fine-tuning

📥 Input: video, image

Technical specification

Hardware requirements

Pre-training of the ViT-H/16 backbone was performed on clusters of NVIDIA A100 GPUs. Inference is feasible on a single consumer- or server-grade GPU.
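
A back-of-the-envelope estimate helps explain why single-GPU inference is plausible. The sketch below multiplies the approximate parameter counts listed on this page (~300M and ~630M) by the bytes per parameter for 32-bit and 16-bit floats. It covers weights only; activations, which depend on clip length and batch size, are deliberately left out, so the real memory footprint is higher.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}


def weight_memory_gb(num_params: float, dtype: str = "fp32") -> float:
    """Memory needed to hold the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Approximate parameter counts taken from this page.
models = {"ViT-L": 300e6, "ViT-H/16": 630e6}
estimates = {
    (name, dtype): weight_memory_gb(params, dtype)
    for name, params in models.items()
    for dtype in BYTES_PER_PARAM
}

for (name, dtype), gigabytes in estimates.items():
    print(f"{name} ({dtype}): ~{gigabytes:.2f} GB of weights")
```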

Features: ✓ Fine-tuning

Modalities

⬇ Input

video, image

⬆ Output

structured_data

Capabilities and applications

Native model capabilities

Vision encoder

The model's ability to encode images and video frames into dense representations (embeddings), used for downstream tasks or as a backbone for vision-language models.
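
One simple way to use fixed embeddings for a downstream task is a nearest-centroid classifier: average the embeddings of each class, then assign a new embedding to the class whose centroid is most similar. The sketch below uses made-up three-dimensional vectors and hypothetical class names standing in for encoder outputs; no V-JEPA API is called, and this is not the evaluation protocol used in the paper.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / norm


def class_centroids(embeddings: Sequence[Sequence[float]],
                    labels: Sequence[str]) -> Dict[str, List[float]]:
    """Average the embeddings belonging to each class."""
    grouped: Dict[str, List[Sequence[float]]] = {}
    for vec, label in zip(embeddings, labels):
        grouped.setdefault(label, []).append(vec)
    return {label: [sum(col) / len(vecs) for col in zip(*vecs)]
            for label, vecs in grouped.items()}


def classify(embedding: Sequence[float], centroids: Dict[str, List[float]]) -> str:
    """Return the class whose centroid is most similar to the embedding."""
    return max(centroids, key=lambda label: cosine_similarity(embedding, centroids[label]))


# Hypothetical embeddings; in practice these would come from the frozen encoder.
train_embeddings = [[1.0, 0.1, 0.0], [0.9, 0.0, 0.1], [0.0, 1.0, 0.2], [0.1, 0.8, 0.0]]
train_labels = ["pouring", "pouring", "waving", "waving"]
centroids = class_centroids(train_embeddings, train_labels)
prediction = classify([0.8, 0.2, 0.0], centroids)
```

Because only the centroids are computed from labelled data, the encoder that produced the embeddings never needs to be updated.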

Category: vision

Video understanding

The model's ability to analyse and interpret video content — recognising actions, motion, events and relationships between objects over time.

Category: video

Benchmark results

3 benchmarks

Kinetics-400: 81.9% top-1 accuracy (frozen backbone, ViT-H/16). Source: V-JEPA paper (arXiv:2404.08471)

Something-Something v2: 72.2% top-1 accuracy (frozen backbone, ViT-H/16). Source: V-JEPA paper (arXiv:2404.08471)

ImageNet-1K: 77.9% top-1 accuracy (frozen backbone, ViT-H/16). Source: V-JEPA paper (arXiv:2404.08471)

Technical architecture

Core Architecture

ViT

Training Techniques

Pretraining

Sources and related pages

3 sources

Paper: Revisiting Feature Prediction for Learning Visual Representations from Video (V-JEPA, arXiv:2404.08471), arxiv.org

Blog: V-JEPA: The next step toward advanced machine intelligence (Meta AI), ai.meta.com

Repo: facebookresearch/jepa (GitHub), github.com
