
Self-supervised video vision model based on the Joint-Embedding Predictive Architecture, learning representations purely through feature prediction, without data augmentation, text, or pixel reconstruction.
Parameters
ViT-L (~300M) to ViT-H/16 (~630M) parameters
Release date
15 February 2024
Access: Download
Deployment: Local
Overview
Access & deployment
Download
Local
Weights: Open weights
Key parameters
Parameters: ViT-L (~300M) to ViT-H/16 (~630M)
Fine-tuning: supported
Input: video, image
Technical specification
Parameters
ViT-L (~300M) to ViT-H/16 (~630M) parameters
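The approximate sizes above can be sanity-checked with a back-of-the-envelope count. The sketch below assumes the standard Vision Transformer configurations (ViT-L: 24 blocks of width 1024; ViT-H: 32 blocks of width 1280, both with an MLP ratio of 4), which this card does not state. It counts only the transformer blocks and ignores patch and position embeddings. The function names are illustrative.

```python
def vit_block_params(width: int, mlp_ratio: int = 4) -> int:
    """Parameters in one transformer block of the given width."""
    # Q, K, V and output projections, each width x width plus a bias vector.
    attention = 4 * width * width + 4 * width
    hidden = mlp_ratio * width
    # Two linear layers (width -> hidden -> width), with biases.
    mlp = 2 * width * hidden + hidden + width
    # Two LayerNorms, each with a scale and a shift vector.
    layer_norms = 2 * 2 * width
    return attention + mlp + layer_norms


def vit_encoder_params(depth: int, width: int) -> int:
    """Transformer-stack parameters only; embeddings are ignored."""
    return depth * vit_block_params(width)


# Assumed standard configurations (depth, width); not stated in this card.
CONFIGS = {"ViT-L": (24, 1024), "ViT-H": (32, 1280)}

for name, (depth, width) in CONFIGS.items():
    print(f"{name}: {vit_encoder_params(depth, width) / 1e6:.1f}M")
```

Under these assumptions the estimate comes out at roughly 302M for ViT-L and 630M for ViT-H, in line with the figures listed above.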
Hardware requirements
Pre-training of the ViT-H/16 backbone was performed on NVIDIA A100 GPU clusters. Inference is feasible on a single consumer- or server-grade GPU.
Features: Fine-tuning
Modalities
Input
video, image
Output
structured_data
Capabilities and applications
Native model capabilities
Vision encoder
The model's ability to encode images and video frames into dense representations (embeddings), used for downstream tasks or as a backbone for vision-language models.
Category: vision
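As a rough illustration of using the embeddings for a downstream task, the sketch below averages per-token embeddings into one clip-level vector and assigns the label of the most similar class prototype. The toy 3-dimensional vectors stand in for real encoder output, and the labels and helper names are hypothetical. Nearest-prototype matching is a simpler stand-in for a trained classification head, not the evaluation protocol behind the benchmark figures in this card.

```python
import math


def mean_pool(tokens):
    """Average a list of token embeddings into one clip-level embedding."""
    dim = len(tokens[0])
    return [sum(tok[i] for tok in tokens) / len(tokens) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def nearest_label(query, prototypes):
    """Return the label whose prototype embedding is most similar to the query."""
    return max(prototypes, key=lambda label: cosine(query, prototypes[label]))


# Toy token embeddings standing in for frozen encoder output (hypothetical values).
clip_tokens = [[0.9, 0.1, 0.0], [0.7, 0.3, 0.0]]
# One prototype embedding per class, e.g. the mean embedding of labelled clips.
prototypes = {"pouring": [1.0, 0.0, 0.0], "folding": [0.0, 1.0, 0.0]}

clip_embedding = mean_pool(clip_tokens)
predicted = nearest_label(clip_embedding, prototypes)
print(predicted)
```

Because the encoder stays frozen, only the prototypes (or, in a trained setup, the small head on top) depend on the downstream labels.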
Video understanding
The model's ability to analyse and interpret video content: recognising actions, motion, events and relationships between objects over time.
Category: video
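Reasoning over time starts with how a clip is cut into spatio-temporal tokens. The sketch below counts those tokens for a ViT with 16x16 patches (the "/16" in the model name). The clip length of 16 frames, the 224x224 resolution and the tubelet depth of 2 frames are illustrative assumptions, not values taken from this card.

```python
def tubelet_token_count(frames, height, width, patch=16, tubelet_depth=2):
    """Number of tokens when a clip is split into patch x patch x tubelet_depth blocks."""
    if frames % tubelet_depth or height % patch or width % patch:
        raise ValueError("clip dimensions must be divisible by the tubelet size")
    return (frames // tubelet_depth) * (height // patch) * (width // patch)


# Assumed clip shape: 16 frames at 224x224 (hypothetical example values).
tokens = tubelet_token_count(frames=16, height=224, width=224)
print(tokens)
```

Each token spans a small region of space and a short interval of time, so attention across tokens can relate objects both within a frame and across frames.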
Benchmark results
3 benchmarks
Kinetics-400
top-1 accuracy · frozen backbone, ViT-H/16
81.9%
Source: V-JEPA paper (arXiv:2404.08471)
Something-Something v2
top-1 accuracy · frozen backbone, ViT-H/16
72.2%
Source: V-JEPA paper (arXiv:2404.08471)
ImageNet-1K
top-1 accuracy · frozen backbone, ViT-H/16
77.9%
Source: V-JEPA paper (arXiv:2404.08471)
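For readers unfamiliar with the metric, top-1 accuracy is the fraction of samples whose highest-scoring class equals the ground-truth label. A minimal sketch, with made-up scores and labels for illustration only:

```python
def top1_accuracy(scores, labels):
    """Fraction of samples whose highest-scoring class equals the true label."""
    if len(scores) != len(labels) or not labels:
        raise ValueError("scores and labels must be non-empty and the same length")
    correct = 0
    for row, label in zip(scores, labels):
        # Index of the largest score is the predicted class.
        predicted = max(range(len(row)), key=row.__getitem__)
        correct += predicted == label
    return correct / len(labels)


# Hypothetical per-class scores for four samples and three classes.
scores = [
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
    [0.5, 0.4, 0.1],
]
labels = [1, 0, 2, 1]

accuracy = top1_accuracy(scores, labels)
print(f"{accuracy:.1%}")
```

"Frozen backbone" means the encoder weights are not updated during evaluation; only the classifier that produces these scores is trained.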
Technical architecture
Core Architecture
Training Techniques
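The summary at the top of this card describes learning by feature prediction rather than pixel reconstruction. A minimal sketch of that idea follows: a predictor's outputs are compared with target features only at masked token positions, and the target encoder is a slowly updated copy of the trained encoder. The L1 distance, the exponential-moving-average update and the momentum value are assumptions for illustration and are not specified in this card. All names and numbers are hypothetical.

```python
def l1_feature_loss(predicted, target, masked_indices):
    """Mean absolute error between predicted and target features at masked tokens only."""
    total, count = 0.0, 0
    for idx in masked_indices:
        for p, t in zip(predicted[idx], target[idx]):
            total += abs(p - t)
            count += 1
    return total / count


def ema_update(target_weights, online_weights, momentum=0.998):
    """Move target weights a small step toward the online encoder weights."""
    return [
        momentum * t + (1.0 - momentum) * o
        for t, o in zip(target_weights, online_weights)
    ]


# Toy features for three tokens; tokens 1 and 2 are masked (hypothetical values).
predicted = [[0.0, 0.0], [1.0, 2.0], [0.5, 0.5]]
target = [[9.0, 9.0], [1.5, 2.0], [0.5, 1.5]]
loss = l1_feature_loss(predicted, target, masked_indices=[1, 2])

# The target encoder is not trained by gradient descent; it trails the online encoder.
new_target = ema_update([1.0, 0.0], [0.0, 1.0], momentum=0.9)
print(loss, new_target)
```

The loss lives entirely in feature space, so no decoder back to pixels is needed. Token 0 is visible and therefore contributes nothing to the loss, however far its prediction is from the target.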