I-JEPA
Self-supervised vision model based on the Joint-Embedding Predictive Architecture, learning semantic image representations by predicting embeddings of masked image regions.
Archived | Research only | Open weights | Vision | V-JEPA / JEPA
Parameters
~632M (ViT-H/14) – ~1B (ViT-g/16) parameters
Release date
19 January 2023
Access: Download | Deployment: Local

Overview

I-JEPA (Image-based Joint-Embedding Predictive Architecture) is a self-supervised method for learning image representations developed at Meta FAIR. It was first released as arXiv:2301.08243 on 19 January 2023 and presented at CVPR 2023 as a Highlight. Authors: Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, Nicolas Ballas.

The core idea of I-JEPA: from a single context block, the model predicts the representations (embeddings) of several target blocks in the same image, without generating pixels and without hand-crafted data augmentations. Two design choices are crucial: target blocks must be large enough to carry semantic content, and the context block must be sufficiently informative (spatially distributed).
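The context/target split described above can be sketched as a block sampler on the grid of image patches. This is a minimal illustration, not the reference implementation: the function names `sample_block` and `sample_masks`, the 16 x 16 grid, the count of four target blocks, and the scale and aspect-ratio ranges are all assumptions introduced here and are not stated on this page.

```python
import random

def sample_block(grid, scale, aspect, rng):
    """Sample a rectangular block of patch indices on a grid x grid layout."""
    area = rng.uniform(*scale) * grid * grid      # block area, in patches
    ratio = rng.uniform(*aspect)                  # height / width
    h = max(1, min(grid, round((area * ratio) ** 0.5)))
    w = max(1, min(grid, round((area / ratio) ** 0.5)))
    top = rng.randint(0, grid - h)                # inclusive bounds
    left = rng.randint(0, grid - w)
    return {r * grid + c
            for r in range(top, top + h)
            for c in range(left, left + w)}

def sample_masks(grid=16, num_targets=4, seed=0):
    """Return (context patch set, list of target patch sets)."""
    rng = random.Random(seed)
    # Assumed ranges: fairly large target blocks, one large context block.
    targets = [sample_block(grid, (0.15, 0.2), (0.75, 1.5), rng)
               for _ in range(num_targets)]
    context = sample_block(grid, (0.85, 1.0), (1.0, 1.0), rng)
    # Drop target patches from the context so the targets cannot be copied.
    for t in targets:
        context -= t
    return context, targets

context, targets = sample_masks()
print("context patches:", len(context))
print("target block sizes:", [len(t) for t in targets])
```

In a full training loop, the context patches would go through the context encoder, a predictor would output one embedding per target patch, and the loss would compare those outputs with the target encoder's embeddings rather than with pixels.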

Architecture and scale

The model uses a Vision Transformer backbone (ViT-H/14, ViT-H/16 448px, ViT-g/16). I-JEPA is computationally efficient: training ViT-H/14 on ImageNet-1K takes under 72 hours on 16 A100 GPUs. Reference weights for the ViT-H/14, ViT-H/16 (448px) and ViT-g/16 variants are publicly available (pretrained on ImageNet-1K and ImageNet-22K).
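The variant names encode the patch size (the number after the slash), which together with the input resolution fixes how many patch tokens the encoder processes. The short calculation below illustrates this. The 448 px resolution for ViT-H/16 comes from the text above; the 224 px resolution used for ViT-H/14 and ViT-g/16 is an assumption, and `patch_grid` is a hypothetical helper.

```python
def patch_grid(image_size: int, patch_size: int) -> tuple[int, int]:
    """Return (patches per side, total patches) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be divisible by patch size")
    side = image_size // patch_size
    return side, side * side

# (variant, input resolution in px, patch size in px)
# 224 px is an assumed resolution; only 448 px is given in the text.
VARIANTS = [
    ("ViT-H/14", 224, 14),
    ("ViT-H/16 448px", 448, 16),
    ("ViT-g/16", 224, 16),
]

for name, resolution, patch in VARIANTS:
    side, total = patch_grid(resolution, patch)
    print(f"{name}: {side} x {side} grid, {total} patch tokens")
```

Under these assumptions the 448 px variant handles roughly three times as many tokens per image as ViT-H/14, which is worth keeping in mind when estimating memory and throughput.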

Position in the JEPA family

I-JEPA is the first full model from the JEPA family for images. It is a starting point for the later video models V-JEPA (2024) and V-JEPA 2 (2025). The code repository was archived on 1 August 2024; further work is carried out in the V-JEPA / V-JEPA 2 projects.

Classification
Vision
Access & deployment
Download
Local
Weights: Open weights
Key parameters
Parameters: ~632M (ViT-H/14) – ~1B (ViT-g/16)
✓ Fine-tuning
Input: image

Technical specification

Parameters
~632M (ViT-H/14) – ~1B (ViT-g/16) parameters
Hardware requirements
The reference ViT-H/14 training on ImageNet-1K was performed on 16 NVIDIA A100 80GB GPUs (effective batch size 2048) in under 72 hours. Inference is feasible on a single consumer-grade GPU.
Features: ✓ Fine-tuning
Modalities
Input
image
Output
structured_data
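The hardware figures in this specification can be turned into a few back-of-the-envelope numbers. The sketch below is arithmetic only: it assumes the effective batch is split evenly across GPUs with no gradient accumulation, counts weight storage only (no activations or framework overhead), and uses hypothetical helper names.

```python
GPUS = 16
EFFECTIVE_BATCH = 2048
MAX_HOURS = 72  # "under 72 hours", so this is an upper bound
PARAMS = {"ViT-H/14": 632e6, "ViT-g/16": 1e9}  # approximate counts

def per_gpu_batch(effective_batch: int, gpus: int) -> int:
    """Images per GPU per step, assuming an even split and no accumulation."""
    return effective_batch // gpus

def weight_gb(params: float, bytes_per_param: int) -> float:
    """Memory for the weights alone, in decimal gigabytes."""
    return params * bytes_per_param / 1e9

gpu_hours_upper_bound = GPUS * MAX_HOURS

print("per-GPU batch:", per_gpu_batch(EFFECTIVE_BATCH, GPUS))
print("GPU-hours (upper bound):", gpu_hours_upper_bound)
for name, n in PARAMS.items():
    print(f"{name}: {weight_gb(n, 2):.2f} GB in fp16, "
          f"{weight_gb(n, 4):.2f} GB in fp32")
```

At 16-bit precision the weights of either variant come to about 2 GB or less, which is consistent with the statement that inference is feasible on a single consumer-grade GPU.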

Capabilities and applications

Native model capabilities
Vision encoder
The model's ability to encode images and video frames into dense representations (embeddings), used for downstream tasks or as a backbone for vision-language models.
Category: vision
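One common way to use a vision encoder's output for downstream tasks is to pool the per-patch embeddings into a single image vector and compare vectors by cosine similarity. The sketch below illustrates that pattern with toy three-dimensional values in place of real encoder outputs. Mean pooling is an assumption made here for illustration, and `mean_pool` and `cosine` are hypothetical helpers, not part of any I-JEPA API.

```python
import math

def mean_pool(patch_embeddings):
    """Average a list of per-patch vectors into one image-level vector."""
    n = len(patch_embeddings)
    dim = len(patch_embeddings[0])
    return [sum(p[i] for p in patch_embeddings) / n for i in range(dim)]

def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for encoder outputs: two patches per image, three dimensions.
image_a = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]
image_b = [[0.9, 0.1, 0.0], [1.0, 0.0, 0.1]]
image_c = [[0.0, 0.0, 1.0], [0.1, 0.0, 0.9]]

vec_a, vec_b, vec_c = (mean_pool(x) for x in (image_a, image_b, image_c))
print("sim(a, b):", round(cosine(vec_a, vec_b), 3))
print("sim(a, c):", round(cosine(vec_a, vec_c), 3))
```

The same pooled vectors could also feed a linear classifier or another lightweight head.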

Technical architecture

Core Architecture
Training Techniques