I-JEPA

Self-supervised vision model based on the Joint-Embedding Predictive Architecture, learning semantic image representations by predicting embeddings of masked image regions.

📦 Archived · 🔬 Research only · ⚖ Open weights · Vision · 📁 V-JEPA / JEPA

Parameters

~632M (ViT-H/14) – ~1B (ViT-g/16) parameters

Release date

19 January 2023

Producer: 🏢 Meta AI

Access: Download · Deployment: 💻 Local

Overview

I-JEPA (Image-based Joint-Embedding Predictive Architecture) is a self-supervised method for learning image representations developed at Meta FAIR. It was first released as arXiv:2301.08243 on 19 January 2023 and presented at CVPR 2023 as a Highlight. Authors: Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, Nicolas Ballas.

The core idea of I-JEPA is to predict, from a single context block, the representations (embeddings) of several target blocks in the same image, without generating pixels and without hand-crafted data augmentations. Two design choices are crucial: target blocks must be large enough to carry semantic content, and the context block must be sufficiently informative (spatially distributed).
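The block-sampling idea can be illustrated with a small, dependency-free sketch. It is not the reference implementation: the 16x16 patch grid, the function names, and the scale and aspect-ratio ranges (four target blocks covering roughly 15-20% of the image each, one context block covering 85-100%) are assumptions chosen to resemble the setup described in the paper. Target patches are removed from the context so the prediction task cannot be solved by copying.

```python
import math
import random

GRID = 16  # assumed patch grid: 224px image split into 14px patches


def sample_block(rng, scale_range, aspect_range, grid=GRID):
    """Sample a rectangular block of patch indices on a grid x grid layout."""
    scale = rng.uniform(*scale_range)    # fraction of all patches to cover
    aspect = rng.uniform(*aspect_range)  # height / width of the block
    area = scale * grid * grid
    h = max(1, min(grid, round(math.sqrt(area * aspect))))
    w = max(1, min(grid, round(math.sqrt(area / aspect))))
    top = rng.randint(0, grid - h)       # randint is inclusive on both ends
    left = rng.randint(0, grid - w)
    return {r * grid + c
            for r in range(top, top + h)
            for c in range(left, left + w)}


def sample_masks(seed=0, num_targets=4):
    """Return (context_patches, list_of_target_blocks) as sets of indices."""
    rng = random.Random(seed)
    # Assumed ranges: fairly large targets, one large square context block.
    targets = [sample_block(rng, (0.15, 0.20), (0.75, 1.5))
               for _ in range(num_targets)]
    context = sample_block(rng, (0.85, 1.0), (1.0, 1.0))
    # Drop every target patch from the context to avoid trivial prediction.
    for block in targets:
        context -= block
    return context, targets


context, targets = sample_masks()
print(len(context), [len(t) for t in targets])
```

In a full training loop, the encoder would see only the `context` patches, and a predictor would be asked to output the embedding of each patch in `targets`; that part is omitted here.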

Architecture and scale

The model uses a Vision Transformer backbone in three variants: ViT-H/14, ViT-H/16 (448px) and ViT-g/16. I-JEPA is computationally efficient: training ViT-H/14 on ImageNet-1K takes under 72 hours on 16 A100 GPUs. Reference weights for all three variants are publicly available (pretrained on ImageNet-1K and ImageNet-22K).
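The "/14" and "/16" suffixes are patch sizes in pixels, so the number of tokens the backbone processes follows directly from the input resolution. The sketch below works this out; the 448px resolution for ViT-H/16 is stated above, while 224px for the other two variants is an assumption (the common ViT default), and `num_patches` is a hypothetical helper.

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be divisible by patch size")
    side = image_size // patch_size
    return side * side


# (image size, patch size); 224px is an assumed default where not stated.
variants = {
    "ViT-H/14": (224, 14),
    "ViT-H/16 (448px)": (448, 16),
    "ViT-g/16": (224, 16),
}

for name, (image_size, patch_size) in variants.items():
    print(name, num_patches(image_size, patch_size))
```

Under these assumptions the 448px variant processes roughly three times as many tokens per image as ViT-H/14, which matters for memory and throughput.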

Position in the JEPA family

I-JEPA is the first full image model in the JEPA family. It is the starting point for the later video models V-JEPA (2024) and V-JEPA 2 (2025). The code repository was archived on 1 August 2024; further work is carried out in the V-JEPA / V-JEPA 2 projects.

Classification

Vision

Family: V-JEPA / JEPA

Access & deployment

Download

Local

Weights: Open weights

Key parameters

🧩 Parameters: ~632M (ViT-H/14) – ~1B (ViT-g/16)

✓ Fine-tuning

📥 Input: image

Technical specification

Hardware requirements

The reference ViT-H/14 training on ImageNet-1K was performed on 16 NVIDIA A100 80GB GPUs (effective batch size 2048) in under 72 hours. Inference is feasible on a single consumer-grade GPU.
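As a quick sanity check on these figures, the per-GPU batch size and an upper bound on the compute budget can be derived from the numbers above. The per-GPU value assumes the effective batch is split evenly across GPUs with no gradient accumulation, which the source does not state.

```python
NUM_GPUS = 16           # A100 80GB GPUs in the reference run
EFFECTIVE_BATCH = 2048  # images per optimizer step
MAX_HOURS = 72          # reported upper bound on wall-clock time

# Assumes an even split and no gradient accumulation.
per_gpu_batch = EFFECTIVE_BATCH // NUM_GPUS

# Upper bound only: the run is reported as "under 72 hours".
gpu_hours_upper_bound = NUM_GPUS * MAX_HOURS

print(per_gpu_batch, gpu_hours_upper_bound)
```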

Features: ✓ Fine-tuning

Modalities

⬇ Input

image

⬆ Output

structured_data

Capabilities and applications

Native model capabilities

Vision encoder

The model's ability to encode images and video frames into dense representations (embeddings), used for downstream tasks or as a backbone for vision-language models.

Category: vision
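A common way to use such an encoder downstream is to collapse the per-patch embeddings into one image-level vector and then compare or classify those vectors. The sketch below shows mean pooling followed by cosine similarity on toy two-dimensional data; the helper names and the toy values are illustrative assumptions, and a real encoder would emit vectors with far more dimensions.

```python
import math


def mean_pool(patch_embeddings):
    """Average per-patch vectors into a single image-level vector."""
    if not patch_embeddings:
        raise ValueError("expected at least one patch embedding")
    n = len(patch_embeddings)
    dim = len(patch_embeddings[0])
    return [sum(tok[d] for tok in patch_embeddings) / n for d in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Toy stand-ins for encoder outputs: two patches per image, two dimensions.
image_a = [[1.0, 0.0], [3.0, 0.0]]
image_b = [[0.0, 2.0], [0.0, 4.0]]

vec_a = mean_pool(image_a)
vec_b = mean_pool(image_b)
print(cosine_similarity(vec_a, vec_a), cosine_similarity(vec_a, vec_b))
```

The pooled vectors could equally be fed to a linear classifier (a linear probe) instead of being compared directly.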

Technical architecture

Core Architecture

ViT

Training Techniques

Pretraining

Sources and related pages

3 sources

Paper: Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture (arXiv:2301.08243), arxiv.org
Blog: I-JEPA: A first AI model based on Yann LeCun's vision for more human-like AI (Meta AI), ai.meta.com
Repo: facebookresearch/ijepa (GitHub, archived), github.com
