Architecture

JEPA (Joint Embedding Predictive Architecture)

2022 · Active · Updated: 23 June 2026 · Published

Key innovation

Prediction happens in latent (embedding) space rather than raw pixel space, which avoids the cost of modeling irrelevant visual noise while keeping training self-supervised.

How it works

Input data is split into a context (the visible part) and a target (the hidden part). The context encoder produces an embedding of the visible data. The target encoder (typically an EMA of the context encoder weights) produces a reference embedding of the hidden data. The predictor maps the context embedding to a prediction of the target embedding, optionally conditioned on an additional latent variable z that models uncertainty. Training minimizes the distance between the prediction and the reference in feature space.
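The training step described above can be written compactly. The notation is an assumption introduced here for illustration and does not come from the source: f_θ is the context encoder, f_θ̄ the target encoder, g_φ the predictor, x_c and x_t the context and target parts of the input, sg the stop-gradient operator, D a distance in feature space, and τ the EMA momentum.

```latex
\begin{aligned}
s_c &= f_\theta(x_c), \qquad s_t = \operatorname{sg}\!\left(f_{\bar\theta}(x_t)\right), \\
\hat{s}_t &= g_\phi(s_c, z), \\
\mathcal{L}(\theta, \phi) &= D\!\left(\hat{s}_t,\, s_t\right), \\
\bar\theta &\leftarrow \tau\,\bar\theta + (1 - \tau)\,\theta .
\end{aligned}
```

Only θ and φ receive gradients from the loss; θ̄ changes only through the last line.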

Problem solved

JEPA targets the inefficiency of self-supervised learning from image and video data. Generative models spend compute reconstructing irrelevant pixel-level noise; contrastive models depend on expensive negative sample mining and on augmentations. JEPA aims to learn representations of the physical world without either of these costs.

Key mechanisms

Prediction in latent space instead of pixel space

Asymmetric encoder pair: context encoder (trained) + target encoder (EMA of weights)

Predictor as a separate neural head over representations

Block masking of input data for context and target generation

Stop-gradient on the target encoder preventing trivial solutions

Energy-based modeling: low energy for physically consistent configurations

Strengths & limitations

Strengths

✓ Compute efficiency: the model does not learn pixel noise

✓ Better scalability than generative models for image/video

✓ No requirement for negative samples (unlike SimCLR)

✓ No requirement for hand-crafted augmentations (unlike BYOL/MoCo)

✓ Natural integration with planning (consequence prediction)

✓ Coherence with the energy-based models and intuitive-physics perspective

✓ Open implementations released by Meta FAIR (I-JEPA, V-JEPA, V-JEPA 2)
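The planning strength listed above can be illustrated with a toy random-shooting planner. This is a minimal sketch under stated assumptions, not the method of any released JEPA model: `predict_next` is a hypothetical stand-in for a trained action-conditioned predictor, the latent state is a plain 2-D list, and the "energy" is simply the L1 distance between the predicted and goal embeddings.

```python
import random


def predict_next(state, action):
    # Hypothetical stand-in for a trained action-conditioned predictor:
    # in this toy, the latent state simply moves by the action vector.
    return [s + a for s, a in zip(state, action)]


def rollout(state, actions):
    # Apply the predictor repeatedly to imagine the outcome of an action sequence.
    for action in actions:
        state = predict_next(state, action)
    return state


def energy(state, goal):
    # L1 distance between the predicted and the goal embedding (lower is better).
    return sum(abs(s - g) for s, g in zip(state, goal))


def plan(state, goal, horizon=3, num_candidates=256, seed=0):
    # Random shooting: sample action sequences, keep the lowest-energy one.
    rng = random.Random(seed)
    best_actions, best_energy = None, float("inf")
    for _ in range(num_candidates):
        actions = [[rng.uniform(-1.0, 1.0) for _ in state] for _ in range(horizon)]
        candidate_energy = energy(rollout(state, actions), goal)
        if candidate_energy < best_energy:
            best_actions, best_energy = actions, candidate_energy
    return best_actions, best_energy


start, goal = [0.0, 0.0], [1.5, -1.0]
actions, cost = plan(start, goal)
baseline = energy(start, goal)  # energy of staying put
```

The planner never decodes to pixels: candidate action sequences are compared only through their imagined embeddings, which is the sense in which latent prediction fits planning.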

Limitations

✗ Risk of representation collapse, which requires mitigation techniques such as EMA plus stop-gradient

✗ Criticism: applying JEPA prediction repeatedly in a temporal loop reduces to autoregression in latent space, and there is no proof yet of robustness to error accumulation over long horizons

✗ Weaker performance in static environments with high unstructured background noise (where generative models do better)

✗ Most robotic applications remain at the research/demo stage, with few production deployments

✗ AI safety community concerns regarding intrinsic motivation mechanisms in autonomous agents

✗ Difficulty in tuning the masking ratio and representation dimensionality hyperparameters

Components

Context Encoder: Produces the context (visible fragment) representation used by the predictor as input.

Neural network (typically ViT) processing the visible part of input data into an abstract vector representation. Its weights are updated by standard gradient descent.

Target Encoder: Provides the reference representation of the hidden fragment, which the prediction should match.

Neural network processing the hidden part of data into a reference (target) representation. Its weights are typically an Exponential Moving Average (EMA) of the context encoder weights. Stop-gradient prevents weight updates from the predictor's error.
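The EMA relationship between the two encoders can be sketched in a few lines. This is an illustrative sketch, not code from any JEPA release: weights are represented as plain dictionaries of lists, and the default momentum of 0.996 is an assumed value that would be tuned or scheduled in practice.

```python
def ema_update(target_params, context_params, momentum=0.996):
    """Move each target weight a small step toward the matching context weight.

    No gradient is involved: the target encoder changes only through this update.
    """
    for name, context_values in context_params.items():
        target_values = target_params[name]
        for i, weight in enumerate(context_values):
            target_values[i] = momentum * target_values[i] + (1.0 - momentum) * weight
    return target_params


# Toy weights (hypothetical parameter name, for illustration only).
context = {"layer.weight": [1.0, 2.0]}
target = {"layer.weight": [0.0, 0.0]}
ema_update(target, context, momentum=0.9)
```

With a momentum close to 1 the target encoder lags well behind the context encoder, which keeps the regression target slowly moving.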

Predictor: The key operational element of the architecture; this is the actual world model.

Neural network (typically lighter than the encoders) mapping the context representation to a prediction of the target representation. Can take an additional latent variable z modeling uncertainty and multiple possible futures (stochastic predictor).

Deterministic predictor: Without the z variable, produces a single prediction.

Stochastic predictor with latent z: Models a distribution of possible futures; useful for uncertain video predictions.

Implementation

Implementation pitfalls

Representation collapse · High

If both encoders learn to return the same constant value regardless of input, the prediction error drops to zero, but the model becomes useless.

Fix: Common mitigations are an Exponential Moving Average (EMA) of the context encoder weights for the target encoder, stop-gradient so that no gradients flow back through the target encoder, and variance/covariance regularization of the representations (as in VICReg). I-JEPA and V-JEPA use EMA + stop-gradient.
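The variance part of a VICReg-style regularizer can be sketched as follows to show why it penalizes collapse. This is a simplified illustration under stated assumptions: only the variance hinge term is shown (the covariance term is omitted), embeddings are plain lists, and `target_std` and `eps` are assumed values rather than settings from any particular paper.

```python
import math


def variance_penalty(embeddings, target_std=1.0, eps=1e-4):
    """Hinge penalty on the per-dimension standard deviation across a batch.

    The penalty is large when every sample maps to (almost) the same vector
    and zero once each dimension varies by at least target_std.
    """
    batch_size = len(embeddings)
    dim = len(embeddings[0])
    penalty = 0.0
    for d in range(dim):
        column = [row[d] for row in embeddings]
        mean = sum(column) / batch_size
        var = sum((v - mean) ** 2 for v in column) / (batch_size - 1)
        std = math.sqrt(var + eps)
        penalty += max(0.0, target_std - std)
    return penalty / dim


# A collapsed batch (identical embeddings) versus a well-spread batch.
collapsed = [[0.5, -0.2]] * 4
spread = [[-2.0, 1.5], [2.0, -1.5], [-2.0, -1.5], [2.0, 1.5]]
collapsed_penalty = variance_penalty(collapsed)
spread_penalty = variance_penalty(spread)
```

Adding such a term to the prediction loss makes the constant-output solution costly, which is a different route to the same goal as EMA plus stop-gradient.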

Wrong masking strategy · Medium

Too small a mask makes prediction trivial (the predictor can interpolate from neighboring pixels). Too large a mask leaves the target underdetermined by the context, so prediction becomes non-deterministic and the model does not converge.

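Fix: Use block masking. I-JEPA samples several target blocks, each covering roughly 15-20% of the image, and predicts them from a large context block with the target regions removed. For video, additionally use tube masking, which extends the spatial mask along the time axis. The ratios require empirical tuning per dataset.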
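Block masking on a grid of patches can be sketched as below. This is a minimal illustration, not the reference implementation: the 14 x 14 grid, the four target blocks, and the scale and aspect-ratio ranges are assumed values chosen for the example, and the function names are hypothetical.

```python
import math
import random


def sample_block(grid_h, grid_w, scale_range, aspect_range, rng):
    """Sample one rectangular block of (row, col) patch indices on the grid."""
    scale = rng.uniform(*scale_range)    # fraction of the grid area to cover
    aspect = rng.uniform(*aspect_range)  # height / width ratio of the block
    area = scale * grid_h * grid_w
    h = max(1, min(grid_h, round(math.sqrt(area * aspect))))
    w = max(1, min(grid_w, round(math.sqrt(area / aspect))))
    top = rng.randint(0, grid_h - h)
    left = rng.randint(0, grid_w - w)
    return {(r, c) for r in range(top, top + h) for c in range(left, left + w)}


def sample_masks(grid_h=14, grid_w=14, num_targets=4, seed=0):
    rng = random.Random(seed)
    targets = [
        sample_block(grid_h, grid_w, (0.15, 0.2), (0.75, 1.5), rng)
        for _ in range(num_targets)
    ]
    context = sample_block(grid_h, grid_w, (0.85, 1.0), (1.0, 1.0), rng)
    # Remove every target patch from the context so the prediction is not trivial.
    for block in targets:
        context -= block
    return context, targets


context, targets = sample_masks()
```

The context encoder would see only the patches in `context`, while the predictor is asked for the embeddings of each block in `targets`. For video, the same 2-D block could be repeated across frames to form a tube.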

Weaker performance in static environments with background noise · Medium

An architecture designed to filter out irrelevant pixels can lose signal in scenes where subtle, static details (e.g., textures, surface defects) are actually important.

Fix: Hybridize with a generative model (e.g., an additional reconstruction head), or train on texture-sensitive downstream tasks instead of relying on pure JEPA pretraining.

Evolution

Original paper · 2022 · OpenReview (Position paper, Meta FAIR) · Yann LeCun

A Path Towards Autonomous Machine Intelligence

2022

Yann LeCun publishes the position paper defining JEPA

Inflection point

The paper A Path Towards Autonomous Machine Intelligence presents JEPA as a key element of a cognitive architecture for autonomous machine intelligence, embedded in the energy-based models framework.

2023

Meta FAIR releases I-JEPA — the first practical implementation for images

I-JEPA (Image-JEPA, Assran et al., 2023) demonstrates that latent-space prediction for images achieves results comparable to generative (MAE) and view-invariance-based (DINO) methods at significantly lower pretraining compute cost.

2024

Meta FAIR releases V-JEPA — extension to video

V-JEPA (Video-JEPA, Bardes et al., 2024), trained on roughly two million unlabeled videos, demonstrates that motion-dynamics representations can be learned without frame labeling.

2025

V-JEPA 2 and V-JEPA 2-AC — first real robotics application

Inflection point

V-JEPA 2, with its action-conditioned variant (V-JEPA 2-AC), is used as an internal world model for planning reaching and grasping tasks on unfamiliar objects. It is the first publicly documented step from the architecture toward real-robot use.

2026

Reports of Yann LeCun leaving Meta to found a JEPA-focused startup

According to Observer and the industry outlet itwiz, LeCun reportedly left Meta in late 2025/early 2026 to found a new startup dedicated to advancing this architecture. This is based on recent press reports and has not been officially confirmed.