Input data is split into context (visible part) and target (hidden part). The context encoder generates an embedding from visible data. The target encoder (typically an EMA of the context encoder weights) generates a reference embedding from hidden data. The predictor predicts the target embedding from the context embedding — optionally with an additional latent variable z modeling uncertainty. Training minimizes the distance between prediction and reference in feature space.
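The objective described above can be written compactly. This is a sketch under assumed notation that does not appear in the source: f_theta is the context encoder, f_theta-bar the target encoder, g_phi the predictor, x_c and x_t the context and target parts, D an unspecified distance in feature space, sg the stop-gradient operator, and tau the EMA momentum.

```latex
\mathcal{L}(\theta, \phi)
  = D\Big( g_\phi\big(f_\theta(x_c),\, z\big),\;
           \mathrm{sg}\big[ f_{\bar\theta}(x_t) \big] \Big),
\qquad
\bar\theta \leftarrow \tau\,\bar\theta + (1 - \tau)\,\theta
```

The choice of D (for example a squared L2 or an L1 distance) and of tau varies between implementations; neither is fixed by the description above.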
JEPA targets the inefficiency of self-supervised learning from image and video data. Generative models spend compute reconstructing pixel-level detail that carries no useful signal; contrastive models depend on costly negative-sample mining and hand-crafted augmentations. JEPA aims to learn representations of the physical world without either of these costs.
The context encoder is a neural network (typically a ViT) that processes the visible part of the input into an abstract vector representation. Its weights are updated by standard gradient descent.
The target encoder is a neural network that processes the hidden part of the data into a reference (target) representation. Its weights are typically an exponential moving average (EMA) of the context encoder's weights. A stop-gradient keeps the prediction error from updating them directly.
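The EMA rule for the target encoder can be illustrated with a minimal sketch. It assumes weights are flat lists of floats (real encoders hold tensors), and the function name `ema_update` and the momentum values are hypothetical choices for illustration, not taken from any JEPA codebase.

```python
def ema_update(target_weights, context_weights, momentum=0.99):
    """Return new target weights: momentum * target + (1 - momentum) * context."""
    return [
        momentum * t + (1.0 - momentum) * c
        for t, c in zip(target_weights, context_weights)
    ]


# Toy weights; in practice these would be the parameters of two ViTs.
context = [1.0, 2.0, 3.0]
target = [0.0, 0.0, 0.0]

for _ in range(3):
    # The target moves only through this update, never through gradients.
    target = ema_update(target, context, momentum=0.5)

print(target)
```

A momentum close to 1 makes the target encoder change slowly, which gives the predictor a more stable reference; the value 0.5 here is chosen only to make the drift visible in three steps.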
The predictor is a neural network (typically lighter than the encoders) that maps the context representation to a prediction of the target representation. It can take an additional latent variable z that models uncertainty and multiple possible futures (a stochastic predictor).
If both encoders learn to return the same constant value regardless of input, the prediction error drops to zero, but the model becomes useless.
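The collapse described above can be reproduced in a few lines. This is a toy sketch, assuming a hypothetical `constant_encoder` that ignores its input and an identity predictor. It shows that the loss reaches zero while the embeddings carry no information, and that the per-dimension spread of embeddings across a batch is one simple diagnostic.

```python
from statistics import pstdev


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def constant_encoder(_inputs):
    # Collapsed encoder: returns the same vector for every input.
    return [0.5, -0.5]


def embedding_std(embeddings):
    """Per-dimension standard deviation across a batch of embeddings."""
    return [pstdev(dim) for dim in zip(*embeddings)]


batch = [[0.1, 0.9], [0.7, 0.2], [0.4, 0.4]]
context_emb = [constant_encoder(x) for x in batch]
target_emb = [constant_encoder(x) for x in batch]

# Identity predictor for simplicity: the prediction is the context embedding.
loss = sum(mse(p, t) for p, t in zip(context_emb, target_emb)) / len(batch)
spread = embedding_std(context_emb)
print(loss, spread)
```

A spread near zero in every dimension while the loss is also near zero is a warning sign; the EMA target and stop-gradient described earlier are the mechanisms intended to keep training away from this solution.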
Too small a mask makes prediction trivial, because the predictor can interpolate from neighboring pixels. Too large a mask leaves the target underdetermined by the context, so the prediction becomes non-deterministic and training may fail to converge.
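To make the mask-size trade-off concrete, the sketch below samples one rectangular target block on a patch grid and reports the resulting mask ratio. The grid size, block size, and the helper name `sample_block_mask` are illustrative assumptions, not settings from any published JEPA configuration.

```python
import random


def sample_block_mask(grid_h, grid_w, block_h, block_w, rng):
    """Return the set of (row, col) patch indices hidden by one rectangular block."""
    top = rng.randrange(grid_h - block_h + 1)
    left = rng.randrange(grid_w - block_w + 1)
    return {
        (r, c)
        for r in range(top, top + block_h)
        for c in range(left, left + block_w)
    }


rng = random.Random(0)
grid_h, grid_w = 14, 14  # e.g. a 224x224 image cut into 16x16 patches

# Hidden (target) patches and the remaining visible (context) patches.
target = sample_block_mask(grid_h, grid_w, block_h=6, block_w=5, rng=rng)
context = {(r, c) for r in range(grid_h) for c in range(grid_w)} - target

mask_ratio = len(target) / (grid_h * grid_w)
print(f"hidden patches: {len(target)}, mask ratio: {mask_ratio:.2f}")
```

Varying `block_h` and `block_w` moves the setup between the two failure regimes: very small blocks are easy to fill in from their surroundings, while blocks covering most of the grid leave little context to predict from.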
An architecture designed to filter out irrelevant pixel detail can lose signal in scenes where subtle, static details (e.g., textures, surface defects) are actually important.
The paper "A Path Towards Autonomous Machine Intelligence" (LeCun, 2022) presents JEPA as a key element of a cognitive architecture for autonomous machine intelligence, embedded in the framework of energy-based models.
I-JEPA (Image-JEPA, Assran et al., 2023) demonstrates that latent-space prediction for images achieves results comparable to generative (MAE) and augmentation-based invariance (DINO) methods at significantly lower pretraining compute cost.
V-JEPA (Video-JEPA, Bardes et al., 2024), trained on roughly two million unlabeled videos, demonstrates that motion-dynamics representations can be learned without frame-level labels.
V-JEPA 2, with its action-conditioned variant (V-JEPA 2-AC), is used as an internal world model for planning reaching and grasping tasks on unfamiliar objects. This is arguably the first publicly documented step from architecture toward a product in robotics.
According to Observer and the industry outlet itwiz, LeCun reportedly left Meta in late 2025 or early 2026 to found a startup dedicated to advancing this architecture. This rests on recent press reports and has not been officially confirmed.
JEPA's main compute cost is a forward and backward pass through the context encoder plus a forward pass through the target encoder (no backpropagation, thanks to stop-gradient). For a Vision Transformer, self-attention costs O(n²·d) in the token count n and embedding width d, and it dominates for long token sequences such as video. At video-dataset scale (V-JEPA 2 was pretrained on more than a million hours of video), training requires multi-GPU A100/H100 clusters.
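The quadratic attention term can be made tangible with a rough estimate. The sketch below counts only the multiply-accumulates for the two attention matrix products in a single layer and ignores projections, MLP blocks, and softmax. The patch size, frame count, tubelet depth, and embedding width are illustrative assumptions, and the helper names are hypothetical.

```python
def num_tokens(image_size, patch_size, frames=1, tubelet=1):
    """Token count for a square input cut into patches (and temporal tubelets)."""
    per_side = image_size // patch_size
    return per_side * per_side * (frames // tubelet)


def attention_macs(n_tokens, dim):
    """Multiply-accumulates for QK^T plus attention-times-V in one layer: 2 * n^2 * d."""
    return 2 * n_tokens * n_tokens * dim


image_tokens = num_tokens(224, 16)                        # single image
video_tokens = num_tokens(224, 16, frames=16, tubelet=2)  # short clip

# Same width d for both, so the ratio isolates the n^2 effect.
ratio = attention_macs(video_tokens, 1024) / attention_macs(image_tokens, 1024)
print(image_tokens, video_tokens, ratio)
```

Because the cost grows with the square of the token count, an 8x increase in tokens from adding a temporal axis yields a 64x increase in this term, which is one reason video pretraining is so much more demanding than image pretraining.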
ViT-based encoders parallelize well on GPUs, since the workload is dominated by matrix multiplications and attention. V-JEPA 2 was trained on multi-GPU A100/H100 clusters.
The ViT architecture maps well to TPUs. Meta has published no JEPA implementation for TPU, but the architecture is compatible in principle.
JEPA inference as a robot world model is feasible on edge hardware (Jetson Orin/Thor), but a full ViT-Huge backbone may require quantization or distillation into a smaller model.
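As an illustration of what quantization involves, the sketch below applies symmetric per-tensor int8 quantization to a toy weight list. It is a generic technique shown under simplifying assumptions (pure Python, one scale per tensor, hypothetical helper names), not the procedure used for any JEPA deployment.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: floats -> (int8-range values, scale)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map quantized integers back to approximate float weights."""
    return [v * scale for v in q]


weights = [0.02, -1.27, 0.64, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Worst-case rounding error is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Storing weights as 8-bit integers plus one scale cuts memory roughly fourfold relative to 32-bit floats, at the price of the rounding error measured here; whether that error is acceptable for a world model would have to be checked on the downstream task.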