Robots Atlas
AI Architecture

What is JEPA? Yann LeCun's architecture for world models

JEPA (Joint Embedding Predictive Architecture) is a machine learning architecture that learns to predict abstract representations of the world instead of reconstructing raw pixels. Created by Yann LeCun, it aims to give machines something large language models lack — an intuitive grasp of physics and causality.

What is JEPA?

JEPA, short for Joint Embedding Predictive Architecture, is a self-supervised learning architecture, not a single AI model. It is best understood as a design blueprint — a way of building neural networks that learn to understand images or video not by reconstructing every detail, but by predicting the meaning of what they cannot see.

The key difference from today's dominant approach is this. Large language models learn by guessing the next word in a text. Generative image models learn by reconstructing missing pixels. JEPA does something else: it hides part of the input and then tries to predict the abstract representation of that hidden part, rather than its exact appearance. In other words, the model learns to predict what the missing region means, not how it looks pixel by pixel.

This seemingly small change has large consequences. The world at the pixel level is chaotic — leaves flutter in the wind, water ripples, textures are unpredictable. A model that tries to predict every such detail wastes compute modeling noise. JEPA deliberately ignores that noise and focuses on structure and semantics. LeCun introduced the concept in his 2022 paper A Path Towards Autonomous Machine Intelligence, which is best read as a manifesto for the whole approach.

Who is behind it?

The concept comes from Yann LeCun, a 2018 Turing Award laureate and one of the pioneers of deep learning, for years the chief AI scientist at Meta and head of its FAIR (Fundamental AI Research) lab. Meta produced the first public implementations of JEPA, released as open research models.

LeCun consistently argues that large language models are a dead end on the road to human-level intelligence, because they lack an understanding of the physical world, the ability to plan, and common sense. From his perspective, real intelligence must rest on world models — internal simulations of reality, similar to those an infant builds in its head while observing its surroundings.

In November 2025 LeCun announced he was leaving Meta after more than a decade, and in December he co-founded the startup Advanced Machine Intelligence Labs (AMI Labs), focused precisely on world models. The JEPA architecture and its existing variants, meanwhile, remain well documented in Meta’s publications and on arXiv, regardless of its creator’s career moves.

How does it work?

Before diving into the details, it helps to grasp the core intuition. If we cover part of a photo and ask “what is there?”, a person does not reconstruct every pixel in their head — they immediately think in concepts, for example “a hand holding a mug”. JEPA imitates exactly this: instead of guessing the exact appearance of the hidden part, it tries to predict its meaning.

The key question, then, is how to teach a network such “concepts” when no one tells it explicitly what they are. JEPA’s answer is to compare predictions not at the level of pixels but at the level of abstract representations — and it is precisely this idea that we break down below.

JEPA’s mechanism rests on prediction in latent space, not in the space of raw data. The whole training process can be broken down into five repeated steps.

Step 1 — Masking. From a single image or video sequence we separate a visible part (the context, $x$) and a hidden part the model must guess (the target, $y$).
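A minimal sketch of the masking step, assuming the image has already been cut into a grid of patches. The grid size, the single rectangular target block, and the function name `split_context_target` are illustrative assumptions, not details given in this article.

```python
import random

def split_context_target(grid_h, grid_w, block_h, block_w, rng):
    """Pick one rectangular block of patches as the target; the rest is context.

    Patches are identified by their flat index in a grid_h x grid_w grid.
    Using a single rectangular block is an assumption made for illustration.
    """
    top = rng.randrange(grid_h - block_h + 1)
    left = rng.randrange(grid_w - block_w + 1)
    target = {
        r * grid_w + c
        for r in range(top, top + block_h)
        for c in range(left, left + block_w)
    }
    context = [i for i in range(grid_h * grid_w) if i not in target]
    return context, sorted(target)

rng = random.Random(0)
context_idx, target_idx = split_context_target(4, 4, 2, 2, rng)
print("context patches:", context_idx)
print("target patches:", target_idx)
```

The two index lists never overlap, so the context side never sees the patches it is later asked to predict.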

Step 2 — Encoding. The context and the target pass through two separate encoders that turn pixels into feature vectors, that is, lists of numbers representing the abstract, high-level properties of the input data rather than its raw appearance:

$$s_x = f_\theta(x), \qquad s_y = f_{\bar\theta}(y)$$

Symbol | Meaning
$x$ | visible context (e.g. a patch of the image)
$y$ | hidden part, the prediction target
$f_\theta$ | context encoder (weights $\theta$ trained by backprop)
$f_{\bar\theta}$ | target encoder (EMA weights $\bar\theta$, no backprop)
$s_x, s_y$ | representations (feature vectors) of context and target

Step 3 — Prediction. The predictor tries to reconstruct the target representation using only the context representation $s_x$ and an optional latent variable $z$ that encodes uncertainty and multiple possible futures:

$$\hat{s}_y = g_\phi(s_x, z)$$

Symbol | Meaning
$g_\phi$ | predictor, the network mapping context to target
$z$ | latent variable encoding uncertainty and alternative futures
$\hat{s}_y$ | predicted target representation
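The encoding and prediction steps can be sketched as a single forward pass. This is a toy illustration only: plain linear maps stand in for the Vision Transformer encoders, and every size and variable name (`theta`, `theta_bar`, `phi`, `s_x`, `s_y`, `s_y_hat`) is an assumption chosen for readability.

```python
import random

def linear(weights, vec):
    """Multiply a weight matrix (a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def init_matrix(rows, cols, rng):
    """Small random weight matrix; a stand-in for a real network."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
INPUT_DIM, REP_DIM, LATENT_DIM = 6, 3, 1  # hypothetical sizes

theta = init_matrix(REP_DIM, INPUT_DIM, rng)           # context encoder weights
theta_bar = [row[:] for row in theta]                  # target encoder starts as a copy
phi = init_matrix(REP_DIM, REP_DIM + LATENT_DIM, rng)  # predictor weights

x = [rng.random() for _ in range(INPUT_DIM)]  # visible context, flattened
y = [rng.random() for _ in range(INPUT_DIM)]  # hidden target, flattened
z = [0.0]                                     # optional latent variable

s_x = linear(theta, x)          # context representation
s_y = linear(theta_bar, y)      # target representation
s_y_hat = linear(phi, s_x + z)  # prediction uses s_x and z, never y itself

print(s_y_hat)
```

The point of the sketch is the data flow: the predictor only ever receives the context representation and the latent variable, while the target is encoded separately.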

Step 4 — Loss. Training minimizes the distance between the predicted representation $\hat{s}_y$ and the actual target representation $s_y$, measured in feature space, not in pixels:

$$\mathcal{L} = \lVert \hat{s}_y - \mathrm{sg}(s_y) \rVert_2^2$$

Symbol | Meaning
$\mathcal{L}$ | loss minimized during training
$\mathrm{sg}(\cdot)$ | stop-gradient, blocks learning of the target encoder
$\lVert \cdot \rVert_2^2$ | squared Euclidean distance between vectors

The $\mathrm{sg}$ (stop-gradient) operator blocks gradient flow through the target encoder. Without it, the network could cheat by collapsing both representations to the same constant (representation collapse).
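A small sketch of the loss with the stop-gradient made explicit. The gradient is written out by hand, so "stop-gradient" here simply means that no gradient is computed for the target side. The function name `jepa_loss_and_grad` and the example vectors are assumptions for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def jepa_loss_and_grad(s_y_hat, s_y):
    """Return the loss and its gradient with respect to the prediction only.

    Treating s_y as a fixed constant is what the stop-gradient expresses:
    nothing is returned that could be used to update the target encoder.
    """
    target = list(s_y)  # detached copy, treated as a constant
    loss = squared_distance(s_y_hat, target)
    grad_pred = [2.0 * (p - t) for p, t in zip(s_y_hat, target)]
    return loss, grad_pred

loss, grad = jepa_loss_and_grad([1.0, 2.0, 0.0], [0.0, 2.0, 2.0])
print(loss)  # 5.0
print(grad)  # [2.0, 0.0, -4.0]
```

In an automatic-differentiation framework the same effect is usually obtained with an explicit call such as `jax.lax.stop_gradient` in JAX or `Tensor.detach()` in PyTorch.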

Step 5 — Updating the target encoder. The target encoder’s weights are not trained by backpropagation. Instead they slowly track the context encoder as an exponential moving average (EMA):

$$\bar\theta \leftarrow \tau\,\bar\theta + (1 - \tau)\,\theta$$

Symbol | Meaning
$\tau$ | momentum coefficient, close to 1
$\theta, \bar\theta$ | weights of the context and target encoders

The coefficient $\tau$ is close to one (e.g. around 0.99), so the target encoder changes smoothly and stabilizes the whole training.
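The EMA update is short enough to write out directly. The sketch below assumes the weights are flat lists of floats and holds the context weights fixed for 100 steps, which is an artificial setup chosen only to show how slowly the target weights follow.

```python
def ema_update(theta_bar, theta, tau=0.99):
    """Move the target weights a small step towards the context weights."""
    return [tau * tb + (1.0 - tau) * t for tb, t in zip(theta_bar, theta)]

theta = [1.0, -2.0]     # context encoder weights (held fixed in this toy)
theta_bar = [0.0, 0.0]  # target encoder weights

for _ in range(100):
    theta_bar = ema_update(theta_bar, theta)

print(theta_bar)
```

With a coefficient of 0.99 the remaining gap shrinks by a factor of 0.99 per step, so after 100 steps the target weights have covered roughly 63% of the distance to the context weights. That lag is what keeps the prediction target from shifting abruptly between training steps.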

The whole mechanism can also be framed in the language of energy-based models: the (context, target) pair is assigned an “energy” equal to the prediction error:

$$E(x, y) = \lVert \hat{s}_y - s_y \rVert_2^2$$

Symbol | Meaning
$E(x, y)$ | energy assigned to the context–target pair (low = pair consistent with reality)
$\hat{s}_y$ | target representation predicted from the context
$s_y$ | actual target representation
$\lVert \hat{s}_y - s_y \rVert_2^2$ | squared distance between them, i.e. the prediction error

The model is trained so that the energy is low for pairs consistent with the physics of the world and high for inconsistent ones. Low energy means the context and target genuinely match.

What are its key components?

A standard JEPA architecture rests on three modules, usually built on Vision Transformer (ViT) backbones:

  • Context encoder — processes the visible part of the data and produces its abstract representation, filtering out irrelevant background.
  • Target encoder — processes the hidden part of the data and provides a reference representation used as the ground truth during training.
  • Predictor — the main operational component, which predicts the target representation from the context representation. It can use an additional latent variable z to model uncertainty and multiple possible futures.

The biggest challenge of this design is representation collapse. If both encoders learn to return the same constant value regardless of the input, the prediction error drops to zero but the model becomes useless. Regularization techniques are used to prevent this. In the early variants this meant encoder asymmetry and updating the target encoder weights via an exponential moving average (EMA), a slow averaging of weights over time that gives more weight to recent values, so the target encoder changes gradually and stably. Newer approaches rely on related methods such as the variance and covariance regularization known from VICReg (Variance-Invariance-Covariance Regularization), a method that prevents representation collapse by keeping feature variance high and decorrelating the features.
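To make collapse concrete, the sketch below computes a VICReg-style variance term: a hinge penalty that is zero while each feature dimension keeps a standard deviation above a threshold and grows when the batch collapses to a constant. The threshold `gamma`, the small constant `eps`, and the two example batches are assumptions for illustration; this is not a full VICReg implementation (the invariance and covariance terms are omitted).

```python
import math

def variance_penalty(batch, gamma=1.0, eps=1e-4):
    """Hinge on the per-dimension standard deviation across a batch.

    batch is a list of feature vectors. The penalty is 0 when every
    dimension has a standard deviation of at least gamma.
    """
    n, dim = len(batch), len(batch[0])
    penalty = 0.0
    for j in range(dim):
        column = [row[j] for row in batch]
        mean = sum(column) / n
        var = sum((v - mean) ** 2 for v in column) / (n - 1)
        penalty += max(0.0, gamma - math.sqrt(var + eps))
    return penalty / dim

# Collapsed encoder: every input maps to the same vector.
collapsed = [[0.3, 0.3], [0.3, 0.3], [0.3, 0.3], [0.3, 0.3]]
# Healthy encoder: different inputs map to clearly different vectors.
spread = [[-1.5, 2.0], [0.5, -1.0], [1.5, 0.0], [-0.5, -2.0]]

print(variance_penalty(collapsed))  # close to 0.99
print(variance_penalty(spread))     # 0.0
```

A collapsed batch makes the prediction loss trivially zero, yet it is exactly the case this penalty punishes most, which is why such a term can be added to the training objective as a guard.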

What can it be used for?

The most promising application area for JEPA is robotics and control. Because the architecture learns to predict the consequences of events in representation space, it can serve as an internal "simulator" that lets a robot plan actions before executing them.

Meta demonstrated this direction with the V-JEPA 2 model and its action-conditioned variant (V-JEPA 2-AC). According to Meta's publication and the accompanying arXiv paper, a model fine-tuned on recordings of robotic arm motion could plan reaching and grasping tasks with unfamiliar objects in new surroundings, using only visual data and without traditional simulation training. This capability is described as zero-shot planning, that is, performing a task without prior training on that specific task.
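The idea of planning inside a learned "simulator" can be sketched with a toy example. Everything here is hypothetical: the 2-D latent state, the additive dynamics in `predict_next`, the discrete action set, and the exhaustive search are stand-ins chosen for clarity, and they are not the planner used in V-JEPA 2-AC.

```python
from itertools import product

def predict_next(state, action):
    """Hypothetical latent dynamics: the action shifts a 2-D latent state."""
    return (state[0] + action[0], state[1] + action[1])

def energy(state, goal):
    """Squared distance between a predicted latent state and the goal state."""
    return (state[0] - goal[0]) ** 2 + (state[1] - goal[1]) ** 2

def plan(start, goal, actions, horizon):
    """Roll out every action sequence in imagination; keep the lowest-energy one."""
    best_seq, best_energy = None, float("inf")
    for seq in product(actions, repeat=horizon):
        state = start
        for action in seq:
            state = predict_next(state, action)
        e = energy(state, goal)
        if e < best_energy:
            best_seq, best_energy = seq, e
    return best_seq, best_energy

ACTIONS = [(1, 0), (-1, 0), (0, 1), (0, -1), (0, 0)]
best_seq, best_energy = plan(start=(0, 0), goal=(2, 1), actions=ACTIONS, horizon=3)
print(best_seq, best_energy)
```

No action is executed during the search. The candidate sequences are only evaluated through the predictor, and the robot would act on the winning sequence afterwards. A real system would replace the exhaustive loop with a sampling-based optimizer, since the number of sequences grows exponentially with the horizon.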

Beyond robotics, natural targets include autonomous vehicles, industrial systems, and any application that requires understanding the physical dynamics of an environment. It should be stressed, however, that most of these use cases are at the research and demonstration stage, not large-scale production deployment.

How does it differ from other approaches?

JEPA positions itself between two earlier strands of self-supervised learning.

Generative models

Generative models reconstruct raw data, that is, pixels or tokens. Examples include Masked Autoencoders, which learn by reconstructing masked image patches pixel by pixel, and diffusion models, which create an image by gradually denoising random noise. This works very well for language, but for images and video it forces the model to spend capacity on irrelevant noise. For long sequences it leads to a "blurry" effect as the model averages all possible versions of the future.
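The "blurry" effect follows directly from minimizing squared error when several futures are possible. The toy below assumes a single pixel that will be either dark (0.0) or bright (1.0) with equal probability; the values and the candidate grid are illustrative assumptions.

```python
def mse(prediction, outcomes):
    """Mean squared error of one prediction against all possible outcomes."""
    return sum((prediction - o) ** 2 for o in outcomes) / len(outcomes)

# Two equally likely futures for one pixel: dark or bright.
futures = [0.0, 1.0]
candidates = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0

best = min(candidates, key=lambda p: mse(p, futures))
print(best)  # 0.5
```

The error-minimizing prediction is mid-gray, a value that matches neither possible future. Repeated over every pixel and every frame, this averaging is what produces washed-out predictions.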

Contrastive models

Contrastive models learn by comparing an image with its transformed versions and pushing apart the representations of different objects. A typical example is SimCLR, which pulls together augmentations of the same image and pushes apart different images. The closely related BYOL reaches compact representations without negative samples, using a target network updated by EMA. These methods operate in semantic space, a representation space where vector proximity reflects similarity of meaning rather than appearance, but contrastive training requires many negative examples and hand-picked augmentations, which can be costly and introduce bias.

JEPA combines the strengths of both. Like contrastive methods it operates in a compact representation space, and like generative models it is predictive, but it needs neither pixel reconstruction nor negative samples. According to DeepLearning.AI, the early I-JEPA reached ImageNet accuracy comparable to a generative Masked Autoencoder with several times less compute.

Trait | Generative models | Contrastive models | JEPA
Prediction space | raw pixels / tokens | representation space | representation space
Negative samples | not applicable | required | not needed
Handling of noise | models the noise (blurry) | ignored via augmentations | abstracts naturally

Key limitations and challenges

The architecture has real limitations worth keeping in mind.

First, some critics note that applying JEPA prediction repeatedly in a temporal loop essentially reduces it to autoregression — only in vector space instead of tokens. There is as yet no proof that such latent autoregression is robust against the accumulation of error over long horizons, a well-known weakness of autoregressive models.
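The concern about accumulated error can be illustrated with a one-dimensional toy rollout. The dynamics, the 1% one-step error, and the horizon of 100 steps are all assumptions picked for illustration; the sketch says nothing about how large the effect is in a real JEPA model.

```python
def rollout(initial, gain, steps):
    """Apply the same one-step model repeatedly, feeding each output back in."""
    state, trajectory = initial, []
    for _ in range(steps):
        state = gain * state
        trajectory.append(state)
    return trajectory

TRUE_GAIN = 1.00   # hypothetical true latent dynamics (state stays constant)
MODEL_GAIN = 1.01  # learned model that is off by 1% per step

true_traj = rollout(1.0, TRUE_GAIN, 100)
model_traj = rollout(1.0, MODEL_GAIN, 100)
errors = [abs(m - t) for m, t in zip(model_traj, true_traj)]

print(errors[0], errors[9], errors[99])
```

An error of about 0.01 after one step grows to about 0.10 after ten steps and about 1.70 after a hundred, because each prediction is built on the previous, already slightly wrong one.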

Second, the ability to discard noise can be a drawback. Research suggests that JEPA may lose performance in static environments with a large amount of irregular background noise, where generative models do better.

Third, the vision of fully autonomous agents that plan their own actions raises a safety debate. The intrinsic-motivation mechanisms LeCun envisions for such systems concern some researchers working on AI alignment.

The whole architecture remains experimental — its advantage over the proven scaling of generative models is not yet settled.

Why does it matter?

JEPA matters not because it already outperforms language models, but because it proposes a different path at a moment when almost the entire industry is betting on one idea: scaling autoregressive models on ever larger text corpora. If LeCun is right, raw compute alone will not give machines common sense and an understanding of the physical world — and then a world-model-based approach will be needed.

JEPA's value is therefore partly technical and partly strategic. Technically, it shows that prediction in representation space is a viable, efficient alternative to generating pixels and to costly contrastive methods. Strategically, it keeps pluralism alive in AI research — it ensures that not all of the world's resources flow in a single direction.

For anyone following robotics and embodied AI, this is an architecture worth understanding now, because that is precisely where — in controlling physical machines that learn from observation — its advantage over a purely generative approach seems most tangible. Whether it becomes the foundation of future autonomous machine intelligence remains an open question, but as a research direction it is one of the most serious challenges to the dominance of large language models.

JEPA is not a finished product or a ChatGPT competitor, but an architectural proposal for looking at machine learning from the angle of understanding the world rather than reproducing data. Its fate will be decided in the coming years, in research labs and on the first robots learning to plan from sight alone.

Sources

  • Meta AI. I-JEPA: The first AI model based on Yann LeCun's vision.
  • Meta AI. V-JEPA 2 world model and benchmarks.
  • DeepLearning.AI. The Batch: I-JEPA learns by predicting representations.
  • Reuters. Yann LeCun to leave Meta, launch AI startup focused on Advanced Machine Intelligence.
  • Reuters. Ex-Meta AI chief Yann LeCun's AMI raises $1.03 billion for alternative AI approach.