VLA-JEPA: latent world model for robots, not pixel prediction

A team from USTC, Zhongguancun Academy, SJTU, and Eastern Institute of Technology Ningbo has released VLA-JEPA, a framework that combines Vision-Language-Action (VLA) models with a latent world model and is the first of its kind integrated into the LeRobot library. Rather than predicting future frames in pixel space, the model learns world dynamics in latent space, following the JEPA approach proposed by Yann LeCun. The work has been accepted to ECCV 2026 and was retweeted by LeCun and Saining Xie; it also demonstrated that just 13 robot trajectories suffice for simple assembly tasks.

Key takeaways

VLA-JEPA is a JEPA-style pretraining framework for VLA models — the first ported to Hugging Face's LeRobot library
Backbone: Qwen3-VL + V-JEPA2 encoder — predicts future state in latent space rather than pixels
LIBERO: 97.2% average success rate (top score on Object and LIBERO-10 suites)
LIBERO-Plus (OOD with 7 perturbation types): 78.1% — first place in 5 of 7 dimensions
Code, weights, and data publicly available on GitHub (ginwind/VLA-JEPA) and Hugging Face

Four classes of problems with existing methods

VLA models face a persistent data problem: real robot trajectories are expensive to collect, limited in scale, and narrow in task coverage. Latent action methods try to work around this by pretraining on unlabeled video, mostly internet recordings of humans. The issue is that existing methods use future frames as supervision, and frame-to-frame differences in such video largely reflect lighting changes, background shifts, and camera motion rather than actual manipulation actions.

VLA-JEPA identifies four problem classes. First, pixel-level objectives push representations toward appearance rather than dynamics. Second, motion noise in internet videos dominates the actual manipulation signal. Third, information leakage: when the latent action is computed from both the current and the future observation, it degenerates into a compressed copy of the next frame. Fourth, multi-stage pipelines are prone to inconsistencies between training phases.

The solution is elegant: the future frame stops being a model input and becomes purely a supervision signal. A target encoder encodes the future state and serves as the alignment target, while the predictor operates on the current state and a latent action token — without access to the future. This closure of the leakage channel forces the latent action to genuinely encode "why will the state change," not "what does the next frame look like."

Architecture: Qwen3-VL with V-JEPA2 and flow matching

The model uses Qwen3-VL as the vision-language backbone. Video frames pass through V-JEPA2 and are mapped to world state representations. A learnable latent action token represents the transition between states. The predictor, given the current state and the action token, predicts the future latent state — aligned with the target encoder's output, not with pixels.
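The data flow described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' code: the encoders and predictor are stand-in linear maps over plain lists, the latent action is a fixed vector rather than a learned token, and every name (`encode_current`, `encode_target`, `predict_future_state`, `alignment_loss`, the `W_*` matrices) is hypothetical. What it aims to show is structural: the predictor takes no future-frame argument, so the future frame can only enter training through the loss target.

```python
import random

FRAME_DIM, STATE_DIM, ACTION_DIM = 6, 4, 2
rng = random.Random(0)

def random_matrix(rows, cols):
    """Small random weight matrix as a list of rows."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]

# Hypothetical stand-ins for the real networks (plain linear maps).
W_CONTEXT = random_matrix(STATE_DIM, FRAME_DIM)
W_TARGET = random_matrix(STATE_DIM, FRAME_DIM)
W_PREDICTOR = random_matrix(STATE_DIM, STATE_DIM + ACTION_DIM)

def encode_current(frame):
    """Map the current frame to a latent state."""
    return matvec(W_CONTEXT, frame)

def encode_target(future_frame):
    """Map the future frame to the alignment target (supervision only)."""
    return matvec(W_TARGET, future_frame)

def predict_future_state(current_state, latent_action):
    """Predict the future latent state. Note: no future-frame argument."""
    return matvec(W_PREDICTOR, current_state + latent_action)

def alignment_loss(current_frame, latent_action, future_frame):
    """Mean squared error between predicted and target latent states."""
    predicted = predict_future_state(encode_current(current_frame), latent_action)
    target = encode_target(future_frame)
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

current = [0.1 * i for i in range(FRAME_DIM)]
future_a = [0.1 * i + 0.05 for i in range(FRAME_DIM)]
future_b = [0.3 - 0.1 * i for i in range(FRAME_DIM)]
latent_action = [0.2, -0.1]  # assumption: a fixed vector instead of a learned token

prediction = predict_future_state(encode_current(current), latent_action)
loss_a = alignment_loss(current, latent_action, future_a)
loss_b = alignment_loss(current, latent_action, future_b)
print(prediction, loss_a, loss_b)
```

Swapping `future_a` for `future_b` changes the loss but leaves `prediction` untouched, which is the leakage-free property in miniature: the latent action has to carry the information that explains the transition, because the predictor has no other route to it.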

Training proceeds in two phases. Phase one uses human videos, roughly 220k recordings from Something-Something-v2, and minimizes a latent world model alignment loss. Phase two attaches a flow matching-based action head and trains on robot data (roughly 76k trajectories from DROID), adding an action prediction loss. The result is a two-stage pipeline, simpler than the three-stage approaches of competing methods.
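For readers unfamiliar with flow matching, the following is a minimal sketch of the idea behind such an action head. It assumes the common linear-interpolation form (a straight path from Gaussian noise to the target action); the text above does not say which exact formulation VLA-JEPA uses, and all function names here are hypothetical. A real action head would also condition on the model's visual and language features, which the sketch omits.

```python
import random

def interpolate(noise, action, t):
    """Point on the straight path from noise (t=0) to the action (t=1)."""
    return [(1.0 - t) * n + t * a for n, a in zip(noise, action)]

def target_velocity(noise, action):
    """Velocity of the straight path; it is constant in t."""
    return [a - n for n, a in zip(noise, action)]

def flow_matching_loss(velocity_fn, action, rng):
    """Squared error between predicted and target velocity at a random t."""
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    t = rng.random()  # uniform in [0, 1)
    x_t = interpolate(noise, action, t)
    predicted = velocity_fn(x_t, t)
    target = target_velocity(noise, action)
    return sum((p - v) ** 2 for p, v in zip(predicted, target)) / len(action)

def sample_action(velocity_fn, dim, rng, steps=10):
    """Euler-integrate the velocity field from noise at t=0 up to t=1."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Stand-in for a trained network: the exact velocity field when the data
# consists of one single action (an assumption made purely for the demo).
demo_action = [0.3, -0.7, 0.1]

def oracle_velocity(x, t):
    return [(a - xi) / (1.0 - t) for a, xi in zip(demo_action, x)]

rng = random.Random(0)
demo_loss = flow_matching_loss(oracle_velocity, demo_action, rng)
demo_sample = sample_action(oracle_velocity, len(demo_action), rng)
print(demo_loss, demo_sample)
```

With the oracle field the loss is zero up to floating-point error and the sampler lands on `demo_action`; a learned network would approximate this field for a whole distribution of actions rather than a single one.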

LIBERO and LIBERO-Plus: when robustness matters

On the LIBERO benchmark, VLA-JEPA achieves a 97.2% average success rate — the highest on Object and LIBERO-10 suites — with less robot data than strong baselines like OpenVLA-OFT (97.1%) and pi0 (94.2%).

The real test is LIBERO-Plus, which introduces seven types of distribution shifts: camera, robot, language, lighting, background, noise, and layout. VLA-JEPA ranked first in 5 of 7 dimensions, reaching 78.1% on average — versus 69.6% for OpenVLA-OFT and 61.6% for pi0-Fast. The authors interpret this as evidence that latent action encodes state changes rather than visual patterns.
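To put the averages quoted above in perspective, a short calculation helps. The sketch below uses only the three figures from this paragraph; the helper names are hypothetical, and the "failure reduction" figure is a derived summary computed here, not a number reported by the authors.

```python
# Average LIBERO-Plus success rates (%) as quoted in the text above.
LIBERO_PLUS_AVG = {"VLA-JEPA": 78.1, "OpenVLA-OFT": 69.6, "pi0-Fast": 61.6}

def margin_points(scores, model, baseline):
    """Absolute gap between model and baseline, in percentage points."""
    return round(scores[model] - scores[baseline], 1)

def failure_reduction_pct(scores, model, baseline):
    """Gap expressed as a share of the baseline's average failure rate."""
    baseline_failures = 100.0 - scores[baseline]
    gap = scores[model] - scores[baseline]
    return round(100.0 * gap / baseline_failures, 1)

for baseline in ("OpenVLA-OFT", "pi0-Fast"):
    print(
        baseline,
        margin_points(LIBERO_PLUS_AVG, "VLA-JEPA", baseline),
        failure_reduction_pct(LIBERO_PLUS_AVG, "VLA-JEPA", baseline),
    )
```

The margins come out to 8.5 points over OpenVLA-OFT and 16.5 points over pi0-Fast. Expressed against each baseline's average failure rate, that is roughly 28% and 43% fewer failures, respectively.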

On SimplerEnv, results are more mixed: 65.2% for Google Robot and 57.3% for WidowX. Notably, removing human video from training improved performance on several visual matching tasks — a signal that human video does not create new motor skills but stabilizes existing ones.

13 trajectories and second-attempt grasping: human video's side effect

On a real Franka FR3 arm (Robotiq 2F-85 gripper, three D435 cameras), training used 100 demonstrations across three task classes. VLA-JEPA consistently reopened its gripper and attempted a second grasp after a failed first attempt — behavior that pi0 and pi0.5 did not exhibit reliably. The authors attribute this to knowledge encoded from human recordings, where retry behavior after failure is common.

This is — as the paper argues — the most valuable contribution of human video: not generating new control skills, but adding common sense about what to do when things go wrong. The fact that just 13 trajectories sufficed for simple assembly tasks in LeRobot suggests strong knowledge transfer from pretraining.

Why this matters

VLA-JEPA poses a concrete question: how do you scale robot models without proportional growth in expensive trajectory collection? The key reorientation is treating human video as a source of world dynamics priors rather than a poor substitute for action-labeled data. That is a fundamentally different framing, and it leads to a different architecture and a different kind of robustness.

The argument for why latent prediction objectives are more robust to visual perturbations than pixel-based ones is not that they actively filter noise. It is that the prediction target lives in the representation space of V-JEPA2, an encoder trained on video alone that selectively encodes causally relevant changes. The model does not memorize what a scene looks like; it learns how the scene's state changes in response to actions. The LIBERO-Plus results are consistent with that qualitative difference.

Integration with LeRobot — Hugging Face's popular robotics library — lowers the barrier for reproduction and community-driven extension.

What comes next

Code, weights, and data are published on GitHub (ginwind/VLA-JEPA) and Hugging Face — enabling full reproduction from day one
ECCV 2026 will be the venue for direct comparison with other latent action methods, including those based on V-JEPA2 and on future DROID iterations
Open research question: whether latent dynamics built from internet videos maintain their advantage at full-scale pretraining (above 1 million robot trajectories)

Sources

arXiv - VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model
GitHub - ginwind/VLA-JEPA
VLA-JEPA - Project page
Jiqizhixin (机器之心) - LeCun and Saining Xie retweet: World Model + VLA fusion from Zhongguancun Academy, ECCV 2026