
WAM (World Action Model)

2025 · Experimental · Published

Key innovation

Unified training of world prediction and action generation in a single autoregressive transformer. The model jointly learns physical dynamics (future visual observations) and a robot policy (action sequences), enabling richer embodied representations without separate world-model and policy networks.
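To make the single-transformer idea concrete, here is a minimal Python sketch, under stated assumptions, of how instruction tokens, visual-tokenizer codes and discretized actions could share one vocabulary and one token stream. The vocabulary sizes, the offset layout and the names `Step` and `build_sequence` are hypothetical illustrations, not WorldVLA's actual configuration. With a layout like this, next-token prediction at image positions is the world-prediction task and at action positions is the policy task.

```python
from dataclasses import dataclass

# Hypothetical vocabulary layout; the sizes are illustrative assumptions only.
TEXT_VOCAB = 32_000   # language-token ids occupy [0, 32000)
IMAGE_CODES = 8_192   # visual-tokenizer codebook entries
ACTION_BINS = 256     # discrete values per action dimension

IMAGE_OFFSET = TEXT_VOCAB
ACTION_OFFSET = TEXT_VOCAB + IMAGE_CODES
VOCAB_SIZE = ACTION_OFFSET + ACTION_BINS


@dataclass
class Step:
    image_codes: list[int]  # visual-tokenizer codes for one observed frame
    action_bins: list[int]  # one bin index per action dimension


def build_sequence(text_ids: list[int], steps: list[Step]) -> tuple[list[int], list[str]]:
    """Interleave instruction, frames and actions into one token stream.

    Also returns a parallel list of modality tags so a training loop can
    route each position to the matching loss term.
    """
    if not all(0 <= t < TEXT_VOCAB for t in text_ids):
        raise ValueError("text id out of range")
    ids, kinds = list(text_ids), ["text"] * len(text_ids)
    for step in steps:
        if not all(0 <= c < IMAGE_CODES for c in step.image_codes):
            raise ValueError("image code out of range")
        if not all(0 <= b < ACTION_BINS for b in step.action_bins):
            raise ValueError("action bin out of range")
        ids += [IMAGE_OFFSET + c for c in step.image_codes]
        kinds += ["image"] * len(step.image_codes)
        ids += [ACTION_OFFSET + b for b in step.action_bins]
        kinds += ["action"] * len(step.action_bins)
    return ids, kinds


ids, kinds = build_sequence([5, 17], [Step([3, 9], [128, 0]), Step([4, 1], [255, 64])])
print(ids)  # [5, 17, 32003, 32009, 40320, 40192, 32004, 32001, 40447, 40256]
```

Disjoint id ranges let one output layer cover every modality, while the modality tags keep the losses separable.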

Category

Robotics

Abstraction level

Pattern

Operation level

Model · Robot control · Training

Use cases

Long-horizon robot task planning

Sim-to-real transfer for manipulation

Generalist robot policies across environments

Interactive robot assistants with goal-directed behavior

Research on unified perception-action-prediction architectures

How it works

The model jointly learns to predict future world states and to generate action sequences conditioned on goals. It maintains an internal latent representation of the world that is updated as actions are taken, enabling multi-step planning grounded in physical consequences.
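The planning part of this description can be illustrated with a toy, hand-written world model. The sketch below is a generic model-predictive search (exhaustive over short action sequences), not the procedure of any particular World Action Model: `predict_next_latent` is a hypothetical stand-in for learned latent dynamics, and the one-dimensional latent, the damping factor and the candidate action set are all assumptions. Each candidate sequence is rolled forward in imagination and scored by how close the predicted end state lands to the goal.

```python
import itertools


def predict_next_latent(latent: float, action: float) -> float:
    """Hypothetical stand-in for learned dynamics: damped drift plus the action."""
    return 0.9 * latent + action


def imagined_cost(latent: float, actions: tuple[float, ...], goal: float) -> float:
    """Roll the latent forward in imagination and score the predicted end state."""
    for action in actions:
        latent = predict_next_latent(latent, action)
    return abs(goal - latent)


def plan(latent: float, goal: float, horizon: int = 3,
         choices: tuple[float, ...] = (-1.0, 0.0, 1.0)) -> tuple[float, ...]:
    """Return the action sequence whose imagined outcome lands closest to the goal."""
    candidates = itertools.product(choices, repeat=horizon)
    return min(candidates, key=lambda seq: imagined_cost(latent, seq, goal))


print(plan(0.0, goal=2.5))  # (1.0, 1.0, 1.0): predicted end state 2.71
```

In closed loop one would typically execute only the first action, observe the new state and re-plan. Exhaustive search is only viable for toy cases, since the number of candidates grows as `len(choices) ** horizon`.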

Problem solved

Many current robot AI systems either plan abstractly (without motor grounding) or act reactively (without long-horizon planning). World Action Models unify predictive world modeling and direct action generation in a single architecture.

Components

Visual tokenizer

Action head

Future-frame decoder

Language conditioning
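The action head and the future-frame decoder are trained together, so their losses have to be combined into one objective. A minimal sketch, assuming both heads emit logits over discrete tokens and each position carries a modality tag; the per-modality averaging and the hypothetical `frame_weight` knob are illustrative assumptions rather than a documented WorldVLA setting.

```python
import math


def cross_entropy(logits: list[float], target: int) -> float:
    """Negative log-softmax probability of `target`, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def joint_loss(logits_per_pos: list[list[float]], targets: list[int],
               kinds: list[str], frame_weight: float = 1.0) -> tuple[float, float, float]:
    """Combine action-head and future-frame losses into one training objective.

    Positions tagged "action" feed the action loss, "image" positions feed the
    frame loss, and other positions (e.g. "text") are conditioning only. Each
    term is averaged over its own positions so that the many frame tokens do
    not swamp the few action tokens; `frame_weight` then sets the balance.
    """
    totals = {"action": 0.0, "image": 0.0}
    counts = {"action": 0, "image": 0}
    for logits, target, kind in zip(logits_per_pos, targets, kinds):
        if kind in totals:
            totals[kind] += cross_entropy(logits, target)
            counts[kind] += 1
    action_loss = totals["action"] / max(counts["action"], 1)
    frame_loss = totals["image"] / max(counts["image"], 1)
    return action_loss + frame_weight * frame_loss, action_loss, frame_loss


uniform = [0.0, 0.0, 0.0, 0.0]
total, act, frame = joint_loss([uniform] * 4, [0, 1, 2, 3],
                               ["text", "image", "image", "action"], frame_weight=0.5)
print(round(total, 4))  # 1.5 * ln(4) ≈ 2.0794
```

If the weighting is off, one term can dominate and the model may settle for the easier objective, so it is worth monitoring action accuracy and frame-prediction quality separately.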

Implementation

Reference implementations

WorldVLA (Alibaba DAMO Academy)

Video Prediction Policy

Implementation pitfalls

Collapse onto the easier task · Critical

Poor action tokenization · High

High training compute cost · High

Sim-to-real gap in rollouts · Medium
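Action tokenization quality can be reasoned about concretely. A common scheme in token-based robot policies, assumed here, discretizes each action dimension into uniform bins. Decoding to the bin centre then incurs a worst-case error of half a bin width, (high - low) / (2 * n_bins). The helper names `make_tokenizer` and `max_roundtrip_error` and the [-1, 1] action range are hypothetical.

```python
def make_tokenizer(low: float, high: float, n_bins: int):
    """Uniform per-dimension action binning: float -> bin index -> bin centre."""
    width = (high - low) / n_bins

    def encode(x: float) -> int:
        x = min(max(x, low), high)  # clip actions outside the calibrated range
        return min(int((x - low) / width), n_bins - 1)  # x == high -> last bin

    def decode(token: int) -> float:
        return low + (token + 0.5) * width  # reconstruct at the bin centre

    return encode, decode


def max_roundtrip_error(n_bins: int, low: float = -1.0, high: float = 1.0,
                        samples: int = 10_001) -> float:
    """Largest |decode(encode(x)) - x| over an evenly spaced grid of actions."""
    encode, decode = make_tokenizer(low, high, n_bins)
    xs = [low + (high - low) * i / (samples - 1) for i in range(samples)]
    return max(abs(decode(encode(x)) - x) for x in xs)


for n in (8, 256):
    print(n, max_roundtrip_error(n))  # roughly (high - low) / (2 * n)
```

Finer bins shrink the reconstruction error but enlarge the action vocabulary, leaving fewer training examples per token; clipping also silently saturates any action outside the calibrated range.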

Evolution

Original paper · 2025 · Jun Cen

WorldVLA: Towards Autoregressive Action World Model

Jun Cen et al. (Alibaba DAMO Academy)

2018 · World Models (concept)

2023

2024

2025

2025

Sources

WorldVLA: Towards Autoregressive Action World Model

Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

arXiv / ICML 2025

arXiv (Ha & Schmidhuber)