What is a world model?
A world model is an artificial intelligence system that builds an internal, compressed representation of its surroundings and can predict the environment's future states based on current observations and planned actions.
The inspiration comes from how humans and animals navigate the world: a baseball player begins a swing before the brain has fully processed the image of the incoming ball — reacting not to reality, but to its own prediction of a future state.
It is worth stating up front what a world model is not:
- Not a single language model or an ordinary video generator. It is a class of architectures and approaches — from reinforcement learning algorithms, through predictive models, to generators of interactive 3D environments. The common denominator: instead of mapping input to output blindly, the system maintains an internal simulator of the environment's dynamics and uses it to plan.
- Not a foundation model. "Foundation" describes a model's scale and role in the ecosystem — large, pretrained on broad data, adaptable to many tasks. "World model" describes its function: an internal, predictive simulator of dynamics. The categories overlap — a foundation model from a family such as Cosmos or V-JEPA can be a world model, but the vast majority of foundation models, including typical language models, do not causally simulate the world and are not world models.
The key difference from classic reinforcement learning lies in sample efficiency. Traditional "model-free" systems learn by trial and error directly in the environment, which is extremely costly and builds no explicit model of cause-and-effect relationships. A world model lets an agent "dream" future events inside its own neural network before taking any real action.
Who is behind it?
The modern foundations of the field were laid in 2018 by David Ha and Jürgen Schmidhuber with the paper "World Models" (conference version: "Recurrent World Models Facilitate Policy Evolution"). It defined the framework that most later work on the approach builds on.
Today the most important AI labs work on world models, each in a slightly different direction. Danijar Hafner at Google DeepMind develops the Dreamer family of algorithms. Yann LeCun, Chief AI Scientist at Meta, champions the JEPA architecture as an alternative to generating pixels. Google DeepMind builds the Genie interactive-world generators, OpenAI calls its Sora video model a "world simulator," and NVIDIA develops the Cosmos platform for robotics. A separate, radical path was taken by Fei-Fei Li — co-creator of the ImageNet dataset — whose startup World Labs bets on "spatial intelligence" and 3D models. Autonomous vehicles are the focus of the British company Wayve with its GAIA-1 model.
How does it work?
The whole world-model mechanism boils down to three steps:
- Compression. First, the system compresses high-dimensional input — video frames, for example — into a compact vector in a so-called latent space, i.e. an internal representation made of just a handful of numbers, where the model stores the meaning of the scene rather than its pixels. This compression is not just about saving memory: it forces the model to abstract — discarding irrelevant detail (the colour of a wall, clouds drifting in the background) and keeping what truly governs the dynamics of the scene.
- Prediction. Next, a separate component predicts how that latent state will change in the following step — taking into account the history of previous states and the action the agent intends to take. This is the heart of a world model: a prediction engine that operates not on raw pixels but on abstract representations, which makes planning many times faster and cheaper.
- Decision. Finally, a simple decision network selects an action based on the current state and the predictions. Because the model can simulate the consequences of any action, the agent can be fully disconnected from the real environment and trained entirely inside the generated predictions — this is called "learning in dreams": the system plays out millions of virtual interactions without the physical limits of time, then carries the learned policy back to the real world.
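The three steps above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not a real system: the world is a point on a line, the "encoder" is a plain average, and the dynamics are hand-written rather than learned. The names encode, predict, decide, GOAL and ACTIONS are hypothetical.

```python
from itertools import product

GOAL = 5.0                   # assumed target position in the toy world
ACTIONS = (-1.0, 0.0, 1.0)   # assumed action set: step left, stay, step right

def encode(observation):
    """Compression: reduce a multi-value observation to one latent number."""
    return sum(observation) / len(observation)

def predict(latent, action):
    """Prediction: assumed latent dynamics (hand-written here, learned in practice)."""
    return latent + action

def imagined_return(latent, plan):
    """Roll the model forward without touching the environment and score the result."""
    total = 0.0
    for action in plan:
        latent = predict(latent, action)
        total -= abs(GOAL - latent)  # reward: being closer to the goal is better
    return total

def decide(latent, horizon=3):
    """Decision: try every short plan in imagination, return the best first action."""
    best_plan = max(
        product(ACTIONS, repeat=horizon),
        key=lambda plan: imagined_return(latent, plan),
    )
    return best_plan[0]

# An 8-value "image" of an agent near position 2, with small pixel-level variation.
observation = [2.0 + 0.01 * i for i in range(8)]
z = encode(observation)
action = decide(z)
print(f"latent={z:.3f}, chosen action={action}")
```

Every candidate plan is evaluated inside predict, so no real interaction is spent on plans that turn out badly. Real systems replace the exhaustive search with a trained policy or a sampling-based planner, since the number of plans grows exponentially with the horizon.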
What are its key components?
The classic architecture from Ha and Schmidhuber's paper splits the system into three cooperating components:
- V (Vision) — perception. Usually a variational autoencoder (VAE) that compresses the image into a low-dimensional latent vector.
- M (Memory) — memory and dynamics. Originally a recurrent network (RNN), today more often a Transformer or a state-space model. It models the passage of time and predicts the next latent state from history and action.
- C (Controller) — controller. A simple network (or a linear model) that, based on the state from V and predictions from M, decides on an action.
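The data flow between V, M and C can be sketched as a loop. The following is a minimal sketch with hand-picked weights, assuming a two-number latent and a recurrent state of the same size; it is not the architecture from the paper (which uses a VAE and a recurrent network), only the wiring between the three parts. All function names and numbers are hypothetical.

```python
import math

def vision(observation):
    """V: compress an observation into a small latent vector (mean and spread)."""
    mean = sum(observation) / len(observation)
    spread = max(observation) - min(observation)
    return [mean, spread]

def memory(hidden, latent, action):
    """M: fold the new latent and action into a recurrent hidden state."""
    # Fixed illustrative weights; a trained model would learn these.
    return [math.tanh(0.5 * h + 0.3 * z + 0.2 * action)
            for h, z in zip(hidden, latent)]

def controller(latent, hidden, weights, bias=0.0):
    """C: linear policy on the concatenation of latent and hidden state."""
    features = latent + hidden
    return math.tanh(sum(w * f for w, f in zip(weights, features)) + bias)

hidden = [0.0, 0.0]                # M starts with no history
weights = [0.4, -0.1, 0.7, 0.2]    # illustrative controller parameters
actions = []
for observation in ([0.1, 0.2, 0.3], [0.2, 0.4, 0.6], [0.3, 0.6, 0.9]):
    z = vision(observation)              # V sees the current frame
    a = controller(z, hidden, weights)   # C acts on the latent plus M's state
    hidden = memory(hidden, z, a)        # M absorbs what just happened
    actions.append(a)
print(actions)
```

Keeping C this small is a deliberate design choice in the classic setup: most of the capacity sits in V and M, and the controller has very few parameters to tune.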
This split still forms the skeleton of most implementations. Newer variants modify it: DreamerV3 relies on an RSSM (Recurrent State-Space Model) structure and compresses observations into discrete categorical representations rather than continuous distributions, which increases expressiveness. JEPA, in turn, deliberately removes the pixel-decoding step — the predictor guesses the abstract embeddings of future fragments, not their exact appearance.
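To make "discrete categorical representations" concrete, here is a minimal sketch of sampling such a latent: several independent groups of logits, each turned into one one-hot vector. The group count, the class count and the logit values are illustrative assumptions, and the gradient handling needed to train through the sampling step is left out.

```python
import math
import random

def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_categorical_latent(logit_groups, rng):
    """Sample one one-hot vector per group of logits and concatenate them."""
    latent = []
    for logits in logit_groups:
        probs = softmax(logits)
        index = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        latent.extend(1.0 if i == index else 0.0 for i in range(len(probs)))
    return latent

rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Hypothetical encoder output: 4 categorical variables with 3 classes each.
logit_groups = [[2.0, 0.1, -1.0], [0.0, 0.0, 3.0], [1.0, 1.0, 1.0], [-2.0, 4.0, 0.5]]
z = sample_categorical_latent(logit_groups, rng)
print(z)
```

The resulting latent is a vector of zeros and ones with exactly one active class per group, in contrast to a continuous latent where every entry is a real number.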
What can it be used for?
- Robotics and physical AI. The most mature application. Robots suffer from a data bottleneck: the internet has billions of words to train language models, but video recordings contain no information about the forces acting on a robot arm. World models such as V-JEPA 2 or NVIDIA Cosmos enable "zero-shot" robot control, that is, performing a new task with no training examples for it by drawing on general knowledge. A machine can perform tasks in a new environment because it can abstractly plan the consequences of a movement, instead of learning them through costly trials on physical hardware.
- Autonomous vehicles. The industry struggles with the "long tail" of rare road situations that are hard to collect from real driving. Waymo used Genie technology to build its own simulator generating realistic, extreme scenarios. Wayve uses its GAIA-1 model not only for testing but also as a decision core anticipating the behaviour of pedestrians and cyclists.
- AI agents, games and virtual environments. Worlds generated on the fly — like Genie or World Labs' Marble model — serve to train agents and to rapidly create interactive 3D spaces from a text description alone.
How does it differ from other approaches?
The most important distinction is between three paradigms that are easy to confuse.
Video generators (like Sora) produce astonishingly realistic clips, and their makers claim the model picks up the physics of the world along the way — on its own, just from the sheer scale of training. Critics are sceptical: a smooth, good-looking video still isn't an understanding of what is happening and why. In Sora's early demos, chairs floated for no reason and a candle flame didn't react to being blown on.
Predictive models without generation (JEPA) go in the opposite direction. Yann LeCun argues that predicting every pixel is a waste of compute and fails in uncertain environments — the model does not need to know exactly what a passer-by's shirt looks like; it is enough that it "understands" the person is moving. That is why JEPA predicts in an abstract space, without reconstructing the image.
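The contrast between pixel prediction and embedding prediction can be illustrated with a toy example. This is a sketch under strong assumptions: a "frame" is a row of ten numbers, the encoder is a hand-written stand-in that keeps only the person's position, and the predictor is a fixed rule. None of these functions come from an actual JEPA implementation.

```python
def make_frame(position, shirt, width=10):
    """Hypothetical 1-D frame: a person at `position` with shirt brightness `shirt`."""
    frame = [0.0] * width
    frame[position] = shirt
    return frame

def encode(frame):
    """Stand-in for a learned encoder: keeps position, discards shirt brightness."""
    position = max(range(len(frame)), key=lambda i: frame[i])
    return position / (len(frame) - 1)

def predict_embedding(embedding, step=1, width=10):
    """Stand-in predictor: the person moves `step` cells to the right."""
    return embedding + step / (width - 1)

def mse(a, b):
    """Mean squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

current = make_frame(position=3, shirt=0.9)
actual_next = make_frame(position=4, shirt=0.9)

# Pixel route: the guess gets the motion right but the shirt brightness wrong.
pixel_guess = make_frame(position=4, shirt=0.3)
pixel_loss = mse(pixel_guess, actual_next)

# Embedding route: only the abstract state (position) has to match.
embedding_loss = (predict_embedding(encode(current)) - encode(actual_next)) ** 2

print(f"pixel loss={pixel_loss:.4f}, embedding loss={embedding_loss:.4f}")
```

The pixel loss penalises the wrong shirt even though the motion was predicted correctly, while the embedding loss is (numerically) zero because the detail was never part of the target.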
Spatial intelligence models (World Labs, the Marble model) immediately generate a persistent, fully geometric 3D environment based on Gaussian splatting, which can be edited and exported to game engines. Fei-Fei Li argues that 2D video generators are not true world models at all, because they lack structural knowledge of three dimensions.
Key limitations and challenges
The first problem is representation collapse in JEPA-type architectures. Since the network is rewarded for matching its prediction to the original embedding, the simplest — and useless — strategy is to map everything to a single constant vector. Methods that prevent this still rely on heuristics — practical rules of thumb that usually work but lack solid theoretical grounding. That makes scaling harder.
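Collapse is easy to reproduce in miniature. The sketch below assumes an identity predictor and two hand-written encoders, one degenerate and one informative; the variance check at the end is one simple diagnostic, shown here as an illustration rather than as the method any particular JEPA variant uses.

```python
from statistics import pvariance

def collapsed_encoder(x):
    """Degenerate solution: every input maps to the same constant vector."""
    return [0.5, 0.5]

def informative_encoder(x):
    """Hypothetical healthy encoder: the embedding depends on the input."""
    return [sum(x) / len(x), max(x) - min(x)]

def prediction_loss(encoder, context, target):
    """Squared distance between embeddings (identity predictor assumed)."""
    return sum((a - b) ** 2 for a, b in zip(encoder(context), encoder(target)))

def embedding_variance(encoder, batch):
    """Average per-dimension variance across a batch; zero signals collapse."""
    embeddings = [encoder(x) for x in batch]
    per_dimension = [pvariance(dim) for dim in zip(*embeddings)]
    return sum(per_dimension) / len(per_dimension)

batch = [[0.0, 1.0, 2.0], [5.0, 5.0, 5.0], [1.0, 9.0, 2.0]]

# The collapsed encoder reaches a perfect loss on completely unrelated inputs...
print(prediction_loss(collapsed_encoder, batch[0], batch[2]))
# ...but its embeddings carry no information, which the variance exposes.
print(embedding_variance(collapsed_encoder, batch))
print(embedding_variance(informative_encoder, batch))
```

A loss of zero therefore says nothing on its own; the prediction objective has to be paired with some mechanism that keeps the embeddings from becoming constant.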
The second challenge is cost and scalability. Real-time interactive simulators such as Genie 3 run on powerful clusters and consume enormous resources at inference time. Dense 3D structures in models like Marble also place high demands on hardware.
The third problem is the sim-to-real gap and hallucinations. Even an advanced model can "invent" mechanisms in its latent space that break the laws of physics. If a trained agent discovers and exploits such a simulator flaw, its policy turns out to be useless or dangerous once transferred to a real robot. Minimising this gap is one of the research priorities.
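A toy version of this failure: a planner trusts a world model that is missing one fact about reality. The setup is entirely hypothetical (a line world, a wall the model never learned about, exhaustive planning), and is meant only to show how a plan that looks perfect in imagination can fall short in the real environment.

```python
from itertools import product

GOAL = 5   # assumed target position
WALL = 3   # real obstacle that the flawed model does not know about

def model_step(position, action):
    """Flawed world model: believes the agent can always move freely."""
    return position + action

def real_step(position, action):
    """Real environment: the agent cannot move past the wall."""
    return min(position + action, WALL)

def rollout(step, start, plan):
    """Apply a sequence of actions using the given step function."""
    position = start
    for action in plan:
        position = step(position, action)
    return position

def plan_in_model(start, horizon=5):
    """Pick the plan whose imagined end state is closest to the goal."""
    return min(
        product((-1, 0, 1), repeat=horizon),
        key=lambda plan: abs(GOAL - rollout(model_step, start, plan)),
    )

plan = plan_in_model(start=0)
imagined = rollout(model_step, 0, plan)
actual = rollout(real_step, 0, plan)
print(f"plan={plan}, imagined end={imagined}, real end={actual}")
```

In imagination the plan reaches the goal; in the real environment the same actions leave the agent stuck at the wall. The difference between the two end positions is a crude measure of the gap described above.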
Why does it matter?
World models are currently one of the most seriously considered answers to the question of what comes after language models. A common argument is that language alone captures only a shallow, discrete layer of intelligence: enough for conversation, but not enough to overcome Moravec's paradox, the observation that sensorimotor tasks which seem simple to humans are among the hardest for machines. A system that is meant to deftly grasp objects or safely drive a car needs more than next-word prediction.
Interestingly, despite a sharp dispute over method — pixels versus abstraction, 2D video versus 3D structure — Fei-Fei Li, Yann LeCun and OpenAI agree on the direction. The future of advanced AI is meant to rest on machines' ability to build rigorous models of their surroundings, plan over the long term under uncertainty, and reason in terms of cause and effect. For a reader following AI's development, this is a meaningful signal: the next wave of progress may play out not in chatbots, but in systems that truly understand how the physical world works — and can predict its next state.
World models are not yet a mature, deployable technology; for now they are a coherent vision linking robotics, autonomous vehicles and AI agents. The dispute over how best to build them is still being settled, and that is precisely why it is worth understanding what is at stake.
Sources
- David Ha, Jürgen Schmidhuber — "World Models" (2018) — worldmodels.github.io
- Google DeepMind — "Mastering Diverse Domains through World Models" (DreamerV3) — arxiv.org
- Yann LeCun — "A Path Towards Autonomous Machine Intelligence" (2022) — openreview.net
- Google DeepMind — "Genie: Generative Interactive Environments" — deepmind.google
- OpenAI — "Video generation models as world simulators" (Sora) — openai.com
- Wayve — "GAIA-1: A Generative World Model for Autonomous Driving" — wayve.ai
