The model learns a compressed representation of environment state and can predict future states given actions. The agent can "imagine" action consequences in the world model without executing them in reality, enabling planning and learning in imagination.
Agents that learn directly through environment interaction are sample-inefficient: they often need millions of environment steps. World models allow agents to plan and learn internally, reducing the need for costly real-world interactions.
Compresses high-dimensional environment observations (e.g., pixel images) into a low-dimensional latent space representation. In the original World Models (2018), this is implemented via a Variational Autoencoder (VAE). It is responsible for extracting salient spatial features from observations.
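As a rough illustration of the latent step in a VAE-style encoder, the sketch below samples a latent vector with the reparameterization trick and computes the KL regularizer against a standard normal prior. The encoder network itself is omitted; `mu` and `log_var` are hypothetical encoder outputs, and the function names are assumptions made for this example rather than code from the original paper.

```python
import math
import random

def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))

rng = random.Random(0)
mu = [0.5, -1.0, 0.0]        # hypothetical encoder output: per-dimension mean
log_var = [0.0, -2.0, 0.0]   # hypothetical encoder output: per-dimension log-variance

z = sample_latent(mu, log_var, rng)       # compressed latent representation
kl = kl_to_standard_normal(mu, log_var)   # regularization term of the VAE loss
```

In a full VAE this KL term would be added to a reconstruction loss; only the latent-space part is shown here.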
Predicts the next latent state from the current latent state and the agent's action. It forms the core of the world model: its capacity for temporal extrapolation enables the generation of synthetic trajectories. In the original World Models paper, this component is implemented as an MDN-RNN (an LSTM whose outputs parameterize a Mixture Density Network), so it predicts a probability distribution over the next latent state rather than a single point estimate.
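To make the mixture-density idea concrete, here is a minimal sketch of how one scalar latent dimension could be sampled from a mixture of Gaussians. The recurrent network that would produce `logits`, `means`, and `stds` is left out; those values and the function names are illustrative assumptions.

```python
import math
import random

def softmax(logits):
    """Turn unnormalized mixture logits into mixture weights."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_mdn(logits, means, stds, rng):
    """Pick a mixture component by weight, then sample from its Gaussian."""
    weights = softmax(logits)
    k = rng.choices(range(len(weights)), weights=weights)[0]
    return rng.gauss(means[k], stds[k]), k

rng = random.Random(0)
# Hypothetical network outputs for a single latent dimension with two modes.
logits = [0.0, 0.0]
means = [-1.0, 1.0]
stds = [0.1, 0.1]
samples = [sample_mdn(logits, means, stds, rng)[0] for _ in range(1000)]
```

Each sample comes from one of the two modes, so the predicted futures stay distinct instead of collapsing to their average.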
The agent's decision-making module, which maps the current latent state and the hidden state of the dynamics model to actions executed in the environment. In the original World Models architecture it is deliberately compact (a single linear layer) and trained separately from the world model with an evolutionary method (CMA-ES). In the paper's VizDoom experiment, this training takes place entirely inside generated "dreams".
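A linear controller of this kind can be sketched as a single affine map over the concatenated latent and hidden states. The dimensions used below (32, 256, 3) are illustrative assumptions, and the helper names are made up for this example.

```python
def controller_action(z, h, W, b):
    """Linear controller: a = W [z; h] + b (no hidden layers)."""
    x = z + h  # list concatenation of latent state and dynamics hidden state
    return [sum(w * xi for w, xi in zip(row, x)) + bias
            for row, bias in zip(W, b)]

def num_params(z_dim, h_dim, action_dim):
    """Weights plus biases of the linear map."""
    return (z_dim + h_dim) * action_dim + action_dim

# Tiny worked example: 2-dim latent, 1-dim hidden state, 2 actions.
z = [1.0, 2.0]
h = [3.0]
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 1.0]]
b = [0.5, -0.5]
action = controller_action(z, h, W, b)

# Illustrative sizes: 32-dim latent, 256-dim hidden state, 3 action outputs.
n = num_params(32, 256, 3)
```

With so few parameters, a gradient-free optimizer such as CMA-ES remains practical, which is the design motivation stated above.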
The mechanism for generating synthetic trajectories by unrolling a dynamics model over time — without interaction with the real environment. The agent "dreams": it initializes a latent state, then sequentially predicts subsequent states by applying the dynamics model and selecting actions via a controller. The resulting sequences are used for policy optimization.
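The dreaming loop described above can be sketched as follows. The dynamics model and controller are passed in as plain functions; `toy_dynamics` and `toy_controller` are hypothetical scalar stand-ins for learned networks, used only to make the loop runnable.

```python
def dream_rollout(z0, h0, dynamics, controller, horizon):
    """Unroll the dynamics model without touching the real environment."""
    z, h = z0, h0
    trajectory = []
    for _ in range(horizon):
        a = controller(z, h)                    # select an action
        z_next, h_next, reward = dynamics(z, h, a)  # predict the next state
        trajectory.append((z, a, reward))
        z, h = z_next, h_next
    return trajectory

def toy_dynamics(z, h, a):
    """Hypothetical stand-in for a learned dynamics model (scalar state)."""
    z_next = 0.9 * z + a
    h_next = 0.5 * h + 0.5 * z
    reward = -abs(z_next)  # reward for staying close to zero
    return z_next, h_next, reward

def toy_controller(z, h):
    """Hypothetical stand-in for a learned policy."""
    return -0.5 * z

trajectory = dream_rollout(1.0, 0.0, toy_dynamics, toy_controller, horizon=5)
imagined_return = sum(r for _, _, r in trajectory)
```

The resulting `(state, action, reward)` tuples are what a policy optimizer would consume in place of real environment transitions.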
An agent trained exclusively inside an imagined world model may discover policies that achieve high rewards within that imagination but fail to transfer to the real environment — by exploiting the model's prediction errors rather than learning genuine skills.
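Errors in the dynamics model compound with each step of the imagined trajectory, because every prediction is fed back as input to the next one. At long horizons (beyond roughly 20 steps, depending on the environment and model), imagined trajectories can deviate significantly from real ones, degrading policy quality.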
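A toy calculation shows how a small one-step error grows over a rollout. The setup is an assumption chosen for simplicity: the true system keeps a scalar state constant (gain 1.0), while the learned model overestimates the gain by 2%.

```python
def rollout(x0, gain, steps):
    """Repeatedly apply x <- gain * x and record every state."""
    xs, x = [], x0
    for _ in range(steps):
        x = gain * x
        xs.append(x)
    return xs

true_traj = rollout(1.0, 1.00, 30)    # ground-truth dynamics
model_traj = rollout(1.0, 1.02, 30)   # model with a 2% one-step error

# Absolute deviation between imagined and real state at each step.
errors = [abs(m - t) for m, t in zip(model_traj, true_traj)]
```

Here the deviation after step `k` equals `1.02**k - 1`, so a 2% one-step error grows to roughly 49% by step 20. Real dynamics models are not this simple, but the multiplicative feedback is the same mechanism.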
When an agent explores previously unseen regions of the environment, the dynamics model may fail to generalize correctly to those states, producing unrealistic imagined trajectories in new parts of the state space.
Environments with stochastic elements or multimodal distributions over future states are difficult to capture with deterministic dynamics models. Such models tend to average across modes rather than preserving multimodality, resulting in blurry and unreliable predictions.
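The averaging effect can be seen in a minimal example. Assume the next state is -1 or +1 with equal probability and a deterministic model is fit with mean squared error; the data and names below are illustrative.

```python
def mse(prediction, targets):
    """Mean squared error of a single point prediction against all targets."""
    return sum((prediction - t) ** 2 for t in targets) / len(targets)

# Bimodal future: half of the observed next states are -1, half are +1.
targets = [-1.0, 1.0] * 50

# The MSE-optimal point prediction is the mean of the targets.
best = sum(targets) / len(targets)

loss_at_mean = mse(best, targets)   # prediction between the modes
loss_at_mode = mse(1.0, targets)    # prediction that commits to one mode
```

The mean (0.0) has lower loss than either mode even though that state never occurs, which is why stochastic components such as mixture densities or stochastic latent states are used.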
Training a VAE on pixel images and a dynamics model on recorded experience sequences, and then training a policy on imagined rollouts, demands substantial GPU resources. DreamerV3 on complex environments such as Minecraft requires on the order of tens of GPU-days.
Jürgen Schmidhuber published a series of papers (1990a, 1990b, 1991a) formally defining the concept of a learnable world model and a separate controller trained through that model. These works established the foundations of the MBRL paradigm with internal simulation.
Ha and Schmidhuber formalize and demonstrate a three-component architecture (Vision: VAE, Memory: MDN-RNN, Controller: a small linear policy optimized with CMA-ES). On Car Racing, the controller is trained in the real environment on top of the world model's features; on VizDoom, it is trained entirely inside the imagined "dreams" of the world model and then transferred to the real environment.
Hafner et al. (Google Brain) propose PlaNet: a world model using a Recurrent State Space Model (RSSM) that combines deterministic and stochastic state transitions. Planning is performed by optimizing latent trajectories via CEM, without an actor model — the first pixel-level demonstration across multiple continuous control environments.
Hafner et al. combine RSSM with an actor-critic optimized solely via backpropagation through imagined trajectories. DreamerV1 outperforms model-free baselines on the DeepMind Control Suite benchmarks.
Schrittwieser et al. (DeepMind) publish MuZero, which combines MCTS with a learned model that predicts only rewards, values, and policies, without reconstructing observations. It achieves superhuman performance in Go, chess, and shogi and state-of-the-art results on Atari without being given the game rules.
Hafner et al. publish DreamerV3 — a generalized version of Dreamer using a single hyperparameter configuration that operates across more than 150 diverse tasks, including diamond collection in Minecraft. This is the first demonstration of world model RL generality across such a broad spectrum of environments.
Bruce et al. (Google DeepMind) publish Genie — a world model trained on unlabeled internet videos, capable of generating interactive 2D environments controlled by learned latent actions. This extends the world models paradigm to generative environment simulators.
The dynamics model (RNN/RSSM) requires sequential processing of time steps, which limits parallelism during training. With long imagination horizons (e.g., 15–50 steps in Dreamer), the cost of training the actor-critic via backpropagation through the unrolled dynamics model can become a dominant computational expense.
The size of the latent vector generated by the observation encoder. It determines representation capacity and the degree of information compression. If it is too small, information is lost; if it is too large, controller training slows down.
The size of the RNN or RSSM hidden state determines the capacity of the dynamics model to retain history and predict future states.
The number of timesteps simulated internally by the dynamics model when generating an imagined trajectory for policy training. A longer horizon improves long-term planning at the cost of increased compounding errors and computational overhead.
The architecture used to model transitions between latent states over time. This choice determines the model's ability to capture the complexity of environment dynamics.
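The four hyperparameters described above could be grouped into a single configuration object, as in the sketch below. The field names, default values, and the set of allowed dynamics types are assumptions for illustration and are not taken from any particular implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WorldModelConfig:
    latent_dim: int = 32            # size of the encoder's latent vector
    hidden_dim: int = 256           # size of the RNN/RSSM hidden state
    imagination_horizon: int = 15   # imagined steps per training rollout
    dynamics_type: str = "rssm"     # architecture of the transition model

    def __post_init__(self):
        # Basic sanity checks on the hypothetical settings.
        if self.latent_dim <= 0 or self.hidden_dim <= 0:
            raise ValueError("latent_dim and hidden_dim must be positive")
        if self.imagination_horizon < 1:
            raise ValueError("imagination_horizon must be at least 1")
        if self.dynamics_type not in ("rnn", "rssm", "transformer"):
            raise ValueError("unknown dynamics_type: " + self.dynamics_type)

config = WorldModelConfig()
long_horizon = WorldModelConfig(imagination_horizon=50)
```

Keeping these values in one immutable object makes it easier to compare runs that differ only in, say, the imagination horizon.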
Standard world models (VAE + RNN/RSSM + controller) use dense neural networks without routing or sparse activation. MuZero relies solely on a deterministic dynamics network without observation reconstruction — an architecturally simplified approach, but still dense.
Training the perception model (encoder) is fully parallel (batch processing). Training the dynamics model (RNN/RSSM) is sequential along the time dimension, but parallel across batch elements.
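The difference between the two parallelism patterns can be sketched with stand-in functions. `encode` and `dynamics_step` are hypothetical placeholders for the real networks; what matters is the data dependency, not the arithmetic.

```python
def encode(frame):
    """Stand-in encoder: each frame is processed independently of the others."""
    return sum(frame) / len(frame)

def dynamics_step(h, z):
    """Stand-in recurrent update: the new hidden state depends on the previous one."""
    return 0.5 * h + 0.5 * z

def unroll(latents, h0=0.0):
    """Sequential over time: step t cannot start before step t-1 has finished."""
    h, hidden = h0, []
    for z in latents:
        h = dynamics_step(h, z)
        hidden.append(h)
    return hidden

# Batch of 2 sequences, 3 frames each, 2 "pixels" per frame.
batch = [
    [[0.0, 2.0], [4.0, 4.0], [1.0, 3.0]],
    [[1.0, 1.0], [0.0, 0.0], [2.0, 6.0]],
]
T = len(batch[0])

# Encoder: flatten batch and time, since all frames are independent.
flat_latents = [encode(frame) for seq in batch for frame in seq]
latents = [flat_latents[i * T:(i + 1) * T] for i in range(len(batch))]

# Dynamics: sequences are independent of each other, time steps are not.
hidden = [unroll(seq) for seq in latents]
```

On an accelerator, the flattened encoder call would be one large batched operation, while the dynamics model needs `T` dependent steps, each batched only across sequences.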
Training world models — particularly the encoder (VAE/CNN), dynamics model (RNN/RSSM/Transformer), and actor-critic — is dominated by matrix operations executed efficiently by GPU tensor cores. DreamerV3 is trained on V100/A100 GPUs.