Architecture

World Models

Key innovation

Formalizes a paradigm in which an agent learns an internal model of environment dynamics and trains its control policy entirely within that model's generated simulations, dramatically improving sample efficiency.

How it works

The model learns a compressed representation of environment state and can predict future states given actions. The agent can "imagine" action consequences in the world model without executing them in reality, enabling planning and learning in imagination.

Problem solved

AI agents that learn directly through environment interaction are sample-inefficient, often requiring millions of environment steps. World models let agents plan and learn internally, reducing the need for costly real-world interaction.

Components

Perception Model / Observation Encoder (V)

Dimensionality reduction of observations: converts raw sensory data into a compact latent representation.

Compresses high-dimensional environment observations (e.g., pixel images) into a low-dimensional latent space representation. In the original World Models (2018), this is implemented via a Variational Autoencoder (VAE). It is responsible for extracting salient spatial features from observations.

Variational Autoencoder (VAE)

CNN encoder

RSSM — stochastic and deterministic representation
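
For the VAE variant of the encoder, the two operations that distinguish it from a plain autoencoder are the reparameterized sampling of the latent vector and the KL penalty toward a standard normal prior. The sketch below shows only those two pieces in plain Python; the values of `mu` and `log_var` are hypothetical encoder outputs, and the convolutional encoder and decoder themselves are omitted.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), element-wise."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions."""
    return sum(0.5 * (m * m + math.exp(lv) - 1.0 - lv)
               for m, lv in zip(mu, log_var))

rng = random.Random(0)
mu, log_var = [0.5, -1.0], [0.0, -2.0]   # hypothetical encoder outputs
z = reparameterize(mu, log_var, rng)     # latent vector passed to M and C
kl = kl_to_standard_normal(mu, log_var)  # regularization term of the VAE loss
```

The KL term is zero only when the encoder outputs exactly the prior (zero mean, unit variance), which is what keeps the latent space compact and smooth enough for the dynamics model to operate on.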

Environment Dynamics Model (M)

Temporal modeling of environment dynamics: predicts future latent states conditioned on agent actions.

Predicts the next latent state from the current latent state and the agent's action. It forms the core of the world model: its capacity for temporal extrapolation enables the generation of synthetic trajectories. In the original World Models paper, this component is an MDN-RNN (an LSTM whose output parameterizes a Mixture Density Network), so it predicts a probability distribution over the next latent state rather than a single point.

MDN-RNN (Mixture Density Network + LSTM)

RSSM (Recurrent State Space Model)

MuZero Dynamics Network
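
For the MDN-RNN variant, the recurrent network outputs the parameters of a Gaussian mixture, and the next latent value is drawn from it. The sketch below shows that sampling step for one scalar latent dimension. The temperature handling (dividing the mixture logits by the temperature and scaling the standard deviation by its square root) is an assumption modeled on a common convention, and the function name and inputs are hypothetical.

```python
import math
import random

def sample_mdn(logits, means, sigmas, temperature, rng):
    """Draw one scalar from a Gaussian mixture, with temperature scaling."""
    # Softmax over temperature-scaled mixture logits (max-subtracted for stability).
    scaled = [l / temperature for l in logits]
    top = max(scaled)
    weights = [math.exp(s - top) for s in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Pick a mixture component, then sample from its widened Gaussian.
    k = rng.choices(range(len(probs)), weights=probs)[0]
    return means[k] + sigmas[k] * math.sqrt(temperature) * rng.gauss(0.0, 1.0)

rng = random.Random(0)
# Hypothetical two-mode prediction: the next latent is near -1 or near +1.
samples = [sample_mdn([0.0, 0.0], [-1.0, 1.0], [0.05, 0.05], 1.0, rng)
           for _ in range(1000)]
```

Because a component is chosen before sampling, the draws stay close to one of the two modes instead of collapsing to their average.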

Controller / Policy (C)

Maps the internal state (latent + RNN hidden state) to agent actions; optimized against reward within generated trajectories.

The agent's decision-making module, which maps the current latent state and the hidden state of the dynamics model to actions executed in the environment. In the original World Models architecture, it is deliberately compact (linear or a small MLP) and trained separately from the world model with an evolution strategy (CMA-ES). In the VizDoom experiment it was trained entirely on generated "dreams"; in the Car Racing experiment it was trained on rollouts in the actual environment.

Linear Controller

Latent-Space Actor-Critic
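
A linear controller has so few parameters that gradient-free search is practical. The sketch below pairs a linear controller with a deliberately simplified evolution strategy (sample around a mean, keep the elites, average them). This is a stand-in for CMA-ES, not CMA-ES itself, and the `fitness` function is a hypothetical placeholder for the return of a rollout.

```python
import math
import random

def act(params, z, h):
    """Linear controller: a = tanh(w . [z, h] + b), one scalar action."""
    x = z + h                       # concatenate latent and hidden state lists
    w, b = params[:-1], params[-1]
    return math.tanh(sum(wi * xi for wi, xi in zip(w, x)) + b)

def fitness(params):
    """Hypothetical stand-in for a rollout return: match a target action."""
    z, h = [0.3, -0.2], [0.1]
    return -(act(params, z, h) - 0.5) ** 2

def evolve(n_params, generations=50, pop=16, elite=4, sigma=0.3, seed=0):
    """Simplified elite-averaging evolution strategy (not full CMA-ES)."""
    rng = random.Random(seed)
    mean = [0.0] * n_params
    for _ in range(generations):
        candidates = [[m + sigma * rng.gauss(0.0, 1.0) for m in mean]
                      for _ in range(pop)]
        candidates.sort(key=fitness, reverse=True)
        elites = candidates[:elite]
        mean = [sum(c[i] for c in elites) / elite for i in range(n_params)]
    return mean

best = evolve(n_params=4)           # 3 weights (2 latent + 1 hidden) + 1 bias
```

In a real setup, `fitness` would run the controller through real or imagined episodes and return the cumulative reward; CMA-ES additionally adapts the full covariance of the search distribution.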

Imagined Trajectory Generation (Dreaming)

Generates synthetic training data for the controller via internal simulation, replacing costly interactions with the real environment.

The mechanism for generating synthetic trajectories by unrolling a dynamics model over time — without interaction with the real environment. The agent "dreams": it initializes a latent state, then sequentially predicts subsequent states by applying the dynamics model and selecting actions via a controller. The resulting sequences are used for policy optimization.
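
The unrolling loop described above can be sketched in a few lines. Everything here is a toy assumption: `dynamics` stands in for a learned model over a scalar latent, `controller` for a trained policy, and the reward is an arbitrary objective chosen so the example runs on its own.

```python
import random

def dynamics(z, h, a, rng):
    """Hypothetical learned model: returns next latent, next hidden, reward."""
    h_next = 0.9 * h + 0.1 * z
    z_next = z + 0.1 * a + 0.01 * rng.gauss(0.0, 1.0)
    reward = -abs(z_next)           # toy objective: drive the latent toward zero
    return z_next, h_next, reward

def controller(z, h):
    """Hypothetical policy: push the latent back toward zero, clipped to [-1, 1]."""
    return max(-1.0, min(1.0, -2.0 * z))

def dream(z0, horizon, seed=0):
    """Unroll the model from z0 without touching a real environment."""
    rng = random.Random(seed)
    z, h, trajectory = z0, 0.0, []
    for _ in range(horizon):
        a = controller(z, h)
        z_next, h, r = dynamics(z, h, a, rng)
        trajectory.append((z, a, r))
        z = z_next
    return trajectory

traj = dream(z0=1.0, horizon=15)
imagined_return = sum(r for _, _, r in traj)
```

The list of `(state, action, reward)` tuples is the synthetic data that a policy optimizer would consume in place of real environment transitions.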

Implementation

Reference implementations

World Models (original Ha & Schmidhuber implementation, TensorFlow)

Python · David Ha

Official

DreamerV3 (Hafner et al., JAX)

Python · Danijar Hafner

Official

PlaNet (Hafner et al., TensorFlow)

Python · Google Research

Official

Implementation pitfalls

Exploitation of World Model Imperfections by the Agent · Critical

An agent trained exclusively inside an imagined world model may discover policies that achieve high rewards within that imagination but fail to transfer to the real environment — by exploiting the model's prediction errors rather than learning genuine skills.

Fix: Increase the temperature of the stochastic dynamics model when sampling dreams, so that the imagined environment is noisier and harder to exploit (the approach used by Ha & Schmidhuber). Regularly validate the policy in the real environment. Use pessimistic planners that penalize model uncertainty.
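
One way to make a planner pessimistic is to subtract an uncertainty estimate from the predicted reward. The sketch below assumes an ensemble of reward predictors and uses their disagreement as the uncertainty signal; the ensemble, the `penalty` weight, and the example numbers are illustrative assumptions, not part of any specific published method.

```python
import statistics

def pessimistic_reward(predicted_rewards, penalty=1.0):
    """Mean ensemble reward minus a penalty proportional to disagreement."""
    mean = statistics.fmean(predicted_rewards)
    spread = statistics.pstdev(predicted_rewards)
    return mean - penalty * spread

# Hypothetical reward predictions from a 4-member model ensemble.
agree = [1.0, 1.1, 0.9, 1.0]        # members agree: well-modeled state
disagree = [3.0, -1.0, 2.5, -0.5]   # members disagree: likely model error

safe_value = pessimistic_reward(agree)
risky_value = pessimistic_reward(disagree)
```

Both states have the same mean predicted reward, but the state where the ensemble disagrees is scored much lower, so the policy is steered away from regions where the model is probably wrong.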

Prediction Error Accumulation over Long Imagination Horizons · High

Errors in the dynamics model accumulate over each step of the imagined trajectory. At long horizons (>20 steps), imagined trajectories can deviate significantly from real ones, degrading policy quality.

Fix: Limit the imagination horizon to values where cumulative errors remain acceptable. Apply uncertainty calibration techniques. Train the dynamics model on diverse inputs, including actions produced by the trained policy (on-policy data).
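
A toy calculation illustrates why the horizon matters. The sketch below assumes scalar linear dynamics and a model whose one-step coefficient is off by 2%; both numbers are hypothetical and chosen only to make the compounding visible.

```python
def rollout(coef, x0, horizon):
    """Unroll x_{t+1} = coef * x_t for `horizon` steps."""
    xs = [x0]
    for _ in range(horizon):
        xs.append(coef * xs[-1])
    return xs

TRUE_COEF, MODEL_COEF = 1.00, 1.02   # hypothetical 2% one-step model error
real = rollout(TRUE_COEF, 1.0, 40)
imagined = rollout(MODEL_COEF, 1.0, 40)
gap = [abs(r - m) for r, m in zip(real, imagined)]
```

Here the gap is about 0.02 after one step, about 0.49 after 20 steps, and about 1.21 after 40 steps: an error that looks negligible per step dominates the prediction at long horizons.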

Catastrophic forgetting in the dynamics model under distribution shift · High

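When an agent explores previously unseen regions of the environment, the dynamics model may fail to generalize correctly to those states, producing unrealistic imagined trajectories in new parts of the state space. Conversely, if the model is then trained only on the newest data, it can forget the dynamics of regions visited earlier.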

Fix: Use a replay buffer containing data collected throughout training. Train the dynamics model on a mixture of old and new data. Apply adaptive data collection to ensure adequate state-space coverage.
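
A minimal sketch of such a buffer is shown below. The class name, the `recent_fraction` and `recent_window` parameters, and the 50/50 split are illustrative assumptions; sampling is with replacement, and the buffer must be non-empty before `sample` is called.

```python
import random
from collections import deque

class ReplayBuffer:
    """Bounded FIFO buffer that mixes recent and historical transitions."""

    def __init__(self, capacity, seed=0):
        self.data = deque(maxlen=capacity)   # oldest items are evicted first
        self.rng = random.Random(seed)

    def add(self, transition):
        self.data.append(transition)

    def sample(self, batch_size, recent_fraction=0.5, recent_window=100):
        """Return recent samples first, followed by uniform samples."""
        items = list(self.data)
        n_recent = int(batch_size * recent_fraction)
        recent = self.rng.choices(items[-recent_window:], k=n_recent)
        uniform = self.rng.choices(items, k=batch_size - n_recent)
        return recent + uniform

buffer = ReplayBuffer(capacity=1000)
for step in range(1000):
    buffer.add(step)                 # integers stand in for real transitions
batch = buffer.sample(32)
```

Half of each batch comes from the latest 100 transitions, so the model adapts to newly explored states, while the uniform half keeps older regions of the state space in the training signal.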

Difficulty modeling stochastic and multimodal environments · High

Environments with stochastic elements or multimodal distributions over future states are difficult to capture with deterministic dynamics models. Such models tend to average across modes rather than preserving multimodality, resulting in blurry and unreliable predictions.

Fix: Use models with an explicit stochastic component (RSSM, MDN-RNN, diffusion). Model uncertainty via calibrated distributions rather than point predictions. Avoid MSE as the sole reconstruction criterion.
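
The mode-averaging effect can be seen on a toy dataset where the next state is either -1 or +1 with equal probability. The sketch below compares the MSE-optimal point prediction with two likelihood-based fits; the dataset and the mixture parameters are hypothetical and hand-picked rather than learned.

```python
import math
import statistics

def gaussian_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def mixture_nll(xs, weights, means, stds):
    """Average negative log-likelihood of xs under a Gaussian mixture."""
    total = 0.0
    for x in xs:
        p = sum(w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds))
        total -= math.log(p)
    return total / len(xs)

targets = [-1.0, 1.0] * 50                    # bimodal next-state targets
point = statistics.fmean(targets)             # MSE-optimal constant prediction
single = mixture_nll(targets, [1.0], [point], [statistics.pstdev(targets)])
bimodal = mixture_nll(targets, [0.5, 0.5], [-1.0, 1.0], [0.1, 0.1])
```

The MSE-optimal prediction is 0, a state that never occurs in the data, and the two-component mixture reaches a much lower negative log-likelihood than the single Gaussian centered there.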

High computational cost in scalable visual environments · Medium

Training a VAE on pixel images and a dynamics model on imagined sequences demands substantial GPU resources. DreamerV3 on complex environments such as Minecraft requires tens of GPU-days.

Fix: Use low-dimensional state inputs instead of pixels where possible. Compress the latent space aggressively. Apply mixed-precision training and compiled, accelerator-efficient implementations (e.g., JAX).

Evolution

Original paper · 2018 · NeurIPS 2018 · David Ha

Recurrent World Models Facilitate Policy Evolution

David Ha, Jürgen Schmidhuber

1990

Schmidhuber — first formal work on RNN-based world models and controllers

Inflection point

Jürgen Schmidhuber published a series of papers (1990a, 1990b, 1991a) formally defining the concept of a learnable world model and a separate controller trained through that model. These works established the foundations of model-based reinforcement learning (MBRL) with internal simulation.

2018

Ha & Schmidhuber — World Models: V-M-C with VAE, MDN-RNN, and evolutionary controller

Inflection point

Ha and Schmidhuber formalize and demonstrate a three-component architecture (Vision: VAE, Memory: MDN-RNN, Controller: a compact policy trained with CMA-ES). On VizDoom, they show that a controller can be trained entirely inside the imagined "dreams" of a world model and then transferred back to the actual environment. On Car Racing, the controller is trained in the actual environment on top of the features learned by V and M.

Recurrent World Models Facilitate Policy Evolution (paper)

2019

PlaNet (Hafner et al.) — latent-space planning via RSSM

Inflection point

Hafner et al. (Google Brain) propose PlaNet: a world model built on a Recurrent State Space Model (RSSM) that combines deterministic and stochastic state transitions. Planning is performed online by optimizing action sequences in latent space with the cross-entropy method (CEM), without a learned actor network. PlaNet solves several continuous control tasks directly from pixel observations.
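
The cross-entropy method used for planning can be sketched as follows. The latent model here is a hypothetical one-dimensional toy (`imagined_return`), and the population size, elite count, and iteration count are arbitrary illustrative values rather than PlaNet's settings.

```python
import random
import statistics

def imagined_return(z0, actions):
    """Hypothetical latent model: z moves by 0.1 * a; reward is -|z - 1|."""
    z, total = z0, 0.0
    for a in actions:
        z += 0.1 * max(-1.0, min(1.0, a))   # actions are clipped to [-1, 1]
        total -= abs(z - 1.0)
    return total

def cem_plan(z0, horizon=10, iters=8, pop=64, elite=8, seed=0):
    """Iteratively refit a Gaussian over action sequences to the best samples."""
    rng = random.Random(seed)
    mean, std = [0.0] * horizon, [1.0] * horizon
    for _ in range(iters):
        plans = [[rng.gauss(m, s) for m, s in zip(mean, std)] for _ in range(pop)]
        plans.sort(key=lambda p: imagined_return(z0, p), reverse=True)
        elites = plans[:elite]
        # Refit the sampling distribution to the elite action sequences.
        mean = [statistics.fmean([p[t] for p in elites]) for t in range(horizon)]
        std = [statistics.pstdev([p[t] for p in elites]) + 1e-3 for t in range(horizon)]
    return mean

plan = cem_plan(z0=0.0)
first_action = plan[0]   # in a receding-horizon loop, only this would be executed
```

In a receding-horizon setup, only the first action of the plan is executed before replanning from the next observation, which limits the impact of model errors later in the sequence.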

Learning Latent Dynamics for Planning from Pixels (paper)

2020

DreamerV1 (Hafner et al.) — actor-critic trained entirely in imagination

Inflection point

Hafner et al. combine RSSM with an actor-critic optimized solely via backpropagation through imagined trajectories. DreamerV1 outperforms model-free baselines on the DeepMind Control Suite benchmarks.

Dream to Control: Learning Behaviors by Latent Imagination (paper)

2020

MuZero (DeepMind) — world model without observation reconstruction

Inflection point

Schrittwieser et al. (DeepMind) publish MuZero: a learned model trained to predict only rewards, values, and policies, without reconstructing observations, combined with Monte Carlo tree search (MCTS). It matches AlphaZero's superhuman performance in Go, chess, and shogi and sets a new state of the art on Atari, without being given the game rules.

Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model (paper)

2023

DreamerV3 — generalist algorithm across 150+ tasks

Inflection point

Hafner et al. publish DreamerV3: a general version of Dreamer that uses a single hyperparameter configuration across more than 150 diverse tasks. According to the paper, it is the first algorithm to collect diamonds in Minecraft from scratch, without human data or curricula.

Mastering Diverse Domains through World Models (paper)

2024

Genie (Google DeepMind) — interactive world model generating environments from video

Bruce et al. (Google DeepMind) publish Genie — a world model trained on unlabeled internet videos, capable of generating interactive 2D environments controlled by learned latent actions. This extends the world models paradigm to generative environment simulators.

Genie: Generative Interactive Environments (paper)
