Model-Based RL

1991 · Active · Published: 8 June 2026 · Updated: 8 June 2026

Key innovation

The agent learns a model of the environment dynamics and uses it either to plan or to generate synthetic experience, which can substantially improve sample efficiency compared to model-free RL.

How it works

The MBRL loop repeats three steps:

(1) Data collection: the agent acts in the environment with an exploration policy and stores transitions (s,a,r,s').

(2) Model learning: a dynamics model (deterministic, probabilistic, ensemble, or latent, such as a recurrent state-space model, RSSM) is trained on the collected data to predict s' and r.

(3) Model usage: options include planning (CEM/MPC, or MCTS as in MuZero), training the policy on model rollouts (Dyna, Dreamer), or differentiating the policy objective directly through the model (analytic policy gradient, SVG, PILCO).

The updated policy then collects more data, and the model is retrained on the enlarged dataset.

Key techniques: ensembles for uncertainty estimation (PETS), finite-horizon planning, regularization against model exploitation (for example KL terms), and latent representations for high-dimensional observations (Dreamer, RSSM).
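The three-step loop can be sketched in a few lines. This is a minimal illustration under stated assumptions, not any published algorithm: the toy point-mass environment, the linear least-squares model, the known reward function, and the random-shooting planner are all hypothetical stand-ins chosen for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy environment: a point mass with state (position, velocity).
A_TRUE = np.array([[1.0, 0.1], [0.0, 1.0]])
B_TRUE = np.array([0.0, 0.1])

def env_step(s, a):
    """True dynamics, unknown to the agent."""
    return A_TRUE @ s + B_TRUE * a

def reward(s, a):
    """Assumed known here; a full agent would usually learn it as well."""
    return -(s[0] ** 2 + 0.1 * s[1] ** 2 + 0.01 * a ** 2)

def fit_model(buffer):
    """Step 2: least-squares fit of s' from the input vector [s, a]."""
    X = np.array([np.append(s, a) for s, a, _, _ in buffer])
    Y = np.array([s2 for _, _, _, s2 in buffer])
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W  # shape (3, 2): rows for position, velocity, action

def plan(W, s, horizon=10, n_candidates=100):
    """Step 3: random-shooting MPC, scoring action sequences in the model."""
    actions = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon))
    returns = np.zeros(n_candidates)
    for i in range(n_candidates):
        sim = s.copy()
        for a in actions[i]:
            returns[i] += reward(sim, a)
            sim = np.append(sim, a) @ W  # imagined next state
    return actions[np.argmax(returns), 0]  # execute only the first action

buffer, W = [], None
for iteration in range(3):
    s = np.array([1.0, 0.0])
    for t in range(30):
        # Step 1: explore randomly until a model exists, then act via MPC.
        a = rng.uniform(-1.0, 1.0) if W is None else plan(W, s)
        s_next = env_step(s, a)
        buffer.append((s, a, reward(s, a), s_next))
        s = s_next
    W = fit_model(buffer)  # retrain on all data collected so far
    print(f"iteration {iteration}: final |position| = {abs(s[0]):.3f}")
```

Because the toy dynamics are linear and noise-free, the fitted matrix recovers them almost exactly after the first batch; real tasks need the richer model classes listed under Components.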

Problem solved

Model-free RL can require millions or even billions of environment interactions, which is infeasible for real robots and costly for expensive simulators. MBRL reduces the number of real samples needed by training the policy in "imagination" (on model rollouts) or by planning with the learned model.

Components

Dynamics model: predicts s' from (s,a)

A neural network or probabilistic model learning f(s,a) → s'. Can be deterministic, probabilistic (Gaussian), an ensemble, or latent (RSSM).

Deterministic MLP: Single network predicting s'.

Probabilistic ensemble (PETS): Multiple probabilistic networks for epistemic and aleatoric uncertainty estimation.

Latent dynamics (RSSM, Dreamer): Dynamics in latent space, encoder from pixels.

Gaussian Process (PILCO): GP models dynamics with analytic uncertainty propagation.
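To make the two kinds of uncertainty in the ensemble variant concrete, here is a hedged sketch. It is not the PETS implementation: bootstrapped polynomial regressors stand in for probabilistic neural networks, and the 1-D system, feature map, and ensemble size are assumptions made for illustration. Disagreement between members serves as the epistemic estimate, and the average residual variance as the aleatoric estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D system: s' = 0.9*s + 0.5*a + noise, observed only for s in [-1, 1].
N = 200
S = rng.uniform(-1.0, 1.0, size=N)
A = rng.uniform(-1.0, 1.0, size=N)
S_next = 0.9 * S + 0.5 * A + 0.05 * rng.standard_normal(N)

def features(s, a):
    # Polynomial features leave room for members to disagree away from the data.
    s, a = np.atleast_1d(s), np.atleast_1d(a)
    return np.stack([np.ones_like(s), s, a, s ** 2, s ** 3], axis=-1)

def fit_member(idx):
    """Fit one member on a bootstrap resample; return weights and noise variance."""
    X = features(S[idx], A[idx])
    w, *_ = np.linalg.lstsq(X, S_next[idx], rcond=None)
    noise_var = np.mean((X @ w - S_next[idx]) ** 2)
    return w, noise_var

members = [fit_member(rng.integers(0, N, size=N)) for _ in range(5)]

def predict(s, a):
    X = features(s, a)
    means = np.array([X @ w for w, _ in members])    # (n_members, n_points)
    epistemic = means.var(axis=0)                    # disagreement between members
    aleatoric = np.mean([v for _, v in members])     # estimated observation noise
    return means.mean(axis=0), epistemic, aleatoric

_, epi_in, alea = predict(0.0, 0.0)   # inside the training range
_, epi_out, _ = predict(5.0, 0.0)     # far outside the training range
print(f"epistemic in-range: {epi_in[0]:.2e}, out-of-range: {epi_out[0]:.2e}")
print(f"aleatoric (noise variance) estimate: {alea:.4f}")
```

The epistemic term grows when the query leaves the region covered by data, which is the signal a planner can use to avoid poorly modeled states.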

Reward model: predicts r(s,a) or r(s)

A reward function usually learned jointly with dynamics; required for planning and imagination-based RL.
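One simple way to learn reward and dynamics jointly is to regress both targets from shared features. The sketch below assumes a hypothetical 1-D system with a quadratic reward and uses least squares on hand-picked polynomial features; neural agents typically use a shared network body with separate output heads instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: s' = 0.9*s + 0.5*a and r = -s^2 (both noise-free for clarity).
S = rng.uniform(-1.0, 1.0, size=500)
A = rng.uniform(-1.0, 1.0, size=500)
S_next = 0.9 * S + 0.5 * A
R = -S ** 2

def features(s, a):
    # Shared features, rich enough to represent both targets exactly.
    return np.stack([np.ones_like(s), s, a, s * a, s ** 2, a ** 2], axis=-1)

# One regression with two output columns: next state and reward.
Y = np.stack([S_next, R], axis=-1)
W, *_ = np.linalg.lstsq(features(S, A), Y, rcond=None)

def predict(s, a):
    out = features(np.atleast_1d(s), np.atleast_1d(a)) @ W
    return out[:, 0], out[:, 1]  # (predicted s', predicted r)

s_pred, r_pred = predict(0.5, -0.2)
print(s_pred[0], r_pred[0])
```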

Planner / policy: selects actions using the model

Decision-making component: a planner (CEM, MPPI, MCTS) or a trained policy (actor-critic in imagination, e.g. Dreamer).

CEM / MPPI: Sample-based planning over a population of trajectories.

MCTS (MuZero, AlphaZero): Monte Carlo tree search with value/policy network heuristics.

Actor-critic in imagination (Dreamer): Policy and value trained on model rollouts.
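As an illustration of sample-based planning, here is a minimal cross-entropy method (CEM) planner. The function name `cem_plan`, its parameters, the action bounds of [-1, 1], and the point-mass `model` and `reward_fn` are all hypothetical; in practice the model would be a learned network and the population sizes are tuned per task.

```python
import numpy as np

def cem_plan(model, reward_fn, s0, horizon=15, pop=200, n_elite=20, iters=5, seed=0):
    """Return the first action and the mean action sequence found by CEM."""
    rng = np.random.default_rng(seed)
    mean, std = np.zeros(horizon), np.ones(horizon)
    for _ in range(iters):
        # Sample a population of action sequences around the current mean.
        seqs = np.clip(mean + std * rng.standard_normal((pop, horizon)), -1.0, 1.0)
        returns = np.zeros(pop)
        s = np.tile(s0, (pop, 1))
        for t in range(horizon):          # roll all candidates through the model
            returns += reward_fn(s, seqs[:, t])
            s = model(s, seqs[:, t])
        # Refit the sampling distribution to the best-scoring sequences.
        elite = seqs[np.argsort(returns)[-n_elite:]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean[0], mean

# Hypothetical stand-in for a learned model: batched point-mass dynamics.
def model(s, a):
    pos, vel = s[:, 0], s[:, 1]
    return np.stack([pos + 0.1 * vel, vel + 0.1 * a], axis=1)

def reward_fn(s, a):
    return -(s[:, 0] ** 2 + 0.1 * s[:, 1] ** 2 + 0.01 * a ** 2)

first_action, plan = cem_plan(model, reward_fn, np.array([1.0, 0.0]))
print(f"first planned action: {first_action:.2f}")
```

In an MPC loop only `first_action` is executed; the planner is rerun from the next observed state, which limits how far model errors can accumulate.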

Replay buffer: storage of real transitions (s,a,r,s')

Experience buffer used to train the model and often the policy as well (Dyna).
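The Dyna-style use of the buffer, where the same stored transitions feed both the model and the policy, can be sketched with tabular Q-learning. The six-state chain, the step cap, and all hyperparameters below are illustrative assumptions rather than values from the Dyna paper.

```python
import random

random.seed(0)

N_STATES, GOAL = 6, 5      # hypothetical chain: states 0..5, reward at the right end
ACTIONS = (-1, +1)

def env_step(s, a):
    s2 = min(max(s + a, 0), N_STATES - 1)
    return (1.0 if s2 == GOAL else 0.0), s2

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
buffer = []                # real transitions (s, a, r, s')
model = {}                 # learned deterministic model: (s, a) -> (r, s')

def q_update(s, a, r, s2, alpha=0.5, gamma=0.9):
    target = r if s2 == GOAL else r + gamma * max(Q[(s2, b)] for b in ACTIONS)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

for episode in range(20):
    s = 0
    for _ in range(200):   # step cap per episode
        if random.random() < 0.2:
            a = random.choice(ACTIONS)                  # explore
        else:                                           # greedy, random tie-break
            a = max(ACTIONS, key=lambda b: (Q[(s, b)], random.random()))
        r, s2 = env_step(s, a)
        buffer.append((s, a, r, s2))    # store the real transition
        model[(s, a)] = (r, s2)         # model learning
        q_update(s, a, r, s2)           # direct RL on the real transition
        for _ in range(10):             # planning: replay through the model
            ps, pa, _, _ = random.choice(buffer)
            pr, ps2 = model[(ps, pa)]
            q_update(ps, pa, pr, ps2)
        s = s2
        if s == GOAL:
            break

greedy = [max(ACTIONS, key=lambda b: Q[(s, b)]) for s in range(GOAL)]
print("greedy action per state:", greedy)
```

Each real step triggers ten simulated updates, so value information spreads along the chain with far fewer environment interactions than Q-learning alone would need.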

Implementation

Reference implementations

PETS (handful-of-trials)

DreamerV3 (official, JAX)

Implementation pitfalls

Model exploitation · Critical

The action optimizer finds state regions where the model is inaccurate and falsely predicts high reward.

Fix: Ensembles for uncertainty, horizon limits, uncertainty penalties, KL regularization as in Dreamer.
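One of the listed fixes, an uncertainty penalty, can be illustrated as follows. The function `penalized_return`, the `penalty_weight` parameter, and the toy numbers are assumptions for illustration; the penalty here is the ensemble's standard deviation over predicted next states, which is one of several possible disagreement measures.

```python
import numpy as np

def penalized_return(next_states, rewards, penalty_weight=1.0):
    """
    next_states: (n_members, n_candidates, state_dim) ensemble predictions
    rewards:     (n_members, n_candidates) predicted rewards
    Returns the mean predicted reward minus a disagreement penalty.
    """
    disagreement = next_states.std(axis=0).max(axis=-1)  # per candidate
    return rewards.mean(axis=0) - penalty_weight * disagreement

# Toy case with 3 ensemble members and 2 candidate actions.
# Candidate 0: modest reward, members agree.
# Candidate 1: high predicted reward, members disagree strongly.
next_states = np.array([[[0.0], [2.0]],
                        [[0.1], [6.0]],
                        [[-0.1], [-1.0]]])
rewards = np.array([[1.0, 3.0],
                    [1.0, 3.2],
                    [1.0, 2.8]])

naive_choice = int(np.argmax(rewards.mean(axis=0)))
safe_choice = int(np.argmax(penalized_return(next_states, rewards)))
print(naive_choice, safe_choice)  # prints "1 0"
```

Without the penalty the optimizer prefers the candidate whose high reward rests on an unreliable prediction; with it, the well-understood candidate wins.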

Compounding model error · High

Small one-step model errors accumulate along a rollout and, in the worst case, grow exponentially with the rollout horizon.

Fix: Short horizons, branched rollouts (MBPO), latent representations that stabilize dynamics.
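The growth of rollout error can be made precise with a standard worst-case argument. The symbols below are assumptions introduced for this derivation: f is the true dynamics, f̂ the learned model, ε a uniform bound on the one-step error, L a Lipschitz constant of f̂ in the state, and e_t the gap between the imagined and true state at step t under the same action sequence.

```latex
\begin{align*}
&\text{Assume } \|\hat f(s,a) - f(s,a)\| \le \varepsilon
 \text{ and } \|\hat f(s,a) - \hat f(s',a)\| \le L\,\|s - s'\|, \quad e_0 = 0. \\[4pt]
e_{t+1} &= \|\hat f(\hat s_t, a_t) - f(s_t, a_t)\| \\
        &\le \|\hat f(\hat s_t, a_t) - \hat f(s_t, a_t)\|
           + \|\hat f(s_t, a_t) - f(s_t, a_t)\| \\
        &\le L\, e_t + \varepsilon \\[4pt]
e_H &\le \varepsilon \sum_{k=0}^{H-1} L^{k}
     = \begin{cases}
         \varepsilon\,\dfrac{L^{H} - 1}{L - 1}, & L \ne 1 \\[6pt]
         \varepsilon\, H, & L = 1
       \end{cases}
\end{align*}
```

The bound is exponential in the horizon H only when L > 1; for L = 1 it is linear, and for L < 1 it stays below ε / (1 - L). This is why shortening H, as in short-horizon planning or branched rollouts, directly limits the damage.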

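Distribution shift · High

A model trained on data from earlier policies can be inaccurate in the states that the current policy visits, so rollouts generated under the current policy become unreliable.

Fix: Retrain the model iteratively on newly collected data after each policy update.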

Model class vs compute trade-off · Medium

GPs (PILCO) scale poorly to large datasets and high-dimensional observations; neural ensembles scale better but give only approximate epistemic uncertainty estimates.

Fix: Latent recurrent models (RSSM) for pixels, probabilistic ensembles for low-dimensional states, hybrid approaches.

Evolution

Original paper · 1991 · SIGART Bulletin / AAAI 1991 · Richard S. Sutton

Dyna, an Integrated Architecture for Learning, Planning, and Reacting

Richard S. Sutton

1991

Dyna — first integrated MBRL architecture

Inflection point

Sutton introduces Dyna, integrating model learning, planning, and acting in a single system.

2011

PILCO — probabilistic GP model

Deisenroth & Rasmussen show that a Gaussian Process dynamics model achieves record sample efficiency on control tasks.

PILCO: A Model-Based and Data-Efficient Approach to Policy Search (paper)

2018

PETS — probabilistic ensembles + CEM

Chua et al. establish a strong MBRL baseline with a probabilistic ensemble and CEM planning.

Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models (paper)

2019

PlaNet — first SoTA-level pixel MBRL

Inflection point

Hafner et al. introduce RSSM and demonstrate effective latent-space planning from raw pixels.

RSSM (concept)

2019

MuZero — MBRL without knowing the game rules

Inflection point

DeepMind shows that an agent that learns its own model can match AlphaZero in Go, chess, and shogi and reach state-of-the-art results on Atari, without access to the environment rules.

Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model (paper)

2019

MBPO — uncertainty-aware Dyna

Janner et al. match SAC performance with an order-of-magnitude fewer samples.

When to Trust Your Model: Model-Based Policy Optimization (paper)

2020

Dreamer / DreamerV2

Dreamer trains an actor-critic in imagination over an RSSM world model; DreamerV2 reaches human-level performance on the Atari benchmark using a single GPU.

RSSM (concept)

2022

TD-MPC: combining planning with value learning

Hansen et al. combine short-horizon MPC with a learned value function, reaching state-of-the-art results on DMC at the time.

Temporal Difference Learning for Model Predictive Control (paper)

2023

DreamerV3: universal MBRL

Inflection point

A single agent configuration with fixed hyperparameters achieves strong results on 150+ tasks (Atari, DMC, Minecraft, Crafter).

Mastering Diverse Domains through World Models (paper)

Sources

Dyna, an Integrated Architecture for Learning, Planning, and Reacting · Paper · ACM SIGART

PILCO: A Model-Based and Data-Efficient Approach to Policy Search · Paper · ICML 2011

Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models (PETS) · Paper · arXiv / NeurIPS 2018

Learning Latent Dynamics for Planning from Pixels (PlaNet) · Paper · arXiv / ICML 2019

Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model (MuZero) · Paper · arXiv / Nature 2020

When to Trust Your Model: Model-Based Policy Optimization (MBPO) · Paper · arXiv / NeurIPS 2019

Mastering Diverse Domains through World Models (DreamerV3) · Paper · arXiv

Temporal Difference Learning for Model Predictive Control (TD-MPC) · Paper · arXiv / ICML 2022
