The MBRL loop repeats three steps: (1) Data collection: the agent acts in the environment with an exploration policy and stores transitions (s, a, r, s'). (2) Model learning: a dynamics network (deterministic, probabilistic, ensemble, or latent, such as an RSSM) is trained on the collected data to predict s' and r. (3) Model usage: options include planning (CEM/MPC, or MCTS as in MuZero), training the policy on model rollouts (Dyna, Dreamer), or differentiating the policy objective directly through the model (analytic policy gradients, SVG, PILCO). The updated policy then collects more data, the model is retrained, and the cycle repeats. Key techniques: ensembles for uncertainty estimation (PETS), finite-horizon planning, regularization (for example, KL penalties) against model exploitation, and latent representations for high-dimensional observations (Dreamer, RSSM).
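The three steps can be sketched end to end on a toy problem. The following is a minimal illustration, not any published algorithm: the 1-D linear environment `true_step`, the least-squares model, the random-shooting planner, and all constants are assumptions chosen for brevity, and the reward is treated as known rather than learned.

```python
import random

def true_step(s, a):
    """Hypothetical toy environment: 1-D linear dynamics, reward for staying near 0."""
    s_next = 0.9 * s + 0.5 * a
    return s_next, -(s_next ** 2)

def fit_linear_model(buffer):
    """Step 2: least-squares fit of s' ~ w_s * s + w_a * a via the 2x2 normal equations."""
    sss = sum(s * s for s, a, r, s2 in buffer)
    saa = sum(a * a for s, a, r, s2 in buffer)
    ssa = sum(s * a for s, a, r, s2 in buffer)
    sy = sum(s * s2 for s, a, r, s2 in buffer)
    ay = sum(a * s2 for s, a, r, s2 in buffer)
    det = sss * saa - ssa * ssa
    return (sy * saa - ay * ssa) / det, (ay * sss - sy * ssa) / det

def plan_random_shooting(model, s, rng, horizon=5, n_candidates=64):
    """Step 3: score random action sequences under the model, return the best first action."""
    w_s, w_a = model
    best_ret, best_first = float("-inf"), 0.0
    for _ in range(n_candidates):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        sim, ret = s, 0.0
        for a in actions:
            sim = w_s * sim + w_a * a
            ret += -(sim ** 2)  # assumption: reward function is known, not learned
        if ret > best_ret:
            best_ret, best_first = ret, actions[0]
    return best_first

def mbrl_loop(iterations=3, steps=30, seed=0):
    rng = random.Random(seed)
    buffer, model, s = [], None, 2.0
    for _ in range(iterations):
        # Step 1: collect data (random exploration until a model exists).
        for _ in range(steps):
            if model is None:
                a = rng.uniform(-1.0, 1.0)
            else:
                a = plan_random_shooting(model, s, rng)
            s_next, r = true_step(s, a)
            buffer.append((s, a, r, s_next))
            s = s_next
        model = fit_linear_model(buffer)  # retrain on all data so far
    return model, buffer

model, buffer = mbrl_loop()
```

Because the toy environment is linear and noise-free, the fitted coefficients should recover the true dynamics almost exactly. With a neural model and a stochastic environment, the same loop structure applies but the fit is only approximate.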
Model-free RL often requires millions or even billions of environment interactions, which is infeasible for real robots and costly with expensive simulators. MBRL can greatly reduce the number of real samples needed by training the policy in "imagination" (on model-generated rollouts) or by planning with the learned model.
A neural network or other probabilistic model that learns the transition function f(s,a) → s'. It can be deterministic, probabilistic (Gaussian), an ensemble, or latent (RSSM).
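For the probabilistic (Gaussian) case, one common training objective is the negative log-likelihood of the observed next states. The form below is a sketch that assumes a diagonal-covariance Gaussian output. The symbols \mu_\theta, \sigma_\theta, the dataset \mathcal{D}, and the state dimension d are notation introduced here for illustration.

```latex
p_\theta(s' \mid s, a) = \mathcal{N}\!\big(s';\ \mu_\theta(s,a),\ \operatorname{diag}(\sigma_\theta^2(s,a))\big)

\mathcal{L}(\theta) = \sum_{(s,a,s') \in \mathcal{D}} \sum_{i=1}^{d}
  \left[ \frac{\big(s'_i - \mu_{\theta,i}(s,a)\big)^2}{2\,\sigma_{\theta,i}^2(s,a)}
         + \frac{1}{2}\log \sigma_{\theta,i}^2(s,a) \right] + \text{const}
```

A deterministic model corresponds to fixing the variance, in which case the objective reduces to mean squared error.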
A reward function usually learned jointly with dynamics; required for planning and imagination-based RL.
Decision-making component: a planner (CEM, MPPI, MCTS) or a trained policy (actor-critic in imagination, e.g. Dreamer).
Experience buffer used to train the model and often the policy as well (Dyna).
The action optimizer is drawn to state-action regions where the model is inaccurate and falsely predicts high reward, so plans that look good under the model can fail in the real environment.
Small one-step model errors compound along long rollouts, in the worst case exponentially in the rollout length.
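A standard worst-case argument makes the compounding explicit. The derivation assumes the one-step model error is bounded by \epsilon and that the learned model \hat f is L-Lipschitz in the state. Both assumptions, and the symbols e_t, \epsilon, and L, are introduced here for illustration. Let e_t be the gap between the imagined state \hat s_t and the true state s_t under the same action sequence, with e_0 = 0.

```latex
e_{t+1} = \big\| \hat f(\hat s_t, a_t) - f(s_t, a_t) \big\|
  \le \underbrace{\big\| \hat f(\hat s_t, a_t) - \hat f(s_t, a_t) \big\|}_{\le\, L\, e_t}
    + \underbrace{\big\| \hat f(s_t, a_t) - f(s_t, a_t) \big\|}_{\le\, \epsilon}

\Rightarrow\quad e_H \le \epsilon \sum_{k=0}^{H-1} L^k =
  \begin{cases}
    \epsilon\, \dfrac{L^H - 1}{L - 1}, & L \neq 1 \\[2ex]
    \epsilon\, H, & L = 1
  \end{cases}
```

The bound grows exponentially in the horizon H when L > 1, linearly when L = 1, and stays bounded when L < 1.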
A model trained on data gathered by earlier policies can be inaccurate on the states the current policy visits, so its rollouts under that policy are unreliable.
GPs (PILCO) scale poorly to high-dimensional observations and large datasets; neural ensembles are cheaper but give cruder estimates of epistemic uncertainty.
Sutton introduces Dyna, integrating model learning, planning, and acting in a single system.
Deisenroth & Rasmussen show that a Gaussian Process dynamics model achieves record sample efficiency on control tasks.
Chua et al. establish a strong MBRL baseline with a probabilistic ensemble and CEM planning.
Hafner et al. introduce RSSM and demonstrate effective latent-space planning from raw pixels.
DeepMind shows with MuZero that an agent learning its own model matches AlphaZero in Go, chess, and shogi and sets a new state of the art on Atari, without access to environment rules.
Janner et al. match SAC performance with an order of magnitude fewer samples.
DreamerV2: actor-critic training in imagination over an RSSM reaches human-level Atari performance on a single GPU.
DreamerV3: a single MBRL agent configuration achieves strong results on 150+ tasks (Atari, DMC, Minecraft, Crafter) without per-task tuning.
Hansen et al. combine short-horizon MPC with a learned value function, reaching SoTA on DMC.
Number of forward-planned steps. Too long → model-error compounding; too short → myopic behavior.
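The horizon trade-off can be written down as a planning objective. The form below is a generic sketch: the discount \gamma, the learned reward \hat r, the learned dynamics \hat f, and the optional terminal value estimate \hat V are notation assumed here. A terminal value term is one way to keep H short, limiting error compounding, while still accounting for rewards beyond the horizon.

```latex
J_H(a_{0:H-1}) = \sum_{t=0}^{H-1} \gamma^t\, \hat r(s_t, a_t) + \gamma^H\, \hat V(s_H),
\qquad s_{t+1} = \hat f(s_t, a_t)
```

Without the \hat V term the planner ignores everything after step H, which is the source of the myopic behavior noted above.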
Choice between deterministic, probabilistic, ensemble, GP, or latent (RSSM) models.
Ratio of real to synthetic samples in policy training. Crucial in Dyna/MBPO-style methods.
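As an illustration, the ratio can be implemented as a fixed split of each training batch between a real buffer and a model-generated buffer. The function name `sample_mixed_batch`, the buffer contents, and the ratio value are hypothetical and not taken from any specific codebase.

```python
import random

def sample_mixed_batch(real_buffer, model_buffer, batch_size, real_ratio, rng):
    """Draw a batch with a fixed fraction of real transitions; the rest are synthetic."""
    n_real = round(batch_size * real_ratio)
    n_model = batch_size - n_real
    # Sampling with replacement from each buffer.
    batch = rng.choices(real_buffer, k=n_real) + rng.choices(model_buffer, k=n_model)
    rng.shuffle(batch)  # avoid ordering the batch by data source
    return batch

# Hypothetical buffers: each entry is tagged with its source for inspection.
real_buffer = [("real", i) for i in range(100)]
model_buffer = [("model", i) for i in range(1000)]

batch = sample_mixed_batch(real_buffer, model_buffer, batch_size=20,
                           real_ratio=0.25, rng=random.Random(0))
n_real_in_batch = sum(1 for source, _ in batch if source == "real")
```

A lower `real_ratio` leans harder on the model, which raises sample efficiency but also exposure to model error.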
Number of models in the ensemble (typically 5 to 10 in PETS-style methods). It affects the quality of the uncertainty estimate.
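The link between ensemble size and uncertainty can be illustrated with a scalar stand-in for a neural ensemble. In this sketch each member is a one-parameter linear model fitted on a bootstrap resample, and disagreement between members serves as the epistemic uncertainty signal. The data, the model class, and all names are assumptions made for illustration.

```python
import random
import statistics

def fit_slope(data):
    """Least-squares slope for s' ~ w * s (no intercept)."""
    return sum(s * s2 for s, s2 in data) / sum(s * s for s, s2 in data)

def train_ensemble(data, n_models=5, seed=0):
    rng = random.Random(seed)
    # Each member is fitted on a bootstrap resample of the same dataset.
    return [fit_slope([rng.choice(data) for _ in data]) for _ in range(n_models)]

def predict(ensemble, s):
    """Return the mean prediction and the member disagreement (standard deviation)."""
    preds = [w * s for w in ensemble]
    return statistics.mean(preds), statistics.pstdev(preds)

# Hypothetical training data: states only in [-1, 1], noisy linear dynamics.
data_rng = random.Random(1)
data = []
for _ in range(50):
    s = data_rng.uniform(-1.0, 1.0)
    data.append((s, 0.8 * s + data_rng.gauss(0.0, 0.05)))

ensemble = train_ensemble(data)
_, std_in = predict(ensemble, 0.5)    # inside the training range
_, std_out = predict(ensemble, 10.0)  # far outside the training range
```

Disagreement is larger for the out-of-range query than for the in-range one, which is the signal a planner can use to avoid or penalize poorly modeled regions. More members give a less noisy disagreement estimate at proportionally higher cost.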
CEM / MPPI / MCTS / random shooting / actor-critic in imagination.
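Of these, CEM is compact enough to sketch in full. The version below is a minimal illustration under assumed settings: scalar actions clipped to [-1, 1], a hypothetical linear model and quadratic reward, and arbitrary population and elite sizes. Practical implementations typically batch the rollouts and often add smoothing of the mean and variance updates.

```python
import random
import statistics

def rollout_return(model, reward, s, actions):
    """Sum of predicted rewards along an imagined rollout."""
    total = 0.0
    for a in actions:
        s = model(s, a)
        total += reward(s, a)
    return total

def cem_plan(model, reward, s0, horizon=5, pop=100, n_elite=10, iters=5, seed=0):
    rng = random.Random(seed)
    mean = [0.0] * horizon
    std = [1.0] * horizon
    for _ in range(iters):
        # Sample action sequences from the current per-step Gaussians.
        candidates = [
            [max(-1.0, min(1.0, rng.gauss(m, sd))) for m, sd in zip(mean, std)]
            for _ in range(pop)
        ]
        # Keep the sequences with the highest predicted return.
        candidates.sort(key=lambda acts: rollout_return(model, reward, s0, acts),
                        reverse=True)
        elites = candidates[:n_elite]
        # Refit the sampling distribution to the elites.
        mean = [statistics.mean(col) for col in zip(*elites)]
        std = [statistics.pstdev(col) + 1e-6 for col in zip(*elites)]
    return mean  # in MPC, only mean[0] is executed before replanning

# Hypothetical learned model and reward, used only for this illustration.
model = lambda s, a: 0.9 * s + 0.5 * a
reward = lambda s, a: -(s ** 2)

action_plan = cem_plan(model, reward, s0=2.0)
```

Starting from s0 = 2.0, the planned first action should be negative, pushing the state toward zero. Random shooting corresponds to a single iteration of this loop with no refitting.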
The full dynamics, reward, and policy/planner networks are active at each planning step.
Model training is fully batch-parallel. Imagination rollouts can be parallelized across the batch. The sequential bottlenecks are the real environment step and the inner recurrence step of the model.
A single A100/V100-class GPU suffices for model training, policy training, and imagination rollouts, even on complex Dreamer/TD-MPC tasks.
Classical MBRL with CEM/MPPI planners is often run on CPUs with massively parallel sample-based search.
DreamerV3 in JAX/XLA scales efficiently on TPU.