Architecture

RSSM

2019 · Active · Published: 8 June 2026 · Updated: 8 June 2026

Key innovation

Combines a deterministic recurrent path (GRU) with a stochastic latent variable inside a single latent dynamics model, enabling image-based model-based RL and latent-space planning (PlaNet, Dreamer).

How it works

At each step t the encoder maps the observation o_t to features e_t. The deterministic path computes h_t = GRU(h_{t-1}, [s_{t-1}, a_{t-1}]). The stochastic path has two heads: a prior p(s_t | h_t), used during imagination and planning, and a posterior q(s_t | h_t, e_t), used when training on real observations. A decoder reconstructs o_t from [h_t, s_t], and separate heads predict the reward and (in Dreamer) the value and policy. Training maximizes an ELBO: the log-likelihood of observations and rewards minus the KL divergence between posterior and prior. Policy and value are then learned on imagined trajectories generated by the prior (latent-space rollouts).
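Written out, the objective described above looks roughly as follows. This is a sketch in assumed notation: f stands for the GRU, enc for the encoder, and T for the sequence length; the expectation is taken over posterior samples of the latent states.

```latex
\begin{aligned}
h_t &= f(h_{t-1}, s_{t-1}, a_{t-1}), \qquad e_t = \mathrm{enc}(o_t), \\
\mathcal{L} &= \mathbb{E}_{q}\Big[\sum_{t=1}^{T} \Big(
    \ln p(o_t \mid h_t, s_t) + \ln p(r_t \mid h_t, s_t)
    - \mathrm{KL}\big(q(s_t \mid h_t, e_t)\,\|\,p(s_t \mid h_t)\big)
  \Big)\Big].
\end{aligned}
```

The first two terms reward accurate reconstruction of observations and rewards; the KL term pulls the prior and posterior toward each other so that rollouts from the prior stay close to what the posterior would have inferred.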

Problem solved

Purely deterministic recurrent models struggle to represent environment stochasticity, while purely stochastic models tend to lose information across long horizons. RSSM combines both to obtain stable long-range memory (h_t) and explicit uncertainty modeling (s_t), enabling effective latent-space planning and policy learning from pixels.

Components

Deterministic recurrent state (GRU): Long-range deterministic memory

GRU hidden state updated as h_t = GRU(h_{t-1}, [s_{t-1}, a_{t-1}]). Provides a stable information flow over time.

Official
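The GRU update can be sketched in plain Python as below. This is a minimal illustration, not the reference code: the gate convention shown is one common variant, the parameter names (Wz, Uz, bz, ...) and the zero_params helper are hypothetical, and the official implementations add further layers around the recurrent cell.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(W, U, b, x, h):
    """Compute W x + U h + b, one entry per hidden unit."""
    return [p + q + c for p, q, c in zip(matvec(W, x), matvec(U, h), b)]


def gru_step(p, h_prev, s_prev, a_prev):
    """One deterministic update h_t = GRU(h_{t-1}, [s_{t-1}, a_{t-1}])."""
    x = list(s_prev) + list(a_prev)  # concatenate latent and action
    z = [sigmoid(v) for v in affine(p["Wz"], p["Uz"], p["bz"], x, h_prev)]  # update gate
    r = [sigmoid(v) for v in affine(p["Wr"], p["Ur"], p["br"], x, h_prev)]  # reset gate
    rh = [ri * hi for ri, hi in zip(r, h_prev)]
    n = [math.tanh(v) for v in affine(p["Wn"], p["Un"], p["bn"], x, rh)]  # candidate state
    return [(1.0 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h_prev)]


def zero_params(h_dim, x_dim):
    """All-zero parameters (hypothetical helper, for illustration only)."""
    p = {}
    for g in "zrn":
        p["W" + g] = [[0.0] * x_dim for _ in range(h_dim)]
        p["U" + g] = [[0.0] * h_dim for _ in range(h_dim)]
        p["b" + g] = [0.0] * h_dim
    return p


h_next = gru_step(zero_params(2, 3), [1.0, -2.0], [0.0, 0.0], [0.0])
```

With all-zero parameters both gates sit at 0.5 and the candidate state is 0, so the update simply halves the previous state. With trained weights the update gate decides how much of h_{t-1} is carried forward, which is what gives the deterministic path its long-range memory.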

Stochastic latent state: Environment uncertainty representation

Stochastic latent variable drawn from a conditional distribution (diagonal Gaussian in PlaNet and DreamerV1, categorical in DreamerV2/V3), representing the part of the state that cannot be predicted deterministically from h_t.

Gaussian latent (PlaNet, DreamerV1): Diagonal Gaussian distribution.

Categorical latent (DreamerV2, DreamerV3): 32 categorical variables with 32 classes each, straight-through gradients.

Official
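The straight-through gradients used with the categorical latent can be illustrated without an autodiff library by writing the forward and backward passes by hand. This is a sketch under assumptions: the function names are hypothetical, and in an autodiff framework the same idea is usually expressed as sample + probs - stop_gradient(probs).

```python
import math
import random


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def st_forward(logits, rng):
    """Forward pass: draw a one-hot sample; also return probs for the backward pass."""
    probs = softmax(logits)
    idx = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    one_hot = [1.0 if i == idx else 0.0 for i in range(len(probs))]
    return one_hot, probs


def st_backward(grad_sample, probs):
    """Backward pass: treat the sample as if it were `probs`, then apply the softmax VJP."""
    dot = sum(g * p for g, p in zip(grad_sample, probs))
    return [p * (g - dot) for g, p in zip(grad_sample, probs)]


rng = random.Random(0)
logits = [[0.0] * 32 for _ in range(32)]            # 32 variables x 32 classes
latent = [st_forward(row, rng)[0] for row in logits]
flat = [v for one_hot in latent for v in one_hot]   # 1024-dim vector with 32 ones

# Gradient reaching the logits of a 3-class variable when the loss
# gradient with respect to the sample is [1, 0, 0].
grad_logits = st_backward([1.0, 0.0, 0.0], softmax([0.0, 0.0, 0.0]))
```

The forward value stays a discrete one-hot vector, while the gradient flows through the class probabilities as if no sampling had happened.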

Transition prior: Predict s_t without observation

Network p(s_t | h_t) used during imagination/planning when no real observation is available.
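An imagination rollout driven only by the prior might look like the sketch below. The learned networks are replaced by toy scalar functions (toy_policy, toy_gru, toy_prior are hypothetical stand-ins), and the horizon of 15 is an arbitrary illustrative value.

```python
import random


def imagine(h, s, policy, gru, prior_sample, horizon):
    """Roll the model forward using only the prior; no observations are consumed."""
    trajectory = []
    for _ in range(horizon):
        a = policy(h, s)       # a_{t-1} chosen from the current latent state
        h = gru(h, s, a)       # h_t = GRU(h_{t-1}, [s_{t-1}, a_{t-1}])
        s = prior_sample(h)    # s_t ~ p(s_t | h_t)
        trajectory.append((h, s, a))
    return trajectory


# Toy scalar stand-ins for the learned networks (hypothetical, illustration only).
rng = random.Random(0)
toy_policy = lambda h, s: -0.1 * (h + s)
toy_gru = lambda h, s, a: 0.9 * h + 0.1 * (s + a)
toy_prior = lambda h: h + 0.05 * rng.gauss(0.0, 1.0)

rollout = imagine(1.0, 0.0, toy_policy, toy_gru, toy_prior, horizon=15)
```

Each step needs only the previous latent state and an action, which is why planning and policy learning can run entirely in latent space.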

Representation posterior: Infer s_t from observation

Network q(s_t | h_t, e_t) using observation features e_t = encoder(o_t). Used during training.
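For the Gaussian variant, the relationship between the two heads can be sketched as a reparameterized posterior sample plus the closed-form KL divergence to the prior. The means and standard deviations below are made-up constants standing in for network outputs, and the function names are hypothetical.

```python
import math
import random


def sample_diag_gaussian(mean, std, rng):
    """Reparameterized sample: mean + std * eps with eps ~ N(0, I)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]


def kl_diag_gaussian(mean_q, std_q, mean_p, std_p):
    """KL(q || p) between two diagonal Gaussians, summed over dimensions."""
    total = 0.0
    for mq, sq, mp, sp in zip(mean_q, std_q, mean_p, std_p):
        total += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2.0 * sp ** 2) - 0.5
    return total


rng = random.Random(0)
prior_mean, prior_std = [0.0, 0.0], [1.0, 1.0]  # would come from p(s_t | h_t)
post_mean, post_std = [1.0, 0.0], [1.0, 1.0]    # would come from q(s_t | h_t, e_t)

s_t = sample_diag_gaussian(post_mean, post_std, rng)  # training uses the posterior sample
kl = kl_diag_gaussian(post_mean, post_std, prior_mean, prior_std)
kl_same = kl_diag_gaussian(prior_mean, prior_std, prior_mean, prior_std)
```

The KL is zero when the two heads agree and grows as the observation pulls the posterior away from what the prior predicted from h_t alone.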

Observation encoder: Maps observations (e.g. images) to features

CNN for pixels or MLP for low-dim states; output is fed to the posterior.

Observation decoder: Reconstructs observations from [h_t, s_t]

Transposed CNN reconstructing o_t; reconstruction loss shapes the latent representation.

Reward head: Predicts reward r_t from [h_t, s_t]

MLP predicting the scalar reward for the current state.

Implementation

Reference implementations

PlaNet (official, TensorFlow)

Python

Official

Dreamer (official, TensorFlow)

DreamerV3 (official, JAX)

Python (JAX)

Official

Implementation pitfalls

Posterior collapse (severity: High)

Without free nats or KL balancing, the posterior over s_t can collapse onto the prior, so s_t stops carrying information about the observation and the model loses the ability to represent it.

Fix: Free nats (PlaNet) or KL balancing (DreamerV2+).
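Free nats amount to clipping the KL term from below, so that the KL contributes no gradient until it exceeds a threshold. A minimal sketch, with a hypothetical function name and an illustrative threshold of 3.0:

```python
def free_nats_kl(kl_value, free_nats):
    """Return (clipped KL term, derivative of that term with respect to the raw KL)."""
    if kl_value <= free_nats:
        return free_nats, 0.0  # below the threshold: constant, so no gradient
    return kl_value, 1.0       # above the threshold: ordinary KL penalty


below = free_nats_kl(1.2, free_nats=3.0)
above = free_nats_kl(4.5, free_nats=3.0)
```

Because the penalty is flat below the threshold, the posterior is free to encode up to that many nats of information about the observation before the KL starts pushing it back toward the prior.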

Prior–posterior mismatch (severity: High)

If imagined rollouts use a prior that diverges from the posterior, policies trained in imagination fail to transfer to the environment.

Fix: KL balancing that pushes updates more strongly into the prior; symlog/return normalization in DreamerV3.
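KL balancing is commonly written with stop-gradients as alpha * KL(sg(q) || p) + (1 - alpha) * KL(q || sg(p)). Its effect on gradients can be shown by hand for the simplest case. The sketch below assumes one-dimensional, unit-variance Gaussians, where KL(q || p) = 0.5 * (mu_q - mu_p)^2; the function name is hypothetical and alpha = 0.8 is used as an illustrative setting.

```python
def balanced_kl_grads(mu_q, mu_p, alpha=0.8):
    """Loss value and gradients under KL balancing for unit-variance 1-D Gaussians."""
    kl = 0.5 * (mu_q - mu_p) ** 2
    # alpha * KL(sg(q) || p): only the prior mean receives this gradient.
    grad_prior = alpha * (mu_p - mu_q)
    # (1 - alpha) * KL(q || sg(p)): only the posterior mean receives this gradient.
    grad_post = (1.0 - alpha) * (mu_q - mu_p)
    return kl, grad_prior, grad_post


kl, grad_prior, grad_post = balanced_kl_grads(mu_q=1.0, mu_p=0.0, alpha=0.8)
```

The loss value is unchanged, but with alpha above 0.5 the prior is moved toward the posterior faster than the posterior is regularized toward the prior.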

Categorical latent training stability (DreamerV2/V3) (severity: Medium)

Gradients through the categorical latent require a straight-through estimator; careless implementations break gradient scaling.

Fix: Use the straight-through estimator as implemented in the DreamerV2/V3 reference code (one-hot sample plus class probabilities minus stop-gradient of the probabilities).

Pixel reconstruction dominance (severity: Medium)

Image reconstruction loss can dominate the reward signal, yielding representations that are not useful for the policy.

Fix: Loss-scale weighting and normalization as in DreamerV3; alternatively contrastive RSSM (DreamerPro).
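One of the DreamerV3 normalization tools, the symlog transform named in the prior–posterior mismatch fix, compresses large-magnitude targets while staying close to the identity near zero. A minimal sketch, assuming scalar inputs:

```python
import math


def symlog(x):
    """sign(x) * ln(|x| + 1): compresses large magnitudes symmetrically."""
    return math.copysign(math.log1p(abs(x)), x)


def symexp(x):
    """Inverse of symlog: sign(x) * (exp(|x|) - 1)."""
    return math.copysign(math.expm1(abs(x)), x)


compressed = symlog(-1000.0)      # magnitude shrinks from 1000 to roughly 6.9
recovered = symexp(compressed)    # maps back to approximately -1000
```

Predicting targets in symlog space keeps loss magnitudes comparable across tasks whose rewards or observations differ by orders of magnitude, which is one way to avoid hand-tuning loss scales per environment.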

Evolution

Original paper · 2019 · ICML 2019 · Danijar Hafner

Learning Latent Dynamics for Planning from Pixels

Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, James Davidson

2018

PlaNet — RSSM preprint

Inflection point

Hafner et al. release the preprint "Learning Latent Dynamics for Planning from Pixels", introducing RSSM and latent-space CEM planning.

Learning Latent Dynamics for Planning from Pixels (paper)

2019

Dreamer (DreamerV1)

Inflection point

Hafner et al. replace CEM planning with actor-critic learning in imagination over RSSM, launching the Dreamer family.

Dream to Control: Learning Behaviors by Latent Imagination (paper)

2020

DreamerV2 — categorical latent + KL balancing

Inflection point

Replacing the Gaussian with 32×32 categorical latents and adding KL balancing enables human-level Atari performance on a single GPU.

Mastering Atari with Discrete World Models (paper)

2023

DreamerV3 — universal hyperparameters

Inflection point

A single RSSM configuration achieves strong results on 150+ tasks (DMC, Atari, Minecraft, Crafter) without per-task tuning.

Mastering Diverse Domains through World Models (paper)

2023

DreamerV3 — first from-scratch diamond collection in Minecraft

DreamerV3 with RSSM is the first algorithm to autonomously collect a diamond in Minecraft without human data or curriculum.

Sources

Learning Latent Dynamics for Planning from Pixels (PlaNet)

Paper

arXiv / ICML 2019

Dream to Control: Learning Behaviors by Latent Imagination (DreamerV1)

Paper

arXiv / ICLR 2020

Mastering Atari with Discrete World Models (DreamerV2)

Paper

arXiv / ICLR 2021

Mastering Diverse Domains through World Models (DreamerV3)

Paper

arXiv

DreamerV3 official code

Repository

GitHub
