At each step t the encoder maps the observation o_t to features e_t. The deterministic path computes h_t = GRU(h_{t-1}, [s_{t-1}, a_{t-1}]). The stochastic path keeps two heads: a prior p(s_t | h_t), used during imagination and planning, and a posterior q(s_t | h_t, e_t), used whenever a real observation is available, which includes training. A decoder reconstructs o_t from [h_t, s_t], and separate heads predict the reward and, in Dreamer, the value and policy. Training maximizes an ELBO: the log-likelihood of observations and rewards minus the KL divergence from the posterior to the prior. The policy and value are learned on imagined trajectories generated by the prior (latent-space rollouts); in Dreamer this alternates with world-model updates rather than waiting for the model to finish training.
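The data flow described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the reference implementation: the class name `TinyRSSM`, the dimensions, the random untrained weights, the bias-free GRU, and the single-linear-layer Gaussian heads are all hypothetical choices made to keep the example short.

```python
import math
import random


def linear(rng, n_in, n_out):
    """Random weight matrix with n_out rows and n_in columns (biases omitted)."""
    scale = 1.0 / math.sqrt(n_in)
    return [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]


def matvec(w, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


class TinyRSSM:
    """Hypothetical toy RSSM: GRU path for h_t, Gaussian prior and posterior for s_t."""

    def __init__(self, h_dim=8, s_dim=4, a_dim=2, e_dim=6, seed=0):
        self.rng = random.Random(seed)
        self.h_dim, self.s_dim = h_dim, s_dim
        gru_in = s_dim + a_dim + h_dim
        self.wz = linear(self.rng, gru_in, h_dim)  # update gate
        self.wr = linear(self.rng, gru_in, h_dim)  # reset gate
        self.wn = linear(self.rng, gru_in, h_dim)  # candidate state
        # Each head outputs a mean and a log-std per latent dimension.
        self.w_prior = linear(self.rng, h_dim, 2 * s_dim)
        self.w_post = linear(self.rng, h_dim + e_dim, 2 * s_dim)

    def gru(self, h, x):
        z = [sigmoid(v) for v in matvec(self.wz, x + h)]
        r = [sigmoid(v) for v in matvec(self.wr, x + h)]
        gated = [ri * hi for ri, hi in zip(r, h)]
        n = [math.tanh(v) for v in matvec(self.wn, x + gated)]
        return [(1.0 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]

    def _gaussian(self, w, inp):
        out = matvec(w, inp)
        mean = out[: self.s_dim]
        std = [math.exp(v) for v in out[self.s_dim:]]
        return mean, std

    def _sample(self, mean, std):
        return [m + s * self.rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]

    def step(self, h_prev, s_prev, a_prev, e=None):
        """One step. With features e, sample s_t from the posterior; otherwise from the prior."""
        h = self.gru(h_prev, s_prev + a_prev)  # h_t = GRU(h_{t-1}, [s_{t-1}, a_{t-1}])
        prior = self._gaussian(self.w_prior, h)  # p(s_t | h_t)
        if e is None:
            return h, self._sample(*prior), prior, None
        post = self._gaussian(self.w_post, h + e)  # q(s_t | h_t, e_t)
        return h, self._sample(*post), prior, post


model = TinyRSSM()
h0, s0 = [0.0] * model.h_dim, [0.0] * model.s_dim

# Observed step: encoder features are available, so the posterior is used.
h1, s1, prior1, post1 = model.step(h0, s0, [1.0, 0.0], e=[0.1] * 6)

# Imagination: no observations, so every step samples from the prior.
imagined, h, s = [], h1, s1
for _ in range(15):
    h, s, _, _ = model.step(h, s, [0.0, 1.0])
    imagined.append((h, s))
```

The same `step` function serves both modes; the only difference is whether encoder features are passed in, which mirrors the split between the posterior and prior heads.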
Purely deterministic recurrent models struggle to represent environment stochasticity, while purely stochastic models tend to lose information across long horizons. RSSM combines both to obtain stable long-range memory (h_t) and explicit uncertainty modeling (s_t), enabling effective latent-space planning and policy learning from pixels.
GRU hidden state updated as h_t = GRU(h_{t-1}, [s_{t-1}, a_{t-1}]). Provides a stable information flow over time.
Stochastic latent variable drawn from a conditional distribution (a diagonal Gaussian in PlaNet and DreamerV1, categorical in DreamerV2/V3) that captures the uncertain part of the environment state.
Network p(s_t | h_t) used during imagination/planning when no real observation is available.
Network q(s_t | h_t, e_t) using observation features e_t = encoder(o_t). Used during training.
CNN for pixels or MLP for low-dim states; output is fed to the posterior.
Transposed CNN reconstructing o_t; reconstruction loss shapes the latent representation.
MLP predicting the scalar reward for the current state.
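With these components in place, the training objective can be written out term by term. The following is a sketch in LaTeX using the symbols of this section; r_t for the reward and T for the sequence length are notation introduced here, and the expectation is taken over posterior samples of s_t:

```latex
\mathcal{L}_{\mathrm{ELBO}}
  = \sum_{t=1}^{T} \Big(
      \mathbb{E}_{q(s_t \mid h_t, e_t)}\big[
        \ln p(o_t \mid h_t, s_t) + \ln p(r_t \mid h_t, s_t)
      \big]
      - \mathrm{KL}\big[\, q(s_t \mid h_t, e_t) \,\big\|\, p(s_t \mid h_t) \,\big]
    \Big)
```

The first term is the decoder and reward-head reconstruction, and the second is the KL divergence that pulls the posterior and the prior toward each other.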
Without free nats or KL balancing, the posterior q(s_t | h_t, e_t) can collapse onto the prior, so s_t stops carrying information about the observations.
If imagined rollouts use a prior that diverges from the posterior, policies trained in imagination fail to transfer to the environment.
Gradients through the categorical latent require a straight-through estimator; careless implementations break gradient scaling.
Image reconstruction loss can dominate the reward signal, yielding representations that are not useful for the policy.
Hafner et al. release the preprint "Learning Latent Dynamics for Planning from Pixels", introducing RSSM and latent-space CEM planning.
Hafner et al. replace CEM planning with actor-critic learning in imagination over RSSM, launching the Dreamer family.
Replacing the Gaussian with 32×32 categorical latents and adding KL balancing enables human-level Atari performance on a single GPU.
A single RSSM configuration achieves strong results on 150+ tasks (DMC, Atari, Minecraft, Crafter) without per-task tuning.
DreamerV3 with RSSM is the first algorithm to autonomously collect a diamond in Minecraft without human data or curriculum.
Size of the GRU hidden vector (e.g. 200 in PlaNet and DreamerV1, 600 in DreamerV2 on Atari, up to 4096 in the largest DreamerV3 model).
Gaussian dimensionality or categorical configuration (e.g. 32×32 in DreamerV2/V3).
KL balancing coefficient that weights how strongly the KL term updates the prior versus the posterior (DreamerV2+; e.g. 0.8 toward the prior in DreamerV2).
Threshold (free nats) below which the KL term is not penalized; it helps prevent posterior collapse.
Length of imagined rollouts used to train the policy (e.g. 15 in Dreamer).
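The KL balancing coefficient and the free-nats threshold interact in the KL loss, and a small numeric sketch may make the effect concrete. The example below assumes one-dimensional Gaussian latents, a threshold of 1.0 nat applied to each term separately, and central finite differences in place of autodiff; the "frozen" arguments play the role of a stop-gradient. All function names are hypothetical, and the placement of the clipping is one possible choice rather than the definitive one.

```python
import math


def gaussian_kl(q, p):
    """KL[N(mq, sq^2) || N(mp, sp^2)], summed over dimensions, in nats."""
    (mq, sq), (mp, sp) = q, p
    return sum(
        math.log(d / c) + (c * c + (a - b) ** 2) / (2.0 * d * d) - 0.5
        for a, c, b, d in zip(mq, sq, mp, sp)
    )


def free_nats_clip(kl, free_nats=1.0):
    """Below the threshold the loss is constant, so it yields no gradient."""
    return max(kl, free_nats)


def balanced_kl(post, prior, post_frozen, prior_frozen, alpha=0.8, free_nats=1.0):
    """alpha weights the term that trains the prior, 1 - alpha the posterior term."""
    prior_term = free_nats_clip(gaussian_kl(post_frozen, prior), free_nats)
    post_term = free_nats_clip(gaussian_kl(post, prior_frozen), free_nats)
    return alpha * prior_term + (1.0 - alpha) * post_term


def central_diff(f, x, eps=1e-6):
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)


post = ([2.0], [0.5])   # (means, stds) of the posterior
prior = ([0.0], [1.0])  # (means, stds) of the prior

kl_value = gaussian_kl(post, prior)  # about 2.318 nats, above the threshold

# Gradient w.r.t. the prior mean: only the alpha-weighted term depends on it.
grad_prior_mean = central_diff(
    lambda m: balanced_kl(post, ([m], [1.0]), post, prior), 0.0
)  # approximately -1.6

# Gradient w.r.t. the posterior mean: only the (1 - alpha)-weighted term depends on it.
grad_post_mean = central_diff(
    lambda m: balanced_kl(([m], [0.5]), prior, post, prior), 2.0
)  # approximately 0.4

# A posterior already close to the prior falls under the threshold and is clipped.
small_kl = gaussian_kl(([0.1], [1.0]), ([0.0], [1.0]))  # about 0.005 nats
clipped = free_nats_clip(small_kl)
```

With alpha = 0.8 the prior receives a gradient four times larger in magnitude than the posterior, so the prior is moved toward the posterior faster than the posterior is regularized toward the prior.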
All heads (deterministic, prior, posterior, decoder, reward) are evaluated at every training step; imagination rollouts skip the posterior and the decoder.
The update h_t = GRU(h_{t-1}, ...) is inherently sequential in time. Within a batch, rollouts can be parallelized, but the time axis cannot be trivially parallelized (unlike attention in Transformers).
The networks are relatively small and the imagination rollouts are heavily batched, so a single V100/A100-class GPU is enough to train DreamerV2/V3.
The reference DreamerV3 implementation in JAX/XLA scales well on TPU.