Genie 2 is a foundation world model developed by Google DeepMind, introduced on 4 December 2024. The model generates action-controllable, playable 3D environments from a single prompt image (for example, one generated by Imagen 3). The resulting environments can be played by a human or an AI agent using keyboard and mouse inputs.
Architecture
Genie 2 is an autoregressive latent-diffusion model trained on a large video dataset. Video frames are first encoded by an autoencoder, and the resulting latent frames are fed to a large transformer dynamics model trained with a causal mask, analogous to that used by large language models. At inference time, Genie 2 is sampled autoregressively: each new latent frame is generated from the current action and the past latent frames, one frame at a time. Classifier-free guidance is used to improve action controllability.
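The sampling procedure described above can be illustrated with a minimal sketch. It is not DeepMind's implementation, which has not been published. The denoiser below is a toy stand-in, and all names (toy_denoiser, guided_noise, sample_next_latent, rollout), the guidance scale, the step count and the context window are assumptions chosen for illustration. The sketch shows only the two ideas named in the text: classifier-free guidance, which combines an action-conditioned and an action-free prediction, and frame-by-frame autoregressive rollout over past latents.

```python
import numpy as np


def toy_denoiser(noisy_latent, past_latents, action):
    """Hypothetical stand-in for the dynamics model (NOT the real network).

    Returns a noise estimate for `noisy_latent`. Passing action=None gives
    the unconditional branch, i.e. the prediction with the action dropped.
    """
    context = np.mean(past_latents, axis=0)            # summary of past frames
    action_shift = 0.0 if action is None else 0.1 * action
    return noisy_latent - (context + action_shift)


def guided_noise(noisy_latent, past_latents, action, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction towards the action-conditioned one."""
    eps_uncond = toy_denoiser(noisy_latent, past_latents, None)
    eps_cond = toy_denoiser(noisy_latent, past_latents, action)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)


def sample_next_latent(past_latents, action, rng, steps=8, guidance_scale=2.0):
    """Generate one latent frame by iteratively removing the guided noise."""
    x = rng.standard_normal(past_latents[0].shape)     # start from pure noise
    for _ in range(steps):
        eps = guided_noise(x, past_latents, action, guidance_scale)
        x = x - 0.5 * eps                              # toy denoising update
    return x


def rollout(first_latent, actions, seed=0, window=4):
    """Autoregressive sampling: each new frame depends on the current action
    and on a window of previously generated latent frames."""
    rng = np.random.default_rng(seed)
    latents = [first_latent]
    for action in actions:
        context = np.stack(latents[-window:])
        latents.append(sample_next_latent(context, action, rng))
    return np.stack(latents)


if __name__ == "__main__":
    frames = rollout(np.zeros(4), actions=[1.0, 1.0, -1.0])
    print(frames.shape)
```

With a guidance scale of 1 the guided prediction equals the conditional one, and with a scale of 0 it equals the unconditional one. Scales above 1 push samples further in the direction implied by the action, which is the general mechanism by which guidance can strengthen controllability.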
Capabilities
Genie 2 produces consistent worlds for up to about one minute (most demonstration examples last 10 to 20 seconds) and exhibits a range of emergent properties: identifying the controllable character in a scene, generating counterfactual trajectories from the same starting frame, long-horizon memory (correctly rendering regions briefly out of view), character animation, NPC modelling, physics effects (water, smoke, gravity), point and directional lighting, reflections, bloom, and object interactions with proper affordances (opening doors, popping balloons). The model also works with real-world photographs as prompts.
Research applications
Genie 2 can be used to generate an effectively unlimited curriculum of novel worlds for training and evaluating embodied agents. In the DeepMind release, the SIMA agent was shown navigating environments that Genie 2 synthesised from a single prompt image. SIMA was controlled via natural-language instructions, and Genie 2 acted as a frame-by-frame simulator responding to SIMA's actions. The model also enables rapid prototyping of scenes and visual concepts by artists and designers.
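The interaction between an agent and a world model acting as a simulator can be sketched as a simple loop. The example below is a toy under stated assumptions, not the SIMA or Genie 2 interface: ToyWorldModel, toy_agent and run_episode are hypothetical names, the "frame" is reduced to a grid position, and the natural-language instruction is replaced by a goal coordinate.

```python
from dataclasses import dataclass
from typing import List, Tuple

Frame = Tuple[int, int]  # toy "frame": only the agent's (x, y) position

# Keyboard-style actions mapped to grid moves (illustrative assumption).
ACTIONS = {"W": (0, 1), "S": (0, -1), "A": (-1, 0), "D": (1, 0)}


@dataclass
class ToyWorldModel:
    """Hypothetical stand-in for a world model used as a simulator."""
    position: Frame = (0, 0)

    def step(self, key: str) -> Frame:
        """Consume one action and return the next frame."""
        dx, dy = ACTIONS[key]
        self.position = (self.position[0] + dx, self.position[1] + dy)
        return self.position


def toy_agent(frame: Frame, goal: Frame) -> str:
    """Hypothetical policy: move greedily towards the goal, x axis first."""
    x, y = frame
    gx, gy = goal
    if x != gx:
        return "D" if gx > x else "A"
    return "W" if gy > y else "S"


def run_episode(world: ToyWorldModel, goal: Frame, max_steps: int = 20) -> List[Frame]:
    """Alternate between the agent choosing an action and the world model
    producing the next frame, until the goal is reached or steps run out."""
    frame = world.position
    trajectory = [frame]
    for _ in range(max_steps):
        if frame == goal:
            break
        frame = world.step(toy_agent(frame, goal))
        trajectory.append(frame)
    return trajectory


if __name__ == "__main__":
    print(run_episode(ToyWorldModel(), goal=(2, 1)))
```

The point of the sketch is the control flow: the agent only ever sees frames produced by the world model, and the world model only ever advances in response to the agent's actions.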
Reference and real-time variants
The samples in the announcement post come from an undistilled base model, which gives the highest output quality. DeepMind also reports a distilled version that can be played in real time, at reduced output quality. The model weights have not been publicly released.