Genie 2

2 · Family: Genie
Foundation world model from Google DeepMind that generates action-controllable 3D environments from a single prompt image. It produces consistent worlds for up to about a minute, controlled with keyboard and mouse.
Research · Research only · World Model · Genie
Release date
4 December 2024
Access: Hosted · Deployment: Cloud

Overview

Genie 2 is a foundation world model developed by Google DeepMind, introduced on 4 December 2024. The model generates action-controllable, playable 3D environments from a single prompt image (for example, one generated by Imagen 3). These environments can be played by a human or an AI agent using keyboard and mouse inputs.

Architecture

Genie 2 is an autoregressive latent-diffusion model trained on a large video dataset. After passing through an autoencoder, latent video frames are fed to a large transformer dynamics model trained with a causal mask, analogous to that used by large language models. At inference time, Genie 2 is sampled autoregressively, taking individual actions and past latent frames on a frame-by-frame basis. Classifier-free guidance is used to improve action controllability.
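
The sampling procedure described above can be illustrated with a toy sketch. Everything in it is hypothetical: `toy_dynamics`, the two-element latent, and the guidance weight are stand-ins invented for illustration, not DeepMind's model or any published API, and the diffusion denoising steps are left out entirely. It shows only two ideas from the paragraph: classifier-free guidance as a blend of an action-conditioned and an unconditioned prediction, and an autoregressive rollout in which each new latent frame depends only on past frames and the current action.

```python
from typing import Callable, List, Optional, Sequence

Latent = List[float]


def cfg_combine(uncond: Sequence[float], cond: Sequence[float], weight: float) -> Latent:
    """Classifier-free guidance: start from the unconditioned prediction and
    move toward the conditioned one, scaled by the guidance weight."""
    return [u + weight * (c - u) for u, c in zip(uncond, cond)]


def toy_dynamics(history: Sequence[Latent], action: Optional[int]) -> Latent:
    """Hypothetical stand-in for the dynamics model (an assumption, not Genie 2).

    It sees only past latent frames, mirroring the causal mask, plus an
    optional action. Passing action=None plays the role of the unconditioned pass.
    """
    last = history[-1]
    shift = 0.0 if action is None else float(action)
    return [0.5 * x + shift for x in last]


def rollout(
    first_latent: Sequence[float],
    actions: Sequence[int],
    model: Callable[[Sequence[Latent], Optional[int]], Latent] = toy_dynamics,
    guidance_weight: float = 2.0,
) -> List[Latent]:
    """Sample frame by frame: one action in, one new latent frame out."""
    frames: List[Latent] = [list(first_latent)]
    for action in actions:
        uncond = model(frames, None)    # prediction that ignores the action
        cond = model(frames, action)    # prediction that uses the action
        frames.append(cfg_combine(uncond, cond, guidance_weight))
    return frames


frames = rollout([1.0, -1.0], actions=[1, 0, -1])
print(len(frames))  # one prompt frame plus one frame per action
```

With a guidance weight of 1 the blend reduces to the conditioned prediction, and with 0 to the unconditioned one. Weights above 1 exaggerate the effect of the action, which is the sense in which guidance can strengthen controllability.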

Capabilities

Genie 2 produces consistent worlds for up to about one minute (most demonstration examples last 10–20 s) and exhibits a range of emergent properties: identifying the controllable character in a scene, generating counterfactual trajectories from the same starting frame, long-horizon memory (correctly rendering regions briefly out of view), character animation, NPC modelling, physics effects (water, smoke, gravity), point and directional lighting, reflections, bloom and object interactions with proper affordances (opening doors, popping balloons). The model also works with real-world photographs as prompts.

Research applications

Genie 2 is used to generate an unlimited curriculum of novel worlds for training and evaluating embodied agents. In the DeepMind release, the SIMA agent was shown navigating Genie 2-synthesised environments from a single prompt image, controlled via natural-language instructions, with Genie 2 acting as a frame-by-frame simulator that responds to SIMA's actions. The model also enables rapid prototyping of scenes and visual concepts by artists and designers.
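
The agent-in-the-loop setup described above can be sketched as a plain simulator loop. This is a minimal illustration under stated assumptions: `ToyWorldModel` and `ToyAgent` are hypothetical classes, a "frame" is reduced to an (x, y) position, and the instruction string is carried only as a label with the goal supplied explicitly. None of this reflects the actual SIMA or Genie 2 interfaces, which are not public.

```python
from typing import List, Tuple

Frame = Tuple[int, int]  # toy "frame": just the character's (x, y) position


class ToyWorldModel:
    """Hypothetical world model acting as a frame-by-frame simulator."""

    MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

    def __init__(self, prompt_frame: Frame) -> None:
        # The single prompt frame defines the starting state of the world.
        self.frame = prompt_frame

    def step(self, action: str) -> Frame:
        """Produce the next frame in response to one agent action."""
        dx, dy = self.MOVES[action]
        self.frame = (self.frame[0] + dx, self.frame[1] + dy)
        return self.frame


class ToyAgent:
    """Hypothetical instruction-following agent (goal is given explicitly)."""

    def __init__(self, instruction: str, goal: Frame) -> None:
        self.instruction = instruction  # kept as a label only in this sketch
        self.goal = goal

    def act(self, frame: Frame) -> str:
        """Pick one move that brings the character closer to the goal."""
        x, y = frame
        gx, gy = self.goal
        if x < gx:
            return "right"
        if x > gx:
            return "left"
        return "up" if y < gy else "down"


def run_episode(world: ToyWorldModel, agent: ToyAgent, max_steps: int = 10) -> List[Frame]:
    """Alternate agent actions and simulated frames until the goal or step limit."""
    frames = [world.frame]
    for _ in range(max_steps):
        if frames[-1] == agent.goal:
            break
        frames.append(world.step(agent.act(frames[-1])))
    return frames


trajectory = run_episode(ToyWorldModel((0, 0)), ToyAgent("go to the door", goal=(2, 1)))
print(trajectory)
```

The point of the loop structure is that the world model never sees the instruction; it only reacts to actions, so the same simulator can serve different agents and tasks.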

Reference and real-time variants

The samples in the announcement post come from an undistilled base model (highest quality). DeepMind also reports a distilled real-time playable version with reduced output quality. The model weights are not publicly released.

Classification
World Model
Family: Genie
Access & deployment
Hosted
Cloud
Weights: Closed
Key parameters
Input: image, structured data
Robotics
Environment modeling · Spatial prediction · Scene understanding

Technical specification

Modalities
Input
image, structured_data
Output
video

Capabilities and applications

Native model capabilities
Video understanding
The model's ability to analyse and interpret video content: recognising actions, motion, events and relationships between objects over time.
Category: video
Planning
The model's ability to determine a sequence of actions leading to a goal: predicting the consequences of actions and selecting an optimal path in a given environment.
Category: planning
Robotics
Environment modeling · Spatial prediction · Scene understanding

Technical architecture