Genie 2

2 · Family: Genie

Foundation world model from Google DeepMind that generates action-controllable 3D environments from a single prompt image. The generated worlds stay consistent for up to about a minute and are controlled with keyboard and mouse.

🔬 Research · 🔬 Research only · World Model · 📁 Genie

Release date

4 December 2024

Producer: 🏢 Google DeepMind

Access: Hosted · Deployment: ☁ Cloud

Overview

Genie 2 is a foundation world model developed by Google DeepMind, introduced on 4 December 2024. The model generates action-controllable, playable 3D environments from a single prompt image (for example, one generated by Imagen 3). The resulting environments can be played by a human or an AI agent using keyboard and mouse inputs.

Architecture

Genie 2 is an autoregressive latent-diffusion model trained on a large video dataset. After passing through an autoencoder, latent video frames are fed to a large transformer dynamics model trained with a causal mask, analogous to that used by large language models. At inference time, Genie 2 is sampled autoregressively, taking individual actions and past latent frames on a frame-by-frame basis. Classifier-free guidance is used to improve action controllability.
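The sampling procedure described above can be illustrated with a toy sketch. It is not DeepMind's implementation: `toy_dynamics` is a hypothetical stand-in for the transformer dynamics model, latents are plain lists of floats, and the diffusion denoising loop is collapsed into a single prediction per frame. In a real latent-diffusion model the guidance formula would typically be applied to the network's output at each denoising step. What the sketch does show is the overall shape: each new latent frame depends only on past frames plus the current action, and classifier-free guidance extrapolates from an action-free prediction toward the action-conditioned one.

```python
from typing import Callable, List, Optional, Sequence

Latent = List[float]
# Hypothetical model signature: (past latent frames, action or None) -> next latent
Dynamics = Callable[[Sequence[Latent], Optional[int]], Latent]


def cfg_combine(uncond: Sequence[float], cond: Sequence[float], scale: float) -> Latent:
    """Classifier-free guidance: move from the unconditional prediction
    toward the conditional one; scale=1.0 returns the conditional prediction."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


def toy_dynamics(past: Sequence[Latent], action: Optional[int]) -> Latent:
    """Toy stand-in for the dynamics model (an assumption, not Genie 2):
    decay the most recent latent and add the action as a drift term.
    action=None plays the role of the dropped (unconditional) action."""
    drift = 0.0 if action is None else float(action)
    return [0.5 * x + drift for x in past[-1]]


def rollout(first: Latent, actions: Sequence[int], model: Dynamics, scale: float = 2.0) -> List[Latent]:
    """Autoregressive sampling: one action in, one latent frame out.
    The model only ever sees frames generated so far (causal)."""
    frames: List[Latent] = [list(first)]
    for action in actions:
        cond = model(frames, action)   # action-conditioned prediction
        uncond = model(frames, None)   # prediction with the action dropped
        frames.append(cfg_combine(uncond, cond, scale))
    return frames


frames = rollout([1.0, 2.0], actions=[1, -1], model=toy_dynamics, scale=2.0)
print(frames)
```

With a guidance scale above 1 the effect of each action on the next frame is exaggerated relative to the plain conditional prediction, which is the sense in which guidance can sharpen action controllability.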

Capabilities

Genie 2 produces consistent worlds for up to about one minute (most demonstration examples last 10–20 s) and exhibits a range of emergent properties: identifying the controllable character in a scene, generating counterfactual trajectories from the same starting frame, long-horizon memory (correctly rendering regions briefly out of view), character animation, NPC modelling, physics effects (water, smoke, gravity), point and directional lighting, reflections, bloom and object interactions with proper affordances (opening doors, popping balloons). The model also works with real-world photographs as prompts.

Research applications

Genie 2 is used to generate an unlimited curriculum of novel worlds for training and evaluating embodied agents. In the DeepMind release, the SIMA agent was shown navigating Genie 2-synthesised environments from a single prompt image, controlled via natural-language instructions, with Genie 2 acting as a frame-by-frame simulator that responds to SIMA's actions. The model also enables rapid prototyping of scenes and visual concepts by artists and designers.
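The agent and simulator interaction described above follows the usual environment loop, sketched below under loud assumptions. Neither Genie 2 nor SIMA exposes a public API, so `WorldModel`, `Agent`, `ToyWorld`, `ToyAgent` and `run_episode` are all hypothetical names, and frames are represented as strings rather than images. The only point is the control flow: the world model is initialised from one prompt image, then the agent and the world model alternate, one action and one frame at a time.

```python
from typing import List, Protocol


class WorldModel(Protocol):
    """Hypothetical interface for a frame-by-frame simulator."""
    def reset(self, prompt_image: str) -> str: ...
    def step(self, action: str) -> str: ...


class Agent(Protocol):
    """Hypothetical interface for an instruction-following agent."""
    def act(self, frame: str, instruction: str) -> str: ...


class ToyWorld:
    """Toy simulator: tracks a 1-D position instead of rendering frames."""
    def __init__(self) -> None:
        self.position = 0

    def reset(self, prompt_image: str) -> str:
        # A real world model would encode the prompt image here.
        self.position = 0
        return "frame@0"

    def step(self, action: str) -> str:
        if action == "W":  # "W" stands in for a forward key press
            self.position += 1
        return f"frame@{self.position}"


class ToyAgent:
    """Toy policy: ignores the frame, presses forward if asked to."""
    def act(self, frame: str, instruction: str) -> str:
        return "W" if "forward" in instruction else "NOOP"


def run_episode(world: WorldModel, agent: Agent, prompt_image: str,
                instruction: str, max_steps: int) -> List[str]:
    """Alternate agent actions and simulator frames for max_steps steps."""
    frame = world.reset(prompt_image)
    trajectory = [frame]
    for _ in range(max_steps):
        action = agent.act(frame, instruction)
        frame = world.step(action)
        trajectory.append(frame)
    return trajectory


print(run_episode(ToyWorld(), ToyAgent(), "prompt.png", "walk forward", 3))
```

Because the simulator is reset from a prompt image, swapping the image yields a new environment without changing the loop, which is what makes the curriculum idea cheap to express.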

Reference and real-time variants

The samples in the announcement post come from an undistilled base model (highest quality). DeepMind also reports a distilled real-time playable version with reduced output quality. The model weights are not publicly released.

Classification

World Model

Family: Genie

Access & deployment

Hosted

Cloud

Weights: Closed

Key parameters

📥 Input: image, structured data

Robotics

Environment modeling · Spatial prediction · Scene understanding

Technical specification

Modalities

⬇ Input

image, structured_data

⬆ Output

video

Capabilities and applications

Native model capabilities

Video understanding

The model's ability to analyse and interpret video content: recognising actions, motion, events and relationships between objects over time.

Category: video

Planning

Forming and executing action plans for complex tasks.

Category: planning

Technical architecture

Core Architecture

Diffusion Model (DM) · Transformer (TR)

Model Form

World Models (WM) · WAM (WA)

Sources and related pages

3 sources

Blog: Genie 2: A large-scale foundation world model (Google DeepMind), deepmind.google
Web: Genie (Google DeepMind models page), deepmind.google
Web: Project Genie (Google Labs), labs.google
