Spatial intelligence is realized through world models that learn 3D representations from multimodal data: images, video, depth-sensor readings, text descriptions and interaction logs. A typical pipeline combines four stages: (1) 3D perception (NeRF, Gaussian Splatting, depth models), which recovers geometry from 2D inputs; (2) a world representation, held either as a latent space or as an explicit 3D mesh; (3) reasoning and dynamics prediction over that representation using transformers or diffusion models; and (4) action, expressed as generated images, 3D scenes or robot policies (Vision-Language-Action models). These models are trained on large video and embodied datasets so that they capture both the appearance and the physics of the world.
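The perception stage can be illustrated with its simplest case: a depth model has already predicted a per-pixel depth map, and each pixel is lifted into 3D with the pinhole camera model. The sketch below is a minimal illustration under that assumption, not a description of any particular system. The names `Intrinsics`, `backproject` and `project`, and the toy camera values, are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intrinsics:
    """Pinhole camera parameters in pixels (illustrative values only)."""
    fx: float  # focal length along the image x axis
    fy: float  # focal length along the image y axis
    cx: float  # principal point, x coordinate
    cy: float  # principal point, y coordinate


def backproject(depth, cam):
    """Lift a depth map (list of rows of depths) to a list of (x, y, z) points.

    Pixels whose depth is zero or negative are treated as invalid and skipped.
    """
    points = []
    for v, row in enumerate(depth):        # v: pixel row
        for u, z in enumerate(row):        # u: pixel column
            if z <= 0:
                continue
            x = (u - cam.cx) * z / cam.fx  # invert u = fx * x / z + cx
            y = (v - cam.cy) * z / cam.fy  # invert v = fy * y / z + cy
            points.append((x, y, z))
    return points


def project(point, cam):
    """Project a 3D camera-frame point back to pixel coordinates."""
    x, y, z = point
    return (cam.fx * x / z + cam.cx, cam.fy * y / z + cam.cy)


# Toy 2x3 depth map; the 0.0 entry marks a pixel with no valid depth.
cam = Intrinsics(fx=2.0, fy=2.0, cx=1.0, cy=1.0)
depth = [
    [2.0, 2.0, 0.0],
    [2.0, 4.0, 2.0],
]
cloud = backproject(depth, cam)
```

Projecting each point in `cloud` back through `project` returns the pixel it came from, which is a quick consistency check on the geometry. The resulting point cloud is one possible explicit form of the world representation in stage (2); NeRF and Gaussian Splatting replace it with learned radiance fields or Gaussians optimized from many views.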
Classical AI models handle text and 2D images well but struggle with the three-dimensional structure of the world: scene geometry, physics and the consequences of physical action. Spatial intelligence addresses this gap by giving machines 3D representations sufficient for reasoning, planning and acting in space, a prerequisite for general-purpose robotics, immersive environments and generative 3D graphics.
The term 'spatial intelligence' originates in cognitive psychology as one of Gardner's multiple intelligences.
Mildenhall et al. publish NeRF (Neural Radiance Fields) in 2020, a breakthrough in neural 3D reconstruction from 2D images. Ben Mildenhall later co-founds World Labs.
Kerbl et al. introduce 3D Gaussian Splatting in 2023, a fast, photorealistic 3D representation that becomes critical for scalable spatial perception.
In April 2024 Fei-Fei Li gives the TED Talk 'With Spatial Intelligence, AI Will Understand the Real World'. In September 2024 she unveils World Labs as a spatial intelligence company, canonizing the term within the AI industry.
DeepMind unveils generative interactive world models as a parallel realization of the spatial intelligence paradigm.
World Labs ships Marble - a product generating spatially coherent, persistent 3D worlds from a single image, video or text prompt.