Vision-Language Model by Google DeepMind with advanced spatial and embodied reasoning, designed for robotics applications.
Context window: 128K tokens
Max output: 64,000 tokens
Release date: 14 April 2026
Access: API, Hosted
Deployment: Cloud
Overview
Applications
Access & deployment
API, Hosted
Cloud
Weights: Closed
Key parameters
Context: 128K
Tools
Input: text, image, audio, video
Robotics
Spatial reasoning, scene understanding, embodied task planning, visual grounding, object affordance understanding, spatial prediction
Technical specification
Context window: 128K tokens
Max output tokens: 64,000 tokens per response
Features: Tool use
Modalities
Input: text, image, audio, video
Output: text
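The limits in this specification can be turned into a simple pre-flight check. The following is a minimal sketch, not part of any official SDK: `RequestPlan` and `check_request` are hypothetical names, reading "128K" as 128,000 tokens is an assumption, and so is the idea that prompt and output tokens share the context window (the page does not say either way).

```python
from __future__ import annotations

from dataclasses import dataclass

# Figures taken from the specification above. Reading "128K" as 128,000
# tokens is an assumption; the page does not say 128,000 versus 131,072.
CONTEXT_WINDOW_TOKENS = 128_000
MAX_OUTPUT_TOKENS = 64_000
INPUT_MODALITIES = frozenset({"text", "image", "audio", "video"})


@dataclass(frozen=True)
class RequestPlan:
    """Hypothetical description of a planned request (not an API type)."""
    prompt_tokens: int
    max_output_tokens: int
    modalities: frozenset


def check_request(plan: RequestPlan) -> list[str]:
    """Return a list of problems; an empty list means the plan fits."""
    problems = []
    unsupported = plan.modalities - INPUT_MODALITIES
    if unsupported:
        problems.append(f"unsupported input modalities: {sorted(unsupported)}")
    if plan.max_output_tokens > MAX_OUTPUT_TOKENS:
        problems.append("requested output exceeds 64,000 tokens")
    # Assumption: prompt and output tokens share the same context window.
    if plan.prompt_tokens + plan.max_output_tokens > CONTEXT_WINDOW_TOKENS:
        problems.append("prompt plus output exceeds the context window")
    return problems


if __name__ == "__main__":
    plan = RequestPlan(100_000, 20_000, frozenset({"text", "image"}))
    print(check_request(plan))
```

Token counts would have to come from whatever tokenizer or counting endpoint the service provides; the sketch only compares numbers that the caller supplies.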
Capabilities and applications
Native model capabilities
Reasoning (category: reasoning)
Multi-step reasoning (category: reasoning)
Planning (category: planning)
Image understanding (category: vision)
Multimodal understanding (category: multimodal)
Function calling (category: planning)
Structured output (category: structured_generation)
Video understanding (category: video)
Audio understanding (category: audio)
Robotics
Spatial reasoning, scene understanding, embodied task planning, visual grounding, object affordance understanding, spatial prediction
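Visual grounding combined with structured output usually means the model returns image locations as data that a robot stack must convert to pixels. The sketch below illustrates that conversion only. The reply shape (a JSON list of `{"point": [y, x], "label": ...}` entries normalized to a 0 to 1000 range) is an assumption made for illustration; this page does not specify any output format, and `to_pixels` is a hypothetical helper.

```python
import json


def to_pixels(reply_text, image_width, image_height, scale=1000):
    """Convert assumed normalized [y, x] points into pixel (x, y) coordinates.

    The input format is hypothetical: a JSON list of objects, each with a
    "point" given as [y, x] in the range 0..scale and a text "label".
    """
    grounded = []
    for item in json.loads(reply_text):
        y_norm, x_norm = item["point"]
        x_px = round(x_norm / scale * image_width)
        y_px = round(y_norm / scale * image_height)
        grounded.append((item["label"], x_px, y_px))
    return grounded


if __name__ == "__main__":
    # Hand-written sample reply, not real model output.
    sample = '[{"point": [500, 250], "label": "gauge needle"}]'
    print(to_pixels(sample, image_width=640, image_height=480))
    # prints [('gauge needle', 160, 240)]
```

Keeping the normalization scale as a parameter makes it easy to adapt the helper if the real output uses a different convention.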
Application domains
Benchmark results
2 benchmarks
Instrument Reading (internal, agentic vision disabled)
Success rate: 86% (agentic vision disabled)
Source: https://deepmind.google/blog/gemini-robotics-er-1-6/
Score for Gemini Robotics-ER 1.6 without agentic vision. For comparison, Gemini Robotics-ER 1.5 scores 23% and Gemini 3.0 Flash scores 67%.
Instrument Reading (internal, agentic vision enabled)
Success rate: 93% (agentic vision enabled: zoom + code execution)
Source: https://deepmind.google/blog/gemini-robotics-er-1-6/
Score with agentic vision mode, which combines visual reasoning with code execution.
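To put the reported numbers side by side, the short sketch below computes percentage-point gains and failure rates from the four scores listed above. The dictionary keys and helper names are made up for this example, and it assumes the two comparison scores were measured under the same conditions as the 86% result, which the page implies but does not state outright.

```python
# Instrument Reading success rates (percent), as reported above.
scores = {
    "er_1_6_agentic": 93,    # Gemini Robotics-ER 1.6, agentic vision enabled
    "er_1_6": 86,            # Gemini Robotics-ER 1.6, agentic vision disabled
    "gemini_3_0_flash": 67,  # comparison score
    "er_1_5": 23,            # comparison score
}


def point_gain(model, baseline):
    """Difference in success rate, in percentage points."""
    return scores[model] - scores[baseline]


def failure_rate(model):
    """Share of attempts that did not succeed, in percent."""
    return 100 - scores[model]


if __name__ == "__main__":
    print(point_gain("er_1_6", "er_1_5"))            # prints 63
    print(point_gain("er_1_6", "gemini_3_0_flash"))  # prints 19
    print(point_gain("er_1_6_agentic", "er_1_6"))    # prints 7
    print(failure_rate("er_1_6"), failure_rate("er_1_6_agentic"))  # prints 14 7
```

Read this way, enabling agentic vision adds 7 points of success rate, which corresponds to halving the failure rate from 14% to 7%.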
Technical architecture
Core Architecture
Model Form
Training Techniques
