Vision-Language-Action (VLA) model by Google DeepMind that converts visual inputs and language instructions into motor commands for robots.
Preview, Limited access, Multimodal, Robotics foundation model, Vision-Language-Action model, Gemini
Context window: 32K tokens
Release date: 14 April 2026
Access: Hosted
Deployment: Cloud
Overview
Applications
Access & deployment
Access: Hosted
Deployment: Cloud
Weights: Closed
Key parameters
Context: 32K
Input: text, image
Robotics
Dexterous manipulation, Robot manipulation, Robot control, Embodied task planning, Visual grounding, Bimanual manipulation, Motion planning
Technical specification
Context window: 32K tokens
Modalities
Input: text, image
Output: text, action
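The modality listing above can be captured as a small data structure. The sketch below is illustrative only: `RequestSpec` and `unsupported_modalities` are hypothetical names, not part of any Google DeepMind SDK, and the check simply compares a planned request against the input and output modalities this page lists.

```python
from dataclasses import dataclass

# Modalities exactly as listed in the specification above.
SUPPORTED_INPUTS = frozenset({"text", "image"})
SUPPORTED_OUTPUTS = frozenset({"text", "action"})


@dataclass(frozen=True)
class RequestSpec:
    """Hypothetical description of a planned request (not a real SDK type)."""
    inputs: frozenset
    outputs: frozenset


def unsupported_modalities(spec: RequestSpec) -> set:
    """Return the modalities in `spec` that the listing does not mention."""
    bad_inputs = spec.inputs - SUPPORTED_INPUTS
    bad_outputs = spec.outputs - SUPPORTED_OUTPUTS
    return set(bad_inputs | bad_outputs)


ok = RequestSpec(frozenset({"text", "image"}), frozenset({"action"}))
bad = RequestSpec(frozenset({"text", "audio"}), frozenset({"action"}))
print(unsupported_modalities(ok))   # set()
print(unsupported_modalities(bad))  # {'audio'}
```

A check like this is only as accurate as the listing itself, so it should be treated as a planning aid and not as a statement of what the hosted service accepts.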
Capabilities and applications
Native model capabilities
Reasoning (category: reasoning)
Multi-step reasoning (category: reasoning)
Planning (category: planning)
Image understanding (category: vision)
Multimodal understanding (category: multimodal)
Multilingual (category: language)
Robotics
Dexterous manipulation, Robot manipulation, Robot control, Embodied task planning, Visual grounding, Bimanual manipulation, Motion planning
Application domains
Benchmark results
5 benchmarks
Generalization: In-Distribution (internal)
Metric: progress score, robotic manipulation tasks
Score: 0.83 (scale 0-1)
Source: https://deepmind.google/models/gemini-robotics/gemini-robotics/
Note: Compares Gemini Robotics 1.5 with prior versions. The score of 0.83 outperforms Gemini Robotics and On-Device.
Generalization: Instruction Generalization (internal)
Metric: progress score
Score: 0.76 (scale 0-1)
Source: https://deepmind.google/models/gemini-robotics/gemini-robotics/
Generalization: Action Generalization (internal)
Metric: progress score
Score: 0.54 (scale 0-1)
Source: https://deepmind.google/models/gemini-robotics/gemini-robotics/
Generalization: Visual Generalization (internal)
Metric: progress score
Score: 0.81 (scale 0-1)
Source: https://deepmind.google/models/gemini-robotics/gemini-robotics/
Generalization: Task Generalization (internal)
Metric: progress score
Score: 0.70 (scale 0-1)
Source: https://deepmind.google/models/gemini-robotics/gemini-robotics/
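To compare the five progress scores at a glance, the short sketch below tabulates them and derives two convenience figures: an unweighted mean and each category's gap to the in-distribution score. These derived figures are not reported by the source; they are plain arithmetic on the listed values, and the names `PROGRESS_SCORES` and `summarize` are illustrative.

```python
from statistics import mean

# Progress scores (0-1 scale) as reported in the listing above.
PROGRESS_SCORES = {
    "In-Distribution": 0.83,
    "Instruction Generalization": 0.76,
    "Action Generalization": 0.54,
    "Visual Generalization": 0.81,
    "Task Generalization": 0.70,
}


def summarize(scores: dict) -> dict:
    """Rank the categories and compute simple derived figures.

    The unweighted mean is an assumption made for illustration; the
    source does not publish an aggregate score.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    baseline = scores["In-Distribution"]
    gaps = {
        name: round(baseline - value, 2)
        for name, value in scores.items()
        if name != "In-Distribution"
    }
    return {
        "mean": round(mean(scores.values()), 3),
        "best": ranked[0][0],
        "worst": ranked[-1][0],
        "gap_to_in_distribution": gaps,
    }


summary = summarize(PROGRESS_SCORES)
print(summary["mean"])   # 0.728
print(summary["best"])   # In-Distribution
print(summary["worst"])  # Action Generalization
```

On these numbers, Action Generalization shows the largest drop from the in-distribution score (0.29), while Visual Generalization shows the smallest (0.02).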
