Google DeepMind RT-2

Vision-language-action (VLA) model by Google DeepMind for robotic control, combining web-scale vision-language pretraining with robotics data, announced July 28, 2023.
🔬 Research, 🔬 Research only, Robotics foundation model, Multimodal, Vision
Parameters
5B / 12B / 55B
Release date
28 July 2023

Overview

Google DeepMind RT-2 (Robotic Transformer 2) is a vision-language-action (VLA) model announced in July 2023. It takes an image and a text command as input and outputs robot actions directly, combining image and language understanding with robot control. The model is trained on both web data and robotics data, so that knowledge learned from the web transfers to generalized robot control. In the paper, actions are represented as text tokens, which allows the model to be trained in much the same way as a large vision-language model. RT-2 is notable for improved generalization to novel objects and commands, and for basic semantic reasoning during robot control. The evaluation covered approximately 6,000 trials.
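
The idea of representing actions as text tokens can be illustrated with a minimal sketch. It discretizes each dimension of a continuous action into integer bins and writes the bin indices as a space-separated string that a language model could emit. The bin count, the eight-dimensional layout, and the value ranges below are illustrative assumptions, not details stated on this page, and the function names are hypothetical.

```python
from typing import List, Sequence

# Illustrative assumptions: 256 uniform bins per dimension and an
# 8-dimensional action with hypothetical value ranges.
NUM_BINS = 256
ACTION_LOW = [0.0, -0.1, -0.1, -0.1, -0.5, -0.5, -0.5, 0.0]
ACTION_HIGH = [1.0, 0.1, 0.1, 0.1, 0.5, 0.5, 0.5, 1.0]


def encode_action(action: Sequence[float]) -> str:
    """Turn a continuous action into a string of integer bin tokens."""
    tokens = []
    for value, low, high in zip(action, ACTION_LOW, ACTION_HIGH):
        clipped = min(max(value, low), high)       # keep value inside range
        unit = (clipped - low) / (high - low)      # rescale to [0, 1]
        bin_index = min(int(unit * NUM_BINS), NUM_BINS - 1)
        tokens.append(str(bin_index))
    return " ".join(tokens)


def decode_action(text: str) -> List[float]:
    """Turn a token string back into bin-centre continuous values."""
    values = []
    for tok, low, high in zip(text.split(), ACTION_LOW, ACTION_HIGH):
        unit = (int(tok) + 0.5) / NUM_BINS         # centre of the bin
        values.append(low + unit * (high - low))
    return values


if __name__ == "__main__":
    tokens = encode_action([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
    print(tokens)
    print(decode_action(tokens))
```

Because the output is ordinary text, the same next-token training objective used for vision-language data can in principle be applied to robot data. The cost is a quantization error of at most one bin width per dimension.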

Classification
Robotics foundation model, Multimodal, Vision
Access & deployment
Weights: Closed
Key parameters
🧩 Parameters: 5B / 12B / 55B
📥 Input: image, text
Robotics
Robot control, Robot manipulation, Dexterous manipulation, Scene understanding, Visual grounding, Embodied task planning, Object affordance understanding, Spatial reasoning

Technical specification

Parameters
5B / 12B / 55B
License
Proprietary
Hardware requirements
RT-2-PaLI-X-55B: served from a multi-node TPU deployment (Google Cloud TPU v4 Pods), with robot control at 1–3 Hz. RT-2-PaLI-X-5B: robot control at 5 Hz. Inference for RT-2-PaLI-X-55B requires a cluster of roughly 8 NVIDIA A100 GPUs or TPU v4 Pods.
Modalities
⬇ Input
image, text
⬆ Output
robot_actions, manipulator_control, robot_commands, text
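
The control rates listed under hardware requirements translate directly into a time budget per action. The sketch below converts a control frequency in Hz into the time available for one inference-and-act cycle. The function name is hypothetical, and the calculation assumes one model query per control step.

```python
def control_period_ms(frequency_hz: float) -> float:
    """Return the time per control step in milliseconds."""
    if frequency_hz <= 0:
        raise ValueError("frequency must be positive")
    return 1000.0 / frequency_hz


if __name__ == "__main__":
    rates = [
        ("RT-2-PaLI-X-55B, lower bound", 1.0),
        ("RT-2-PaLI-X-55B, upper bound", 3.0),
        ("RT-2-PaLI-X-5B", 5.0),
    ]
    for label, hz in rates:
        print(f"{label}: {control_period_ms(hz):.0f} ms per action")
    # RT-2-PaLI-X-55B, lower bound: 1000 ms per action
    # RT-2-PaLI-X-55B, upper bound: 333 ms per action
    # RT-2-PaLI-X-5B: 200 ms per action
```

Under that assumption, the 55B variant issues a new action every 0.33 to 1 second, while the 5B variant issues one roughly every 0.2 seconds.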

Capabilities and applications

Native model capabilities
Image understanding
Analysing and interpreting the content of images.
Category: vision
Multimodal understanding
Category: multimodal
Planning
Forming and executing action plans for complex tasks.
Category: planning
Reasoning
The model's ability to reason logically and solve complex problems.
Category: reasoning
Multi-step reasoning
Carrying out multi-step chains of reasoning across long, complex tasks.
Category: reasoning

Benchmark results

3 benchmarks
Generalization to unseen objects, backgrounds, environments (RT-1 comparison)
Success rate (%) · RT-2-PaLI-X evaluated on tasks involving previously unseen objects, backgrounds, and environments. RT-1 achieved 32% on the same tasks.
62%
📅 28 Jul 2023 · 📄 Brohan et al., arXiv:2307.15818 / Google DeepMind blog (July 2023)
Success rate improved from 32% (RT-1) to 62% (RT-2-PaLI-X-55B). The evaluation comprised approximately 6,000 trials in total.
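
The size of this gain can be expressed in two ways: as an absolute difference in percentage points, or as a ratio to the baseline. The short sketch below computes both from the two success rates reported above. The function name is hypothetical.

```python
from typing import Tuple


def improvement(baseline_pct: float, new_pct: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, ratio to baseline)."""
    if baseline_pct <= 0:
        raise ValueError("baseline must be positive")
    return new_pct - baseline_pct, new_pct / baseline_pct


if __name__ == "__main__":
    gain, ratio = improvement(32.0, 62.0)  # RT-1 vs RT-2-PaLI-X-55B
    print(f"+{gain:.0f} percentage points, {ratio:.2f}x the baseline")
    # +30 percentage points, 1.94x the baseline
```

In other words, the reported success rate on unseen objects, backgrounds, and environments is close to double that of RT-1.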
Emergent skills evaluation (RT-2-PaLI-X-55B vs RT-1 and VC-1)
Relative improvement in success rate vs RT-1 · Evaluation of emergent capabilities: symbolic reasoning, object and person recognition, and semantic reasoning, all categories absent from the robotic training data. RT-2-PaLI-X-55B performs ~3× better than RT-1 and VC-1.
~3×
📅 28 Jul 2023 · 📄 robotics-transformer2.github.io / arXiv:2307.15818
Results from the RT-2 research project. This is not an externally standardized benchmark.
Language Table (simulation)
Success rate (%) · Open-source Language Table benchmark (simulation). Previous SOTA: LAVA 77%, RT-1 74%, BC-Z 72%.
90%
📅 28 Jul 2023 · 📄 Brohan et al., arXiv:2307.15818 / Google DeepMind blog (July 2023)
RT-2 (PaLI-X) achieved 90% in simulation and demonstrated generalization to unseen objects in real-world testing.