
Vision-language-action (VLA) model by Google DeepMind for robotic control, combining web-scale vision-language pretraining with robotics data, announced July 28, 2023.
🔬 Research only · Robotics foundation model · Multimodal · Vision
Parameters
55B / 562B
parameters
Release date
28 July 2023
Overview
Classification
Robotics foundation model, Multimodal, Vision
Access & deployment
Weights: Closed
Key parameters
🧩 Parameters: 55B / 562B
📥 Input: image, text
Robotics
Robot control, Robot manipulation, Dexterous manipulation, Scene understanding, Visual grounding, Embodied task planning, Object affordance understanding, Spatial reasoning
Technical specification
Parameters
55B / 562B
parameters
License
Proprietary
Hardware requirements
RT-2-PaLI-X-55B: served from a multi-node TPU deployment (Google Cloud TPU v4 Pods), with robot control at 1–3 Hz. RT-2-PaLI-X-5B: robot control at 5 Hz. Inference for RT-2-PaLI-X-55B requires a cluster of roughly 8 NVIDIA A100 GPUs or TPU v4 Pods.
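The control rates above translate directly into a time budget per control step (budget = 1 / rate). The short sketch below only performs that arithmetic on the rates quoted in this section; the helper name `step_budget_ms` is illustrative and not part of any RT-2 tooling.

```python
def step_budget_ms(control_hz: float) -> float:
    """Return the time available per control step, in milliseconds."""
    if control_hz <= 0:
        raise ValueError("control_hz must be positive")
    return 1000.0 / control_hz


# Control rates quoted in the hardware requirements above.
rates_hz = {
    "RT-2-PaLI-X-55B (1 Hz)": 1.0,
    "RT-2-PaLI-X-55B (3 Hz)": 3.0,
    "RT-2-PaLI-X-5B (5 Hz)": 5.0,
}

budgets_ms = {name: step_budget_ms(hz) for name, hz in rates_hz.items()}

for name, ms in budgets_ms.items():
    print(f"{name}: about {ms:.0f} ms per control step")
```

At these rates the 55B model has between roughly a third of a second and a full second per step, while the 5B model has about a fifth of a second.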
Modalities
⬇ Input
image, text
⬆ Output
robot_actions, manipulator_control, robot_commands, text
Capabilities and applications
Native model capabilities
Image understanding
Analysing and interpreting the content of images.
Category: vision
Multimodal understanding
Category: multimodal
Planning
Forming and executing action plans for complex tasks.
Category: planning
Reasoning
The model's ability to reason logically and solve complex problems.
Category: reasoning
Multi-step reasoning
Carrying out multi-step chains of reasoning across long, complex tasks.
Category: reasoning
Robotics
Robot control, Robot manipulation, Dexterous manipulation, Scene understanding, Visual grounding, Embodied task planning, Object affordance understanding, Spatial reasoning
Benchmark results
3 benchmarks
Generalization to unseen objects, backgrounds, environments (RT-1 comparison)
Success rate (%) · RT-2-PaLI-X evaluated on tasks involving previously unseen objects, backgrounds, and environments. RT-1 achieved 32% on the same tasks.
62%
📅 28 Jul 2023 · 📄 Brohan et al., arXiv:2307.15818 / Google DeepMind blog (July 2023)
RT-2-PaLI-X-55B raised the success rate from 32% (RT-1) to 62%. The RT-2 evaluation comprised 6,000 trials in total.
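To put the 32% and 62% figures in perspective, the sketch below works out the absolute gain in percentage points and the relative improvement. It uses only the two success rates reported above; the variable names are illustrative.

```python
# Success rates (%) on unseen objects, backgrounds, and environments, as reported above.
rt1_rate = 32.0
rt2_rate = 62.0

# Absolute gain in percentage points.
absolute_gain_pp = rt2_rate - rt1_rate

# Relative improvement: ratio of the two rates, and the gain as a share of the RT-1 rate.
relative_factor = rt2_rate / rt1_rate
relative_gain_pct = (rt2_rate - rt1_rate) / rt1_rate * 100.0

print(f"Absolute gain: {absolute_gain_pp:.0f} percentage points")
print(f"Relative factor: {relative_factor:.2f}x")
print(f"Relative gain: {relative_gain_pct:.1f}%")
```

In other words, the reported result is a gain of 30 percentage points, or a little under double the RT-1 success rate.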
Emergent skills evaluation (RT-2-PaLI-X-55B vs RT-1 and VC-1)
Relative improvement in success rate vs RT-1 · Evaluation of emergent capabilities: symbolic reasoning, object and person recognition, and semantic reasoning, all categories absent from the robotic training data. RT-2-PaLI-X-55B performs ~3× better than RT-1 and VC-1.
~3×
📅 28 Jul 2023 · 📄 robotics-transformer2.github.io / arXiv:2307.15818
Results from the RT-2 research project. This is not an externally standardized benchmark.
Language Table (simulation)
Success rate (%) · Open-source Language Table benchmark (simulation). Previous SOTA: LAVA 77%, RT-1 74%, BC-Z 72%.
90%
📅 28 Jul 2023 · 📄 Brohan et al., arXiv:2307.15818 / Google DeepMind blog (July 2023)
RT-2 (evaluated on Language Table with the smaller PaLI-3B backbone) achieved 90% in simulation and demonstrated generalization to unseen objects in real-world testing.
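The Language Table figures above can be compared side by side. The sketch below computes the margin of the RT-2 result over each earlier method, in percentage points, using only the numbers listed in this entry; the variable names are illustrative.

```python
# Success rates (%) on Language Table in simulation, as listed above.
baselines = {"LAVA": 77, "RT-1": 74, "BC-Z": 72}
rt2_rate = 90

# Margin of the RT-2 result over each earlier method, in percentage points.
margins_pp = {name: rt2_rate - rate for name, rate in baselines.items()}

# Strongest earlier baseline and the margin over it.
best_prior = max(baselines, key=baselines.get)

for name, margin in margins_pp.items():
    print(f"RT-2 vs {name}: +{margin} percentage points")
print(f"Strongest earlier baseline: {best_prior} (+{margins_pp[best_prior]} pp)")
```

Against the strongest earlier baseline listed (LAVA), the margin is 13 percentage points.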
Sources and related pages
5 sources
Paper: RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
Web: RT-2: New model translates vision and language into action – Google DeepMind
Web: RT-2 Project Website – robotics-transformer2.github.io
Web: What is RT-2? – Google Blog (July 2023)
Report: RT-2 paper PDF – robotics-transformer2.github.io/assets/rt2.pdf