Google DeepMind RT-2

Vision-language-action (VLA) model by Google DeepMind for robotic control, combining web-scale vision-language pretraining with robotics data, announced July 28, 2023.

🔬 Research · 🔬 Research only · Robotics foundation model · Multimodal · Vision

Parameters

5B / 12B / 55B parameters

Release date

28 July 2023

🔬 Google DeepMind (research lab) · 🏢 Google (owner)

Overview

Google DeepMind RT-2 (Robotic Transformer 2) is a vision-language-action (VLA) model announced in July 2023. It combines image and language understanding with direct robot control: the model learns from both web data and robotics data, then transfers that knowledge into generalized instructions for robot control. DeepMind describes RT-2 as a model that translates an image and a text command into robot actions. In the paper, actions are represented as text tokens, so the model can be trained in the same way as a large vision-language model. RT-2 is notable for improved generalization to novel objects and commands and for basic semantic reasoning during robot control. The evaluation covered approximately 6,000 test trials.
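The idea of representing actions as text tokens can be illustrated with a minimal sketch. It assumes each continuous action dimension is split into 256 uniform bins and written out as an integer, which matches how the RT-2 paper describes its discretization; the value ranges, the 3-dimensional example action, and the function names `action_to_text` and `text_to_action` are hypothetical and are not DeepMind's code.

```python
NUM_BINS = 256  # assumed number of uniform bins per action dimension


def action_to_text(action, low, high, num_bins=NUM_BINS):
    """Encode continuous action values as a space-separated string of bin indices."""
    tokens = []
    for value, lo, hi in zip(action, low, high):
        clipped = min(max(value, lo), hi)      # keep the value inside its range
        fraction = (clipped - lo) / (hi - lo)  # 0.0 at lo, 1.0 at hi
        tokens.append(str(round(fraction * (num_bins - 1))))
    return " ".join(tokens)


def text_to_action(text, low, high, num_bins=NUM_BINS):
    """Decode a string of bin indices back into approximate continuous values."""
    return [
        lo + int(token) / (num_bins - 1) * (hi - lo)
        for token, lo, hi in zip(text.split(), low, high)
    ]


# Hypothetical 3-dimensional action with an illustrative range of [-1, 1] per dimension.
low, high = [-1.0] * 3, [1.0] * 3
encoded = action_to_text([-1.0, 0.25, 1.0], low, high)
decoded = text_to_action(encoded, low, high)
print(encoded)  # a plain string that a language model could emit
print(decoded)  # approximate values recovered from that string
```

Because the encoded action is an ordinary string, it can sit in the same output vocabulary as natural-language text. Decoding loses at most half a bin width of precision per dimension.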

Classification

Robotics foundation model · Multimodal · Vision

Access & deployment

Weights: Closed

Key parameters

🧩 Parameters: 5B / 12B / 55B

📥 Input: image, text

Robotics

Robot control · Robot manipulation · Dexterous manipulation · Scene understanding · Visual grounding · Embodied task planning · Object affordance understanding · Spatial reasoning

Technical specification

Parameters

5B / 12B / 55B parameters

License

Proprietary

Hardware requirements

RT-2-PaLI-X-55B runs on multi-node TPU infrastructure (Google Cloud TPU v4 Pods) and controls the robot at 1–3 Hz. RT-2-PaLI-X-5B runs at about 5 Hz. Inference for RT-2-PaLI-X-55B requires a cluster of roughly 8 NVIDIA A100 GPUs or TPU v4 Pods.
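To make the control rates concrete, the short sketch below converts a control frequency into the time available per action. It is plain arithmetic on the figures quoted above; the function name `control_period_ms` is hypothetical.

```python
def control_period_ms(frequency_hz: float) -> float:
    """Return the time budget per control step, in milliseconds."""
    if frequency_hz <= 0:
        raise ValueError("frequency must be positive")
    return 1000.0 / frequency_hz


# Control rates quoted in the hardware notes above.
for label, hz in [("55B, lower bound", 1.0), ("55B, upper bound", 3.0), ("5B", 5.0)]:
    print(f"{label}: {control_period_ms(hz):.0f} ms per action")
```

At 1–3 Hz the largest model has roughly a third of a second to a full second for each action, which covers image capture, the network round trip to the accelerators, and token decoding.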

Modalities

⬇ Input

image, text

⬆ Output

robot_actions, manipulator_control, robot_commands, text

Capabilities and applications

Native model capabilities

Image understanding

Analysing and interpreting the content of images.

Category: vision

Multimodal understanding

Category: multimodal

Planning

Forming and executing action plans for complex tasks.

Category: planning

Reasoning

The model's ability to reason logically and solve complex problems.

Category: reasoning

Multi-step reasoning

Carrying out multi-step chains of reasoning across long, complex tasks.

Category: reasoning

Benchmark results

3 benchmarks

Generalization to unseen objects, backgrounds, environments (RT-1 comparison)

Success rate (%) · RT-2-PaLI-X evaluated on tasks involving previously unseen objects, backgrounds, and environments. RT-1 achieved 32% on the same tasks.

62%

📅 28 Jul 2023 · 📄 Brohan et al., arXiv:2307.15818 / Google DeepMind blog (July 2023)

Success rate improved from 32% (RT-1) to 62% (RT-2-PaLI-X-55B). The evaluation comprised about 6,000 trials in total.

Emergent skills evaluation (RT-2-PaLI-X-55B vs RT-1 and VC-1)

Relative improvement in success rate vs RT-1 · Evaluation of emergent capabilities (symbolic reasoning, object and person recognition, semantic reasoning), categories absent from the robotic training data. RT-2-PaLI-X-55B performs about 3× better than RT-1 and VC-1.

~3x

📅 28 Jul 2023 · 📄 robotics-transformer2.github.io / arXiv:2307.15818

Results from the RT-2 research project. This is not an externally standardized benchmark.

Language Table (simulation)

Success rate (%) · Open-source Language Table benchmark (simulation). Previous SOTA: LAVA 77%, RT-1 74%, BC-Z 72%.

90%

📅 28 Jul 2023 · 📄 Brohan et al., arXiv:2307.15818 / Google DeepMind blog (July 2023)

RT-2 (a smaller PaLI-3B variant) achieved 90% in simulation and demonstrated generalization to unseen objects in real-world testing.
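The headline numbers in the benchmark entries above can be compared with a few lines of arithmetic. The sketch uses only the success rates quoted on this page; the helper names `point_gain` and `relative_gain` are hypothetical.

```python
def point_gain(new_pct: float, old_pct: float) -> float:
    """Absolute improvement in percentage points."""
    return new_pct - old_pct


def relative_gain(new_pct: float, old_pct: float) -> float:
    """Ratio of the new success rate to the old one."""
    return new_pct / old_pct


# Generalization to unseen objects, backgrounds, environments: RT-1 32% vs RT-2 62%.
print(point_gain(62, 32), relative_gain(62, 32))

# Language Table (simulation): RT-2 at 90% against each earlier baseline.
baselines = {"LAVA": 77, "RT-1": 74, "BC-Z": 72}
margins = {name: point_gain(90, score) for name, score in baselines.items()}
print(margins)
```

On the generalization tasks this is a gain of 30 percentage points, a little under 2× RT-1. The roughly 3× figure refers to the separate emergent-skills evaluation, so the two ratios should not be conflated.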

Sources and related pages

5 sources

Paper: RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control (arxiv.org)
Web: RT-2: New model translates vision and language into action – Google DeepMind (deepmind.google)
Web: RT-2 Project Website (robotics-transformer2.github.io)
Web: What is RT-2? – Google Blog, July 2023 (blog.google)
Report: RT-2 paper PDF (robotics-transformer2.github.io/assets/rt2.pdf)