**RT-2 (Robotics Transformer 2)** is a breakthrough Vision-Language-Action (VLA) model announced by Google DeepMind in July 2023 (paper 'RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control', Brohan et al., arXiv:2307.15818). It is the successor to RT-1, but where RT-1 was a specialized 35M-parameter transformer trained exclusively on robotic data, RT-2 **co-trains** a large pretrained Vision-Language Model (PaLI-X 55B or PaLM-E 12B) on a **mixed dataset** of web data (~10B image-text pairs) and the robotic data from RT-1 (~130k episodes).
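The mixed-dataset idea can be illustrated with a minimal sketch in which each training batch draws examples from both sources. The function name `mixed_batches`, the `robot_fraction` parameter, and the toy (prompt, target) text pairs are assumptions made for illustration; images are omitted, and the mixing ratio RT-2 actually used is not stated here.

```python
import random
from typing import Iterator, List, Sequence, Tuple

Example = Tuple[str, str]  # (prompt, target) as text; images omitted in this sketch


def mixed_batches(
    web: Sequence[Example],
    robot: Sequence[Example],
    robot_fraction: float,  # hypothetical knob: share of robot examples per batch
    batch_size: int,
    seed: int = 0,
) -> Iterator[List[Example]]:
    """Yield batches whose examples are sampled from web or robot data."""
    if not 0.0 <= robot_fraction <= 1.0:
        raise ValueError("robot_fraction must lie in [0, 1]")
    rng = random.Random(seed)
    while True:
        batch = []
        for _ in range(batch_size):
            # Pick the source first, then a random example from that source.
            source = robot if rng.random() < robot_fraction else web
            batch.append(rng.choice(source))
        yield batch


if __name__ == "__main__":
    web_data = [("What is in the image?", "a banana on a table")]
    robot_data = [("What action should the robot take to pick the banana?",
                   "128 0 255 160 64 255 128")]
    print(next(mixed_batches(web_data, robot_data, robot_fraction=0.5, batch_size=4)))
```

Because both sources share the same (prompt, target) text format, the same model and loss can be applied to every example in the batch.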
The key innovation: **actions as tokens**. RT-2 discretizes the robot's action space (6-DOF end-effector translation/rotation + gripper) into 256 bins per dimension and represents each bin as a **token in the VLM's vocabulary** (PaLI-X reuses its existing integer tokens, while PaLM-E overwrites its 256 least frequently used tokens). The model can therefore generate its 'answer' uniformly as a sequence of tokens, whether that answer is text (VQA) or a robot action. This approach enables **emergent capabilities**: RT-2 can perform semantic reasoning tasks that are absent from the robot training data ('Move banana to the sum of two plus one' → compute '3' as the answer, locate the '3' in the scene, and move the banana to it).
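The discretization step can be sketched as follows. This is a minimal illustration, not RT-2's implementation: the per-dimension bounds in `ACTION_LOW` / `ACTION_HIGH`, the function names, and the space-separated integer string are assumptions chosen for the example, and the mapping from bin indices to actual tokenizer IDs is left out.

```python
from typing import List, Sequence

NUM_BINS = 256  # bins per action dimension, as described in the text

# Hypothetical bounds: 3 translation dims, 3 rotation dims, 1 gripper dim.
ACTION_LOW = [-1.0] * 6 + [0.0]
ACTION_HIGH = [1.0] * 7


def discretize(action: Sequence[float],
               low: Sequence[float] = ACTION_LOW,
               high: Sequence[float] = ACTION_HIGH,
               num_bins: int = NUM_BINS) -> List[int]:
    """Map each continuous dimension to an integer bin in [0, num_bins - 1]."""
    bins = []
    for value, lo, hi in zip(action, low, high):
        clipped = min(max(value, lo), hi)          # clamp out-of-range values
        fraction = (clipped - lo) / (hi - lo)      # normalize to [0, 1]
        bins.append(min(int(fraction * num_bins), num_bins - 1))
    return bins


def undiscretize(bins: Sequence[int],
                 low: Sequence[float] = ACTION_LOW,
                 high: Sequence[float] = ACTION_HIGH,
                 num_bins: int = NUM_BINS) -> List[float]:
    """Map each bin index back to the center of its interval."""
    return [lo + (b + 0.5) * (hi - lo) / num_bins
            for b, lo, hi in zip(bins, low, high)]


def to_text(bins: Sequence[int]) -> str:
    """Render bins as a string the language model could emit."""
    return " ".join(str(b) for b in bins)


def from_text(text: str) -> List[int]:
    """Parse a generated string back into bin indices."""
    return [int(tok) for tok in text.split()]


if __name__ == "__main__":
    action = [0.0, -1.0, 1.0, 0.25, -0.5, 2.0, 0.5]
    encoded = to_text(discretize(action))
    print(encoded)
    print(undiscretize(from_text(encoded)))
```

With 256 bins, the round trip loses at most half a bin width per dimension (about 0.004 for a range of [-1, 1]), which is the precision cost of expressing actions in the same output space as text.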
Results: RT-2 achieves **62% success in generalization scenarios** (novel objects, novel instructions, novel backgrounds) vs. 32% for RT-1. Experiments were run on Everyday Robots mobile manipulators (an Alphabet project, discontinued as a standalone unit in 2023) and Franka. RT-2 **is not open source**: Google DeepMind only released a checkpoint of a small replication version, with no full PaLI-X/PaLM-E weights. Successors: **RT-X / Open X-Embodiment** (October 2023, cross-embodiment generalization) and **Gemini Robotics** (March 2025, Apptronik Apollo integration).
RT-2 launched the era of 'foundation models for robotics' and influenced an entire generation of subsequent models: OpenVLA (Stanford/Berkeley, open-source), π0 (Physical Intelligence), Octo (Berkeley), CogACT. Many VLAs from 2024-2026 inherit the 'actions as tokens' architecture from RT-2, although some, such as π0, Octo, and CogACT, generate continuous actions with flow-matching or diffusion heads instead of discrete tokens.