1) 21 partner institutions plus Google DeepMind contribute 60 existing datasets from 22 robot platforms (UR5, FR3, xArm, Sawyer, Aloha arms, Google Robot, Stretch and others).
2) All trajectories are converted to a common RLDS (Reinforcement Learning Datasets) format with a unified action representation: a 7-dimensional vector (x, y, z, roll, pitch, yaw + gripper).
3) Dimensions not exercised by a given robot are set to zero during training, which allows mixing data from 4-DoF and 7-DoF platforms.
4) RT-1-X (a Transformer for control) and RT-2-X (a 55B VLM co-fine-tuned to emit actions as text tokens) are trained on the full mixture.
5) In-distribution skill evaluation across 6 academic labs shows RT-1-X outperforms the original methods by ~50% in the small-data regime, and RT-2-X displays emergent skills.
Historically, each robotics lab collected its own dataset for its own robot, task and environment. The result was fragmentation: little transfer between labs and no consolidated corpus of the kind NLP and CV rely on. OXE pools these datasets into a single corpus with a unified format and shows that one model trained on this mixture can outperform specialist models trained only on data from a single robot.
A consolidated corpus of 1M+ real-robot trajectories from 22 platforms, 527 skills, 160,266 tasks. Built by merging 60 datasets from 34 labs. Hosted as RLDS on Google Cloud and HuggingFace.
Action representation as a 7-dimensional vector (x, y, z, roll, pitch, yaw + gripper) expressed in the robot's gripper frame. Unsupported dimensions for a given platform are zero-padded during training, enabling mixing of data across morphologies.
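The zero-padding step can be illustrated with a minimal sketch. It assumes each robot's native action arrives as a name-to-value mapping; the function name `to_unified_action` and the example 4-DoF arm are hypothetical and are not taken from the OXE codebase.

```python
from typing import Dict, List, Tuple

# Ordering of the unified 7-D action described above.
ACTION_DIMS: Tuple[str, ...] = ("x", "y", "z", "roll", "pitch", "yaw", "gripper")


def to_unified_action(native: Dict[str, float]) -> List[float]:
    """Map a native action onto the 7-D layout, zero-padding missing axes."""
    unknown = set(native) - set(ACTION_DIMS)
    if unknown:
        raise ValueError(f"unknown action dimensions: {sorted(unknown)}")
    # Axes the robot does not support are filled with 0.0.
    return [float(native.get(name, 0.0)) for name in ACTION_DIMS]


# Hypothetical 4-DoF arm: translation plus gripper, no wrist rotation.
four_dof = {"x": 0.02, "y": -0.01, "z": 0.0, "gripper": 1.0}
print(to_unified_action(four_dof))
# [0.02, -0.01, 0.0, 0.0, 0.0, 0.0, 1.0]
```

Keeping the dimension order in a single constant means every dataset converter pads against the same layout.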
Reference models trained on the full OXE mixture: RT-1-X (an efficient Transformer for robot control) and RT-2-X (a 55B VLM co-fine-tuned to emit actions as text tokens). They exhibit positive transfer and emergent skills.
A consortium of 21 partner institutions + Google DeepMind, coordinating dataset contributions, format consistency, evaluation, and repository maintenance. An Open Dataset Enrollment Form lets new datasets be added.
The 60 original datasets have different action representations, camera setups, calibrations and directory structures. Consolidation into RLDS required months of work across the consortium.
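To give a feel for what such a conversion involves, here is a rough sketch that wraps one lab-specific trajectory in an RLDS-style episode (a sequence of steps carrying `observation`, `action`, `reward`, `is_first`, `is_last` and `is_terminal`). The input keys `rgb` and `cmd`, the observation layout, and the sparse end-of-episode reward are assumptions made for illustration. Real converters also have to handle camera calibration, action conventions and on-disk serialization.

```python
from typing import Any, Dict, List


def to_rlds_episode(frames: List[Dict[str, Any]], instruction: str) -> Dict[str, Any]:
    """Wrap a lab-specific list of frames in an RLDS-style episode dict."""
    if not frames:
        raise ValueError("an episode needs at least one frame")
    last = len(frames) - 1
    steps = []
    for i, frame in enumerate(frames):
        steps.append({
            # Hypothetical observation layout: one camera image plus the task text.
            "observation": {"image": frame["rgb"], "instruction": instruction},
            "action": frame["cmd"],
            # Assumption: sparse reward, granted only on the final step.
            "reward": 1.0 if i == last else 0.0,
            "is_first": i == 0,
            "is_last": i == last,
            # Assumption: episodes end in a terminal state, not a truncation.
            "is_terminal": i == last,
        })
    return {"steps": steps}


# Two-frame toy trajectory with placeholder image payloads.
demo = [{"rgb": b"frame0", "cmd": [0.0] * 7}, {"rgb": b"frame1", "cmd": [0.0] * 7}]
episode = to_rlds_episode(demo, "pick up the block")
print(len(episode["steps"]), episode["steps"][0]["is_first"], episode["steps"][-1]["is_last"])
# 2 True True
```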
Some platforms (e.g. Google Robot, Bridge) dominate trajectory counts. Naive mixing leads to overfitting to the most common embodiments.
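One generic way to soften this imbalance is to sample each dataset in proportion to its trajectory count raised to a power below one. The sketch below illustrates that idea only. It is not the weighting used for RT-X, and the dataset names and counts are made up.

```python
from typing import Dict


def mixture_weights(counts: Dict[str, int], alpha: float = 0.5) -> Dict[str, float]:
    """Sampling weights proportional to count ** alpha.

    alpha = 1.0 reproduces naive size-proportional mixing;
    alpha = 0.0 samples every dataset equally often.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if any(n <= 0 for n in counts.values()):
        raise ValueError("every dataset needs a positive trajectory count")
    scaled = {name: n ** alpha for name, n in counts.items()}
    total = sum(scaled.values())
    return {name: s / total for name, s in scaled.items()}


# Hypothetical two-dataset mixture with a 9:1 size imbalance.
counts = {"large_dataset": 90_000, "small_dataset": 10_000}
print(mixture_weights(counts, alpha=1.0))  # naive mixing: about 0.9 vs 0.1
print(mixture_weights(counts, alpha=0.5))  # flattened: about 0.75 vs 0.25
```

Lower values of `alpha` give small datasets more exposure, at the cost of repeating their trajectories more often.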
Zero-padding of unsupported dimensions (e.g. no pitch/yaw on certain robots) can introduce a systematic bias: a padded zero is indistinguishable from a genuine zero command, so the model may learn to treat it as a real action value.
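A common mitigation in multi-embodiment training, shown here purely as an illustration and not as something the OXE authors describe, is to carry a per-dimension validity mask alongside each action and exclude padded dimensions from the loss. The function name `masked_mse` and the example values are hypothetical.

```python
from typing import Sequence


def masked_mse(pred: Sequence[float], target: Sequence[float],
               valid: Sequence[bool]) -> float:
    """Mean squared error over the action dimensions the robot actually has."""
    if not (len(pred) == len(target) == len(valid)):
        raise ValueError("pred, target and valid must have the same length")
    errors = [(p - t) ** 2 for p, t, v in zip(pred, target, valid) if v]
    if not errors:
        raise ValueError("at least one dimension must be valid")
    return sum(errors) / len(errors)


# Hypothetical 4-DoF robot: roll, pitch and yaw are padding, not commands.
valid = [True, True, True, False, False, False, True]
target = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
pred = [0.1, 0.0, 0.0, 0.3, 0.3, 0.3, 1.0]

# Without a mask, the padded zeros would penalise the rotation predictions.
print(masked_mse(pred, target, valid))
print(masked_mse(pred, target, [True] * 7))
```

With the mask, only the x error contributes, averaged over the four valid dimensions. Without it, the three padded rotation axes dominate the loss.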
Google Robotics releases RT-1, the first large Transformer for robot control, trained on data from 13 Google robots. It suggests that the LLM recipe of scaling data and model capacity can also apply to robotics.
A 34-lab consortium publishes arXiv 2310.08864 and the repository github.com/google-deepmind/open_x_embodiment. 60 prior datasets consolidated into the unified RLDS format. Dataset and code released under an open license.
The Open X-Embodiment paper receives the Best Paper Award at ICRA 2024 — one of the highest honors in robotics. Cements OXE as an industry-wide reference point.
Subsequent open VLA models — Octo (Berkeley/Stanford/CMU), OpenVLA (Stanford), pi-0 (Physical Intelligence) — use OXE as their primary or supplementary training corpus. OXE effectively becomes the ImageNet of robotics.
Training RT-X-class models on 1M+ trajectories requires data-center accelerator clusters (TPU v4 pods or H100 GPUs). For the 55B-parameter RT-2-X this means hundreds of accelerators.
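A back-of-the-envelope estimate shows why a 55B-parameter model cannot be trained on a handful of devices. The sketch assumes roughly 16 bytes of training state per parameter (a common rule of thumb for mixed-precision Adam) and nominal per-chip memory of 32 GB for TPU v4 and 80 GB for H100. These figures are assumptions, not numbers from the OXE paper. The result covers weights and optimizer state only, so it is a lower bound: activations, batch size and throughput targets push real deployments far higher.

```python
PARAMS = 55_000_000_000   # RT-2-X parameter count
BYTES_PER_PARAM = 16      # assumption: weights + gradients + Adam state


def min_devices(hbm_bytes: int) -> int:
    """Smallest device count whose combined memory holds the training state."""
    state_bytes = PARAMS * BYTES_PER_PARAM
    return -(-state_bytes // hbm_bytes)  # integer ceiling division


print(PARAMS * BYTES_PER_PARAM // 10**9)  # 880 (GB of training state)
print(min_devices(32 * 10**9))            # 28 (TPU v4, assuming 32 GB per chip)
print(min_devices(80 * 10**9))            # 11 (H100, assuming 80 GB per chip)
```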
Google DeepMind trained RT-X on TPUs. The OXE codebase and workflow are tuned for TPU/JAX.