1) Collect data from multiple sources: teleoperation across different robots or human demonstrations (motion capture, egocentric cameras).
2) Convert to a shared action representation via a cross-embodiment pipeline (e.g. kinematics-independent action tokenization).
3) Train a high-level policy on the unified dataset (task planning, scene understanding).
4) A low-level controller translates intent into physical motion respecting each robot's dynamics.
5) Optionally: a lightweight sim-to-real compensation model corrects hardware-specific errors.
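The flow of steps 2 to 4 can be sketched as a small Python skeleton. Everything in it is illustrative: the names `SharedAction`, `high_level_policy`, `CONTROLLERS` and `act`, the two embodiment keys, and the end-effector-delta action format are assumptions made for this sketch, not an API of any system named on this page. The policy is a keyword stub standing in for a trained model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class SharedAction:
    """Embodiment-agnostic action (assumed format): end-effector delta in metres plus gripper state."""
    dx: float
    dy: float
    dz: float
    grip: float  # 0.0 = open, 1.0 = closed


def high_level_policy(instruction: str) -> SharedAction:
    """Stand-in for the learned high-level policy: returns a fixed action per keyword."""
    if "lift" in instruction:
        return SharedAction(0.0, 0.0, 0.05, 1.0)
    return SharedAction(0.0, 0.0, 0.0, 0.0)


def _clamp(value: float, limit: float) -> float:
    return max(-limit, min(limit, value))


def humanoid_controller(action: SharedAction) -> List[float]:
    """Hypothetical controller: limits each step to 2 cm for a slower platform."""
    max_step = 0.02
    return [
        _clamp(action.dx, max_step),
        _clamp(action.dy, max_step),
        _clamp(action.dz, max_step),
        action.grip,
    ]


def dual_arm_controller(action: SharedAction) -> List[float]:
    """Hypothetical controller: sends the same command to both arms."""
    one_arm = [action.dx, action.dy, action.dz, action.grip]
    return one_arm + one_arm


# One shared policy, one low-level controller per embodiment.
CONTROLLERS: Dict[str, Callable[[SharedAction], List[float]]] = {
    "humanoid": humanoid_controller,
    "dual_arm": dual_arm_controller,
}


def act(instruction: str, embodiment: str) -> List[float]:
    """Run the shared policy, then dispatch to the embodiment-specific controller."""
    action = high_level_policy(instruction)
    return CONTROLLERS[embodiment](action)
```

Calling `act("lift the cup", "humanoid")` and `act("lift the cup", "dual_arm")` uses the same policy output but yields different command vectors. Data collection (step 1) and sim-to-real compensation (step 5) are left out of the sketch.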
Classical robot policy learning required collecting a separate dataset and training a separate model for every robot body. As a result, data and models could not be pooled across platforms, and robotics could not scale the way LLMs do. Cross-Embodiment Learning addresses this by decoupling intelligence from embodiment and letting a single model drive many platforms.
Cross-embodiment pipeline: a layer that converts observations and actions from different robots (or human demonstrations) into a shared representation. It may take the form of proprioceptive normalization, a canonical state representation, or action tokenization.
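As one concrete instance of normalization followed by action tokenization, the sketch below rescales each action dimension by the robot's own limits and then bins it uniformly. The 256-bin vocabulary, the function names and the per-dimension limits are assumptions chosen for illustration. This is a simple uniform-binning scheme, not the tokenizer of any particular system.

```python
from typing import List, Sequence, Tuple

N_BINS = 256  # assumed vocabulary size per action dimension

Limits = Sequence[Tuple[float, float]]  # (low, high) per action dimension


def tokenize(action: Sequence[float], limits: Limits, n_bins: int = N_BINS) -> List[int]:
    """Map each dimension to a bin index based on its position within the robot's range."""
    tokens = []
    for value, (low, high) in zip(action, limits):
        clipped = min(max(value, low), high)
        unit = (clipped - low) / (high - low)  # fraction of the range, in [0, 1]
        tokens.append(min(int(unit * n_bins), n_bins - 1))
    return tokens


def detokenize(tokens: Sequence[int], limits: Limits, n_bins: int = N_BINS) -> List[float]:
    """Map bin indices back to bin-centre values in a given robot's own units."""
    values = []
    for token, (low, high) in zip(tokens, limits):
        unit = (token + 0.5) / n_bins
        values.append(low + unit * (high - low))
    return values
```

With this scheme, a command at 75% of one robot's range and a command at 75% of another robot's range produce the same token, so the policy sees one vocabulary regardless of the robot's units or scale.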
High-level policy: an AI model (typically a vision-language-action (VLA) model or another robotics foundation model) that produces task-level behavior: what to do, in what order, and where to direct attention. It operates on embodiment-agnostic actions.
Low-level controller: the embodiment-specific component. It translates abstract intent into concrete motor commands, torques, trajectories and control signals that respect the specific robot's dynamics and constraints.
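A minimal sketch of what "respecting the robot's dynamics and constraints" can mean at the joint level: a PD controller that clamps the target to the joint's range and saturates the output torque. The `JointSpec` fields and the gain values in the usage note are hypothetical. Real controllers are considerably more involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JointSpec:
    """Per-robot joint parameters (hypothetical fields for this sketch)."""
    kp: float          # proportional gain
    kd: float          # damping gain
    max_torque: float  # actuator torque limit
    q_min: float       # lower joint limit (rad)
    q_max: float       # upper joint limit (rad)


def pd_torque(spec: JointSpec, q_target: float, q: float, dq: float) -> float:
    """Torque command for one joint: PD law with joint-limit and torque-limit clamping."""
    q_ref = min(max(q_target, spec.q_min), spec.q_max)
    torque = spec.kp * (q_ref - q) - spec.kd * dq
    return min(max(torque, -spec.max_torque), spec.max_torque)
```

The same target angle yields different torques on different robots. For example, with `JointSpec(kp=20.0, kd=1.0, max_torque=5.0, q_min=-1.0, q_max=1.0)` a target of 2.0 rad is first clamped to 1.0 rad and the resulting torque saturates at 5.0.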
Sim-to-real compensation: an optional, lightweight layer that corrects tracking errors and dynamics mismatch between simulation and real hardware. It is trained on a small dataset from real deployments.
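To make the idea of a lightweight correction layer concrete, the sketch below fits an affine model (gain and bias) from a handful of commanded-versus-achieved values and inverts it to pre-correct new commands. The affine form and the function names are assumptions for illustration. The page does not specify what model such a layer actually uses.

```python
from typing import Sequence, Tuple


def fit_affine(commanded: Sequence[float], achieved: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of achieved = gain * commanded + bias from real-robot logs."""
    n = len(commanded)
    mean_c = sum(commanded) / n
    mean_a = sum(achieved) / n
    cov = sum((c - mean_c) * (a - mean_a) for c, a in zip(commanded, achieved))
    var = sum((c - mean_c) ** 2 for c in commanded)
    gain = cov / var
    bias = mean_a - gain * mean_c
    return gain, bias


def compensate(target: float, gain: float, bias: float) -> float:
    """Command expected to produce `target` on the real hardware, per the fitted model."""
    return (target - bias) / gain
```

If the hardware consistently undershoots (for instance achieving `0.9 * command + 0.1`), `compensate` returns a slightly adjusted command so that the achieved value lands on the target.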
Actions that one robot can execute may be unreachable for another because of differences in reach and degrees of freedom. Direct imitation then leads to execution errors.
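A simple way to see the reach problem is to model each robot's workspace as a sphere around its arm base and test whether a demonstrated target falls inside it. The spherical workspace and the function names are simplifying assumptions. Real workspaces depend on the full kinematic chain.

```python
import math
from typing import Tuple

Point = Tuple[float, float, float]


def is_reachable(target: Point, base: Point, reach: float) -> bool:
    """True if the target lies within the assumed spherical workspace."""
    return math.dist(target, base) <= reach


def project_to_workspace(target: Point, base: Point, reach: float) -> Point:
    """Return the target unchanged if reachable, else the closest point on the workspace boundary."""
    distance = math.dist(target, base)
    if distance <= reach:
        return target
    scale = reach / distance
    return tuple(b + (t - b) * scale for t, b in zip(target, base))
```

A target 1.0 m from the base is reachable for an arm with 1.2 m reach but not for one with 0.5 m reach. Projecting it onto the smaller workspace is one crude form of retargeting.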
Human demonstration data is effectively delay-free, whereas robot data includes real latency. Direct imitation then leads to desynchronization.
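One simple way to account for this mismatch, sketched under the assumption that the robot's latency is a known, fixed number of control steps, is to pair each observation with the motion that should occur once the command has taken effect. The function name and the fixed-latency model are illustrative, not a method described on this page.

```python
from typing import List, Sequence, Tuple, TypeVar

O = TypeVar("O")
M = TypeVar("M")


def align_for_latency(
    observations: Sequence[O], motions: Sequence[M], latency_steps: int
) -> List[Tuple[O, M]]:
    """Pair observation t with the demonstrated motion at t + latency_steps."""
    if latency_steps < 0:
        raise ValueError("latency_steps must be non-negative")
    length = min(len(observations), len(motions))
    usable = max(length - latency_steps, 0)
    return [(observations[t], motions[t + latency_steps]) for t in range(usable)]
```

With a latency of two steps, the training label for the first observation is the third demonstrated motion, and the last two observations are dropped because no matching future motion exists.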
A policy trained in simulation often fails on the real robot due to dynamics mismatch, friction and latency.
Google Robotics releases RT-1, one of the first large transformer policies, trained on data collected by a fleet of 13 robots. It shows that robot policies can be scaled like LLMs.
A consortium of 34 research labs publishes the Open X-Embodiment dataset, with 1M+ trajectories from 22 robot types. RT-X demonstrates positive cross-embodiment skill transfer.
Physical Intelligence (PI) releases pi-0, a generalist VLA trained on cross-embodiment data from 8 platforms.
MindOn shows that a cross-embodiment policy can be trained purely from human-centric data (whole-body motion capture, egocentric cameras), without robot teleoperation. Demo: one model simultaneously driving a Unitree G1 humanoid and a stationary dual-arm rig.
The high-level policy is a dense model, whereas the low-level controller is selected conditionally, depending on the embodiment. The full architecture resembles a mixture only in the sense that its embodiment-specific parts are spread across different bodies.
Training the high-level policy on large-scale motion-capture and video datasets requires data-center-class GPUs.
The paradigm itself is agnostic to robot hardware — it runs on humanoids, dual-arm rigs and mobile manipulators.