1) Build a reference motion dataset, typically human motion capture (AMASS, KIT, in-house MoCap).
2) Retarget human skeletons to the robot's morphology, resolving differences in segment lengths and joint ranges.
3) Train in a physics simulator (Isaac Gym, MuJoCo) with reinforcement learning, rewarding reference-pose tracking, balance and physical feasibility.
4) Deploy on the real robot, often with an additional lightweight sim-to-real compensation layer.
5) At run time, a high-level policy drives the WBA-FM by sending it goals (pose, trajectory, direction) rather than concrete torque values.
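The training step rewards reference-pose tracking, balance and physical feasibility. A minimal sketch of such a reward is shown here, assuming a weighted sum of three terms; the function name, weights, kernel widths and the simple center-of-mass balance proxy are all illustrative assumptions, not values from any published WBA-FM.

```python
import math


def tracking_reward(q, q_ref, com_xy, support_xy, torques, torque_limit,
                    w_track=0.6, w_balance=0.3, w_feasible=0.1,
                    sigma_pose=0.5, sigma_com=0.1):
    """Hypothetical per-step reward; every term lies in [0, 1]."""
    # Pose tracking: exponentiated negative squared joint-angle error (radians).
    pose_err = sum((a - b) ** 2 for a, b in zip(q, q_ref))
    r_track = math.exp(-pose_err / sigma_pose ** 2)

    # Balance proxy: horizontal distance between center of mass and support center.
    dist = math.hypot(com_xy[0] - support_xy[0], com_xy[1] - support_xy[1])
    r_balance = math.exp(-dist ** 2 / sigma_com ** 2)

    # Feasibility proxy: fraction of joints whose torque stays within the limit.
    r_feasible = sum(abs(t) <= torque_limit for t in torques) / len(torques)

    return w_track * r_track + w_balance * r_balance + w_feasible * r_feasible
```

With perfect tracking, a centered center of mass and all torques within limits, the reward reaches its maximum (the sum of the weights); any pose error lowers only the tracking term.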
Classical humanoid control requires hand-tuned controllers per body part (PD for the arm, MPC for the leg, ZMP for balance). Scaling to new tasks and new morphologies is expensive and whole-body coordination remains fragile. A WBA-FM replaces this stack with a single network that learns coordinated motion from data.
Processes joint sensor readings (positions, velocities, torques), IMU and end-effector state into a hidden representation of the robot's state.
Encodes the goal supplied by the high-level policy: reference pose, end-effector trajectory, walking direction, speed. Typically handles multiple goal representations (keyframes, task parameters).
Network core — typically a transformer or MLP — emitting coordinated commands for every joint. Output is torques or target positions (depending on the robot's control mode).
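When the network emits target positions rather than torques, a low-level loop on the robot typically converts them into joint torques. The sketch below assumes a plain per-joint PD law with torque clipping; the function name, gains and limit are hypothetical and would differ per actuator.

```python
def pd_torques(q_target, q, dq, kp, kd, torque_limit):
    """Convert target joint positions into clipped joint torques.

    q_target, q, dq: per-joint target position, measured position (rad)
    and measured velocity (rad/s). kp, kd, torque_limit: illustrative scalars.
    """
    torques = []
    for qt, qi, dqi in zip(q_target, q, dq):
        tau = kp * (qt - qi) - kd * dqi          # PD law toward the target
        tau = max(-torque_limit, min(torque_limit, tau))  # respect actuator limit
        torques.append(tau)
    return torques
```

In torque-control mode this stage is skipped and the network output is sent to the actuators directly.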
Value head used during reinforcement-learning training. Estimates the expected return from the current state and supplies the learning signal for policy optimization. Inactive at deployment.
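One common way a value head feeds policy optimization in PPO-style training is generalized advantage estimation (GAE). The sketch below assumes the head outputs a scalar state value per step; the function name and default discount factors are illustrative, and the source does not state which estimator is actually used.

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one rollout.

    rewards[t], values[t]: reward and value-head estimate at step t.
    last_value: value estimate for the state after the final step.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages
```

With `gamma=1`, `lam=1` and all value estimates at zero, the advantages reduce to the plain reward-to-go, which is a convenient sanity check.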
Humans and robots have different joint counts, segment lengths and ranges of motion. A naive MoCap projection leads to infeasible poses.
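A minimal sketch of the two simplest retargeting corrections follows: clamping mapped joint angles to the robot's range of motion, and rescaling a Cartesian target by the ratio of segment lengths. The joint names, limits and helper functions are hypothetical; practical retargeting usually solves an optimization or inverse-kinematics problem rather than clamping per joint.

```python
def retarget_pose(human_angles, joint_map, joint_limits):
    """Map human joint angles (rad) onto robot joints, clamped to robot limits.

    joint_map: robot joint name -> human joint name (unmapped human joints are dropped).
    joint_limits: robot joint name -> (lower, upper) in radians.
    """
    robot_angles = {}
    for robot_joint, human_joint in joint_map.items():
        lower, upper = joint_limits[robot_joint]
        robot_angles[robot_joint] = min(upper, max(lower, human_angles[human_joint]))
    return robot_angles


def scale_target(human_xyz, human_segment_len, robot_segment_len):
    """Rescale a Cartesian target by the robot-to-human segment-length ratio."""
    scale = robot_segment_len / human_segment_len
    return tuple(scale * c for c in human_xyz)


# Example: a human elbow flexion beyond the robot's range is clamped.
human = {"r_elbow": 2.6, "r_knee": 0.4}
joint_map = {"right_elbow": "r_elbow", "right_knee": "r_knee"}
limits = {"right_elbow": (0.0, 2.0), "right_knee": (0.0, 2.3)}
robot = retarget_pose(human, joint_map, limits)
```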
Simulation does not accurately reproduce friction, backlash and communication latency — leading to instability on the real robot.
A WBA-FM trained on one humanoid (e.g. Unitree H1) transfers poorly to another (e.g. Tesla Optimus), which limits cross-embodiment generalization.
Oussama Khatib formalizes operational-space whole-body control: hand-tuned controllers driving multiple prioritized objectives (balance + arm motion). The foundation of humanoid robotics for ~20 years.
Stanford (Fu et al., CoRL 2024) releases HST — a transformer trained with RL in simulation on large MoCap data, controlling a humanoid's whole body in real time. Breakthrough: a neural policy replaces the entire classical WBC stack.
A series of works (Stanford, CMU, NVIDIA) extends the paradigm: different whole-body controller variants trained on MoCap, jointly optimizing balance, manipulation and locomotion.
MindOn coins the name Whole-Body Action Foundation Model for the low-level model in its Mind-0 architecture, trained on tens of thousands of hours of MoCap. Claims sub-3 cm end-effector tracking accuracy and global motion coherence. The model serves as a universal execution interface for a heterogeneous fleet.
Time complexity: O(N · L · d²) per control step. Space complexity: O(N · d² + N · L · d) ≈ O(N · d²). Here N is the number of layers, L the context-window length and d the hidden width.
Because the context window is small and fixed (e.g. 8 steps), attention is NOT the bottleneck — unlike in LLMs. Dense matmuls in the projection and FFN layers (O(d²)) dominate. At deployment the critical factor is the latency of a single forward pass in a 50-200 Hz loop, not throughput: each step must fit within one control-cycle time budget.
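The claim that dense matmuls dominate attention can be checked with a rough FLOP count. The sketch below assumes a standard transformer block (four d×d attention projections plus a feed-forward layer with expansion factor 4) and counts one multiply-add as 2 FLOPs; the layer count and width are hypothetical, only the 8-step context comes from the text.

```python
def forward_flops(n_layers, context_len, d_model, ffn_mult=4):
    """Rough FLOP count for one forward pass, split into dense and attention parts."""
    # Dense: Q, K, V, O projections (4 * d^2) plus two FFN matmuls (2 * ffn_mult * d^2)
    # per token, times 2 FLOPs per multiply-add.
    dense_per_token = 2 * (4 + 2 * ffn_mult) * d_model ** 2
    dense = n_layers * context_len * dense_per_token
    # Attention: QK^T scores and the weighted sum over values, each L * L * d multiply-adds.
    attention = n_layers * 2 * 2 * context_len ** 2 * d_model
    return dense, attention


# Hypothetical small policy: 4 layers, width 256, 8-step context.
dense, attention = forward_flops(n_layers=4, context_len=8, d_model=256)
ratio = dense // attention  # equals 6 * d_model / context_len under these assumptions
```

Under these assumptions the ratio simplifies to 6·d/L, so attention only becomes comparable to the dense layers once the context length approaches several times the hidden width.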
Every forward pass activates the whole network and emits commands for all joints simultaneously. Control typically runs in a 50-200 Hz closed loop.
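A closed loop of this kind can be sketched as a fixed-rate loop that checks each cycle against its time budget. The callables `policy`, `read_state` and `send_command` are placeholders for the deployed network and the robot's I/O layer, and the overrun counter is an illustrative choice; a production controller would typically run under a real-time scheduler rather than `time.sleep`.

```python
import time


def run_control_loop(policy, read_state, send_command, hz=100, steps=5):
    """Run `steps` control cycles at `hz` and count cycles that exceed the budget."""
    period = 1.0 / hz                     # time budget per control cycle, seconds
    overruns = 0
    next_tick = time.perf_counter()
    for _ in range(steps):
        start = time.perf_counter()
        command = policy(read_state())    # one forward pass for all joints
        send_command(command)
        if time.perf_counter() - start > period:
            overruns += 1                 # forward pass did not fit the budget
        next_tick += period
        # Sleep until the next tick; skip sleeping if already late.
        time.sleep(max(0.0, next_tick - time.perf_counter()))
    return overruns
```

At 50 Hz the budget per cycle is 20 ms and at 200 Hz it is 5 ms, so the same network may be fast enough for one control rate and too slow for another.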
RL training (PPO) parallelizes massively across tens of thousands of simulated environments at once (Isaac Gym), hence across_devices. A single forward pass is fully parallel inside the network (across_tokens over the context window). Only the real-time control loop remains sequential.
RL training in simulation requires tens of thousands of parallel environments (Isaac Gym) — data-center class GPUs.
The deployed model is small (single MLP/Transformer) — runs on the robot's embedded compute (Jetson Orin, x86 mini-PC).