Architecture

WBA-FM (Whole-Body Action Foundation Model)

2024 · Active · Published: 20 June 2026 · Updated: 20 June 2026

Key innovation

A single low-level neural model drives all joints of a humanoid simultaneously, coordinating locomotion, balance and manipulation as one whole-body motion, instead of splitting them across separate, hand-tuned controllers.

How it works

1) Build a reference motion dataset, typically from human motion capture (AMASS, KIT, in-house MoCap).
2) Retarget human skeletons to the robot's morphology, resolving differences in segment lengths and joint ranges.
3) Train in a physics simulator (Isaac Gym, MuJoCo) with reinforcement learning, rewarding reference-pose tracking, balance and physical feasibility.
4) Deploy on the real robot, often with an additional lightweight sim-to-real compensation layer.
5) At run time, a high-level policy drives the WBA-FM by sending it goals (pose, trajectory, direction) rather than concrete torque values.
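The reward used in the training step can be sketched as follows. This is a minimal illustration, not the reward from HumanPlus or any other cited work: the function name `tracking_reward`, the exponential shaping, and all default constants are assumptions chosen for readability.

```python
import math

def tracking_reward(q, q_ref, com_xy, support_xy,
                    pose_sigma=0.5, balance_sigma=0.05, balance_weight=0.2):
    """Illustrative reward: reference-pose tracking plus a crude balance term.

    q, q_ref     : robot and reference joint angles (rad), same length
    com_xy       : horizontal centre-of-mass position (m)
    support_xy   : horizontal centre of the support polygon (m)
    All constants are hypothetical defaults, not values from any paper.
    """
    if len(q) != len(q_ref) or not q:
        raise ValueError("q and q_ref must be non-empty and of equal length")
    # Mean squared joint-angle error against the retargeted reference pose.
    pose_err = sum((a - b) ** 2 for a, b in zip(q, q_ref)) / len(q)
    pose_term = math.exp(-pose_err / pose_sigma ** 2)
    # Penalise the centre of mass drifting away from the support centre.
    com_dist = math.dist(com_xy, support_xy)
    balance_term = math.exp(-(com_dist ** 2) / balance_sigma ** 2)
    return pose_term + balance_weight * balance_term

perfect = tracking_reward([0.1, 0.2], [0.1, 0.2], (0.0, 0.0), (0.0, 0.0))
off_pose = tracking_reward([0.5, 0.2], [0.1, 0.2], (0.0, 0.0), (0.0, 0.0))
```

Both terms are bounded, so a pose that matches the reference while the robot is falling still scores lower than one that tracks and stays balanced.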

Problem solved

Classical humanoid control requires hand-tuned controllers per body part (PD for the arm, MPC for the leg, ZMP for balance). Scaling this stack to new tasks and new morphologies is expensive, and whole-body coordination remains fragile. A WBA-FM replaces it with a single network that learns coordinated motion from data.

Components

Proprioceptive observation encoder · Robot body state representation

Processes joint sensor readings (positions, velocities, torques), IMU and end-effector state into a hidden representation of the robot's state.

Official
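As a rough sketch of what such an encoder does, the snippet below flattens the sensor channels into one vector and passes it through a single dense layer. The `Proprioception` fields, the scaling constants and the one-layer `encode` function are all assumptions for illustration; a real encoder would be a trained network in a framework such as PyTorch.

```python
import math
import random
from dataclasses import dataclass
from typing import List

@dataclass
class Proprioception:
    joint_pos: List[float]     # rad
    joint_vel: List[float]     # rad/s
    joint_torque: List[float]  # N*m
    imu: List[float]           # e.g. roll, pitch, yaw
    ee_state: List[float]      # flattened end-effector position

def flatten(p: Proprioception) -> List[float]:
    # Hypothetical fixed scales keep all channels in a similar numeric range.
    return (
        list(p.joint_pos)
        + [0.1 * v for v in p.joint_vel]
        + [0.02 * t for t in p.joint_torque]
        + list(p.imu)
        + list(p.ee_state)
    )

def encode(obs: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    # One dense layer with tanh: hidden[i] = tanh(sum_j W[i][j] * obs[j] + b[i]).
    return [
        math.tanh(sum(w * x for w, x in zip(row, obs)) + b)
        for row, b in zip(weights, bias)
    ]

rng = random.Random(0)  # untrained, random weights; for shape illustration only
state = Proprioception([0.1, -0.2], [1.0, 0.5], [5.0, -3.0],
                       [0.0, 0.05, 0.0], [0.3, 0.0, 0.9])
obs = flatten(state)
hidden_dim = 4
W = [[rng.uniform(-0.1, 0.1) for _ in obs] for _ in range(hidden_dim)]
hidden = encode(obs, W, [0.0] * hidden_dim)
```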

Goal encoder · High-level intent intake

Encodes the goal supplied by the high-level policy: reference pose, end-effector trajectory, walking direction, speed. Typically handles multiple goal representations (keyframes, task parameters).

Official
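One simple way to accept several goal representations through a single input is a fixed-width vector made of a type tag, a zero-padded payload and a validity mask. The sketch below assumes this scheme; the goal kinds in `GOAL_TYPES` and the width `GOAL_DIM` are hypothetical and not taken from any cited system.

```python
from typing import List, Sequence

GOAL_TYPES = ("pose", "ee_trajectory", "velocity")  # hypothetical goal kinds
GOAL_DIM = 8                                        # hypothetical payload width

def encode_goal(kind: str, values: Sequence[float]) -> List[float]:
    """Pack a goal into [one-hot type | zero-padded payload | validity mask]."""
    if kind not in GOAL_TYPES:
        raise ValueError(f"unknown goal kind: {kind!r}")
    if len(values) > GOAL_DIM:
        raise ValueError(f"goal payload longer than {GOAL_DIM} values")
    pad = GOAL_DIM - len(values)
    one_hot = [1.0 if kind == k else 0.0 for k in GOAL_TYPES]
    payload = [float(v) for v in values] + [0.0] * pad
    # The mask lets the network tell a real 0.0 apart from padding.
    mask = [1.0] * len(values) + [0.0] * pad
    return one_hot + payload + mask

# Example: walk forward at 0.5 m/s with a small yaw rate.
goal = encode_goal("velocity", [0.5, 0.0, 0.1])
```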

Motor policy · Whole-body motion command generation

The network core, typically a transformer or MLP, emits coordinated commands for every joint. The output is either joint torques or target positions, depending on the robot's control mode.
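When the policy outputs target positions, a joint-level PD loop usually turns them into torques. The sketch below shows the standard PD law with torque saturation; the function name `pd_torques` and the gain values in the example are illustrative assumptions, not parameters of any specific robot.

```python
from typing import List, Sequence

def pd_torques(q_target: Sequence[float], q: Sequence[float], qd: Sequence[float],
               kp: Sequence[float], kd: Sequence[float],
               torque_limit: Sequence[float]) -> List[float]:
    """Per-joint PD law: tau = kp * (q_target - q) - kd * qd, clipped to limits."""
    torques = []
    for qt, qi, qdi, kpi, kdi, lim in zip(q_target, q, qd, kp, kd, torque_limit):
        tau = kpi * (qt - qi) - kdi * qdi
        # Saturate so the command never exceeds what the actuator can deliver.
        torques.append(max(-lim, min(lim, tau)))
    return torques

# One joint, 0.5 rad from its target and moving at 1 rad/s.
tau_small = pd_torques([0.5], [0.0], [1.0], [40.0], [2.0], [30.0])
# A larger error hits the 30 N*m limit.
tau_clipped = pd_torques([1.0], [0.0], [1.0], [40.0], [2.0], [30.0])
```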

Critic (training only) · Optimization signal during training

Value head used during reinforcement-learning training. It estimates the expected return from the current state and action and drives policy optimization. Inactive at deployment.

Official
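To make the critic's role concrete, the sketch below computes discounted returns from a reward sequence and subtracts the critic's predictions to obtain advantages. This is a generic actor-critic illustration under assumed names (`discounted_returns`, `advantages`); the source does not say which RL algorithm or advantage estimator is used.

```python
from typing import List, Sequence

def discounted_returns(rewards: Sequence[float], gamma: float = 0.99,
                       bootstrap: float = 0.0) -> List[float]:
    """Return G_t = r_t + gamma * G_{t+1}, seeded with a bootstrap value."""
    returns = []
    running = bootstrap
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns

def advantages(rewards: Sequence[float], values: Sequence[float],
               gamma: float = 0.99) -> List[float]:
    """Advantage = observed return minus the critic's prediction."""
    return [g - v for g, v in zip(discounted_returns(rewards, gamma), values)]

returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
adv = advantages([1.0, 1.0, 1.0], [1.0, 1.0, 1.0], gamma=0.5)
```

A positive advantage means the action sequence did better than the critic expected, so the policy update makes it more likely.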

Implementation

Reference implementations

HumanPlus (Humanoid Shadowing Transformer)

Python (PyTorch) · Stanford University (Zipeng Fu et al.)

Official

HumanPlus project page

Stanford University

Official

Implementation pitfalls

Human-to-robot retargeting · High

Humans and robots have different joint counts, segment lengths and ranges of motion. A naive MoCap projection leads to infeasible poses.

Fix: Inverse kinematics with constraints, per-frame retargeting with feasibility filtering, fine-tuning on robot-specific trajectories.
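The feasibility-filtering idea can be sketched without a full IK solver. The example below is a deliberately simpler stand-in for constrained inverse kinematics: it copies human joint angles onto mapped robot joints, clamps them to joint limits, and drops frames that needed too much clamping. The joint names, limits and the `max_clamp` threshold are hypothetical.

```python
from typing import Dict, List, Tuple

def retarget_frame(human_angles: Dict[str, float],
                   joint_map: Dict[str, str],
                   limits: Dict[str, Tuple[float, float]]):
    """Map human joint angles to robot joints and clamp them to joint limits.

    Returns the robot pose and the total amount of clamping (rad) applied.
    """
    robot_angles, clamp_total = {}, 0.0
    for robot_joint, human_joint in joint_map.items():
        lo, hi = limits[robot_joint]
        raw = human_angles[human_joint]
        clamped = max(lo, min(hi, raw))
        clamp_total += abs(raw - clamped)
        robot_angles[robot_joint] = clamped
    return robot_angles, clamp_total

def retarget_clip(frames: List[Dict[str, float]], joint_map, limits,
                  max_clamp: float = 0.3) -> List[Dict[str, float]]:
    """Keep only frames whose pose survived retargeting mostly unchanged."""
    kept = []
    for frame in frames:
        pose, clamp_total = retarget_frame(frame, joint_map, limits)
        if clamp_total <= max_clamp:  # heavily clamped frames are treated as infeasible
            kept.append(pose)
    return kept

joint_map = {"r_elbow": "right_elbow"}        # robot joint -> human joint
limits = {"r_elbow": (0.0, 2.0)}              # rad, illustrative only
frames = [{"right_elbow": 1.0}, {"right_elbow": 2.1}, {"right_elbow": 3.0}]
kept = retarget_clip(frames, joint_map, limits)
```

The third frame is discarded because reaching it would require 1.0 rad beyond the robot's elbow limit.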

Sim-to-real gap in control · High

Simulation does not accurately reproduce friction, backlash and communication latency, which leads to instability on the real robot.

Fix: Domain randomization during training, a lightweight compensation model trained on deployment data, system identification.
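Domain randomization amounts to resampling the simulator's physical parameters at the start of each training episode. The sketch below assumes a hypothetical `SimParams` container and made-up ranges; how the values are applied depends on the simulator and is not shown.

```python
import random
from dataclasses import dataclass

@dataclass
class SimParams:
    friction: float        # ground friction coefficient
    motor_strength: float  # multiplier on commanded torque
    latency_steps: int     # control delay in simulation steps
    backlash_rad: float    # joint dead zone

def sample_params(rng: random.Random) -> SimParams:
    """Draw one set of physics parameters; all ranges are illustrative assumptions."""
    return SimParams(
        friction=rng.uniform(0.4, 1.2),
        motor_strength=rng.uniform(0.8, 1.2),
        latency_steps=rng.randint(0, 3),   # randint is inclusive on both ends
        backlash_rad=rng.uniform(0.0, 0.02),
    )

rng = random.Random(0)  # seeded so runs are reproducible
episodes = [sample_params(rng) for _ in range(100)]
```

A policy that performs well across the whole sampled range is less likely to depend on one exact friction or latency value that the real robot will not match.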

Overfitting to a specific robot body · Medium

A WBA-FM trained on one humanoid (e.g. Unitree H1) transfers poorly to another (e.g. Tesla Optimus). This limits cross-embodiment generalization.

Fix: Add morphology conditioning to the network input (robot parameters) and train on multiple morphologies jointly.
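In its simplest form, morphology conditioning means appending a normalized vector of robot parameters to every observation. The sketch below assumes that scheme; the robots `robot_a` and `robot_b`, their numbers and the normalization constants are placeholders, not real specifications. Handling different joint counts (for example by padding and masking) is left out.

```python
from typing import Dict, List, Sequence

# Placeholder robots with made-up parameters, for illustration only.
ROBOTS: Dict[str, Dict[str, float]] = {
    "robot_a": {"leg_length_m": 0.80, "arm_length_m": 0.60, "mass_kg": 50.0},
    "robot_b": {"leg_length_m": 0.95, "arm_length_m": 0.70, "mass_kg": 70.0},
}

def morphology_vector(spec: Dict[str, float]) -> List[float]:
    """Normalize robot parameters to roughly unit scale (constants are assumptions)."""
    return [
        spec["leg_length_m"],
        spec["arm_length_m"],
        spec["mass_kg"] / 100.0,
    ]

def conditioned_input(obs: Sequence[float], spec: Dict[str, float]) -> List[float]:
    """Concatenate the observation with the morphology vector."""
    return list(obs) + morphology_vector(spec)

obs = [0.1, 0.2]
x_a = conditioned_input(obs, ROBOTS["robot_a"])
x_b = conditioned_input(obs, ROBOTS["robot_b"])
```

Training one network on batches that mix `x_a`-style and `x_b`-style inputs lets it learn how the same goal maps to different commands on different bodies.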

Evolution

Original paper · 2024 · CoRL 2024 · Zipeng Fu

HumanPlus: Humanoid Shadowing and Imitation from Humans

Zipeng Fu, Qingqing Zhao, et al. (Stanford University)

2004

Whole-Body Control (classical)

Oussama Khatib formalizes operational-space whole-body control: hand-tuned controllers driving multiple prioritized objectives (balance + arm motion). The foundation of humanoid robotics for ~20 years.

2024

HumanPlus / Humanoid Shadowing Transformer

Inflection point

Stanford (Fu et al., CoRL 2024) releases the Humanoid Shadowing Transformer (HST), a transformer trained with RL in simulation on large MoCap data that controls a humanoid's whole body in real time. Breakthrough: a neural policy replaces the entire classical WBC stack.

HumanPlus: Humanoid Shadowing and Imitation from Humans (paper)

2024

OmniH2O / ExBody / H2O

A series of works (CMU, UC San Diego, among others) extends the paradigm with different whole-body controller variants trained on MoCap that jointly optimize balance, manipulation and locomotion.

2026

MindOn coins the name Whole-Body Action Foundation Model

In its Mind-0 architecture, MindOn uses the name Whole-Body Action Foundation Model for the low-level model, which is trained on tens of thousands of hours of MoCap. MindOn claims sub-3 cm end-effector tracking accuracy and global motion coherence. The model serves as a universal execution interface for a heterogeneous robot fleet.

(concept)