(1) Training: takes an expert action sequence a_0:T, adds Gaussian noise over K steps (forward diffusion), and trains a network epsilon_theta(a_t, t, observation) to predict the added noise; here t in a_t indexes the diffusion step, not the control timestep. (2) Inference: starts from pure Gaussian noise a_K ~ N(0, I) and iteratively denoises over K steps using a DDPM or DDIM scheduler, conditioned on the current observation o_t. (3) Execution: from the predicted sequence of T_p actions, executes only the first T_a (receding horizon), then repeats inference on a new observation.
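The training target and the sampling loop described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation: the linear beta values and K = 1000 are assumptions borrowed from the original DDPM setup, and `eps_model` is a hypothetical callable standing in for the trained network epsilon_theta.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 1000                                  # diffusion steps (assumed, DDPM-style default)
betas = np.linspace(1e-4, 0.02, K)        # linear beta schedule (assumed values)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)           # cumulative signal retention per step


def add_noise(a0, k, eps):
    """Forward diffusion: noisy action sequence at step k, given clean actions a0."""
    return np.sqrt(alpha_bars[k]) * a0 + np.sqrt(1.0 - alpha_bars[k]) * eps


def training_example(a0):
    """One (noisy input, step, target) triple for the noise-prediction loss."""
    k = int(rng.integers(0, K))
    eps = rng.standard_normal(a0.shape)
    # The network is trained so that eps_model(a_k, k, obs) approximates eps.
    return add_noise(a0, k, eps), k, eps


def ddpm_sample(eps_model, obs, shape):
    """Reverse process: start from pure noise and denoise for K steps."""
    a = rng.standard_normal(shape)
    for k in reversed(range(K)):
        eps_hat = eps_model(a, k, obs)
        mean = (a - betas[k] / np.sqrt(1.0 - alpha_bars[k]) * eps_hat) / np.sqrt(alphas[k])
        noise = rng.standard_normal(shape) if k > 0 else 0.0   # no noise on the last step
        a = mean + np.sqrt(betas[k]) * noise
    return a
```

In a real policy, `eps_model` would be the observation-conditioned U-Net or Transformer, and the returned array would hold the T_p predicted actions.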
Diffusion Policy solves the problem of multimodal expert demonstrations in robotics. Classical behavior cloning with MSE regression averages different correct actions for the same state and produces in-between actions that fail to complete the task. Diffusion Policy models the entire conditional action distribution directly, so a single model learns all correct strategies.
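The averaging failure can be reproduced with a toy example. The sketch below assumes hypothetical one-dimensional demonstrations in which, for the same state, half of the experts steer left (-1) and half steer right (+1); the MSE-optimal prediction for that state is the sample mean, which lies between the two modes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical demonstrations for ONE state: steer left (-1) or right (+1), plus small noise.
modes = rng.choice([-1.0, 1.0], size=1000)
actions = modes + 0.05 * rng.standard_normal(1000)

# For a fixed state, the prediction that minimises mean squared error is the mean action.
mse_optimal = actions.mean()

# Distance from that prediction to the closest valid expert mode.
dist_to_nearest_mode = min(abs(mse_optimal - 1.0), abs(mse_optimal + 1.0))
```

The regressed action ends up close to 0, far from both valid strategies, whereas a sampler over the full distribution would return a value near -1 or near +1.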
Neural network epsilon_theta(a_t, t, obs) predicting the noise added to the action sequence at step t. In the original paper: 1D U-Net (CNN) with FiLM conditioning, or a Transformer with cross-attention over observations.
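FiLM conditioning can be illustrated as a per-channel scale and shift of the action features, both computed from the conditioning vector. The sketch below shows the generic FiLM form in NumPy; the shapes, the single linear projection, and the names `weight` and `bias` are assumptions for illustration and may differ from the layer used in the paper's code.

```python
import numpy as np


def film(features, cond, weight, bias):
    """Generic FiLM layer.

    features: (C, T) feature map over the action sequence
    cond:     (D,)   conditioning vector (observation and step embedding)
    weight:   (2C, D), bias: (2C,)  hypothetical linear projection parameters
    """
    channels = features.shape[0]
    gamma_beta = weight @ cond + bias                 # (2C,)
    gamma, beta = gamma_beta[:channels], gamma_beta[channels:]
    # Broadcast the per-channel scale and shift across the time axis.
    return gamma[:, None] * features + beta[:, None]
```

Because the modulation is applied per channel and shared over time, the same observation influences every action in the predicted sequence.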
Algorithm defining the noise trajectory in forward diffusion (beta schedule: linear, cosine) and the sampling strategy at inference (DDPM iterative, DDIM deterministic and shortened to K=10-20 steps).
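To make the scheduler's two roles concrete, the sketch below builds a linear and a cosine beta schedule and runs a deterministic DDIM update (eta = 0) over a subsampled set of steps. The schedule constants (1e-4, 0.02, s = 0.008, cap 0.999) are commonly used defaults taken here as assumptions, and `eps_model` is a hypothetical stand-in for the trained denoiser.

```python
import numpy as np


def linear_betas(K, beta_start=1e-4, beta_end=0.02):
    return np.linspace(beta_start, beta_end, K)


def cosine_betas(K, s=0.008, max_beta=0.999):
    """Betas derived from a squared-cosine alpha_bar curve."""
    steps = np.arange(K + 1)
    f = np.cos((steps / K + s) / (1 + s) * np.pi / 2) ** 2
    alpha_bar = f / f[0]
    return np.clip(1.0 - alpha_bar[1:] / alpha_bar[:-1], 0.0, max_beta)


def ddim_step(a_k, eps_hat, ab_k, ab_prev):
    """Deterministic DDIM update from noise level ab_k to ab_prev."""
    a0_hat = (a_k - np.sqrt(1.0 - ab_k) * eps_hat) / np.sqrt(ab_k)   # predicted clean actions
    return np.sqrt(ab_prev) * a0_hat + np.sqrt(1.0 - ab_prev) * eps_hat


def ddim_sample(eps_model, obs, shape, alpha_bars, num_steps=10, rng=None):
    """Sample with only num_steps network calls instead of len(alpha_bars)."""
    if rng is None:
        rng = np.random.default_rng(0)
    K = len(alpha_bars)
    ks = np.linspace(K - 1, 0, num_steps).round().astype(int)   # subsampled steps, high to low
    a = rng.standard_normal(shape)
    for i, k in enumerate(ks):
        ab_prev = alpha_bars[ks[i + 1]] if i + 1 < len(ks) else 1.0
        a = ddim_step(a, eps_model(a, k, obs), alpha_bars[k], ab_prev)
    return a
```

A typical call would pass `alpha_bars = np.cumprod(1.0 - cosine_betas(100))` and `num_steps=10`, so a model trained with 100 steps is sampled with 10 forward passes.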
Vision backbone (ResNet-18/50, ViT, CLIP), often pretrained, that encodes camera images; its features are combined with the robot state into a compact conditioning vector for the denoising network.
K denoising steps (typically 10-100), executed sequentially, make prediction significantly slower than a policy that needs a single forward pass. At high control frequencies (50-100 Hz) this becomes a bottleneck.
Diffusion training assumes inputs on a scale comparable to the unit-variance Gaussian noise, for example zero mean and unit variance or a fixed range such as [-1, 1]. Raw robot actions (joint positions, velocities) have different scales and distributions; without normalization the denoising loss becomes unstable and the model may fail to converge.
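One common way to handle this is per-dimension min-max scaling of actions to [-1, 1], inverted after sampling. The sketch below is a minimal version of that idea; the function names are hypothetical, and the statistics are assumed to be computed once on the training demonstrations and reused at inference.

```python
import numpy as np


def fit_limits(actions, eps=1e-8):
    """Per-dimension minimum and span over all demonstrations.

    actions: array of shape (..., action_dim)
    """
    flat = actions.reshape(-1, actions.shape[-1])
    lo, hi = flat.min(axis=0), flat.max(axis=0)
    return lo, np.maximum(hi - lo, eps)       # eps guards against constant dimensions


def normalize(x, lo, span):
    """Map raw actions to [-1, 1] before adding diffusion noise."""
    return 2.0 * (x - lo) / span - 1.0


def unnormalize(y, lo, span):
    """Map denoised samples back to raw robot units before execution."""
    return (y + 1.0) / 2.0 * span + lo
```

Forgetting the inverse step at inference is an easy mistake: the sampler outputs normalized actions, which must be mapped back before being sent to the robot.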
A T_p that is too small (e.g., 1 action) loses the temporal consistency characteristic of DP and reduces it to a standard single-step policy. A T_p that is too large bloats the output without benefit, and a large T_a (replanning too rarely) makes the robot ignore new observations.
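The interplay of T_p and T_a is easiest to see in the control loop itself. The following sketch assumes hypothetical callables `policy`, `get_obs`, and `apply_action`; it predicts T_p actions, executes only the first T_a, and then replans from a fresh observation.

```python
def run_receding_horizon(policy, get_obs, apply_action, T_a, num_replans):
    """Execute the first T_a actions of each predicted plan, then replan."""
    executed = []
    for _ in range(num_replans):
        obs = get_obs()                      # fresh observation before every replan
        plan = policy(obs)                   # sequence of T_p predicted actions
        if len(plan) < T_a:
            raise ValueError("T_a must not exceed the prediction horizon T_p")
        for action in plan[:T_a]:            # the remaining T_p - T_a actions are discarded
            apply_action(action)
            executed.append(action)
    return executed
```

With T_a = 1 the robot replans at every step and pays the full inference cost each time; with T_a = T_p it runs the whole plan open-loop between replans.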
Diffusion Policy typically requires 50-200 trajectories per task for solid convergence. Inconsistent demonstrations or teleoperation errors are encoded as likely modes of the distribution: the model learns them faithfully.
The Ho, Jain, Abbeel paper (Denoising Diffusion Probabilistic Models) defining the probabilistic diffusion framework for image generation, the mathematical foundation later used by Diffusion Policy.
Energy-based policy trained contrastively (InfoNCE): a direct predecessor of Diffusion Policy in the idea of modeling an implicit conditional density rather than regressing. Diffusion Policy improves training stability and task performance over IBC (Implicit Behavioral Cloning).
The paper Diffusion Policy: Visuomotor Policy Learning via Action Diffusion (arXiv 2303.04137) introduces diffusion as a robot policy pattern. The open reference implementation (diffusion-policy.github.io) quickly becomes a standard in the robot learning community.
Octo (Berkeley) and RDT-1B (Tsinghua) bring diffusion-based action heads to large models pretrained on multi-robot data such as Open X-Embodiment; RDT-1B reaches the billion-parameter scale, while Octo stays below 100M parameters. OpenVLA, built on Llama 2 7B, instead predicts discretized action tokens autoregressively. Diffusion Policy ceases to be a single approach and becomes a standard building block of robotics foundation models.
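Physical Intelligence releases π0, in which diffusion is replaced by flow matching (a technique for training continuous normalizing flows). The claimed advantage is sampling in far fewer steps than DDPM-style denoising by integrating a learned velocity field; sampling is still iterative, with π0 reported to use about 10 integration steps. This marks the beginning of a trend from discrete-step diffusion toward continuous-time formulations.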
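For comparison with the denoising loop of diffusion, flow-matching sampling integrates a learned velocity field from noise to actions, and the number of integration steps is a free parameter. The sketch below uses plain Euler integration under an assumed convention (t = 0 is noise, t = 1 is data); `velocity` is a hypothetical stand-in for the trained network, and this is not π0's actual code.

```python
import numpy as np


def euler_flow_sample(velocity, obs, shape, num_steps=10, rng=None):
    """Integrate dx/dt = velocity(x, t, obs) from t = 0 (noise) to t = 1 (actions)."""
    if rng is None:
        rng = np.random.default_rng(0)
    x = rng.standard_normal(shape)          # start from pure Gaussian noise
    dt = 1.0 / num_steps
    for n in range(num_steps):
        t = n / num_steps
        x = x + dt * velocity(x, t, obs)    # one forward pass per Euler step
    return x
```

Each Euler step costs one network call, so the latency trade-off is the same as choosing K for DDIM: fewer steps are faster but integrate the field more coarsely.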
Production robotics foundation models (GO-1 for AgiBot G1/G2 humanoids) use diffusion in the action head as a standard, proven component. The Diffusion Policy architecture is now an embedded standard rather than an experimental approach.
The main inference cost of Diffusion Policy is K forward passes through the denoising network (typically K=10 with DDIM, K=100 with DDPM). For a manipulator running at 10 Hz, K=10 with a simple 1D U-Net on an RTX 4090 is realistic; a model trained with K=100 DDPM steps generally needs DDIM subsampling or distillation to meet that rate. Training cost is dominated by standard backpropagation through the U-Net or Transformer.
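A rough latency budget follows from simple arithmetic. The helper below is a back-of-the-envelope sketch under stated assumptions: one inference must finish within the time covered by the T_a actions it produces, and observation encoding and other overheads are ignored.

```python
def per_pass_budget_ms(control_hz, K, T_a=1):
    """Upper bound on the time per denoising pass, in milliseconds.

    control_hz: control frequency of the robot
    K:          number of sequential denoising passes per inference
    T_a:        number of actions executed before the next inference
    """
    window_ms = 1000.0 * T_a / control_hz   # wall-clock time covered by one inference
    return window_ms / K
```

At 10 Hz with T_a = 1, K = 10 leaves 10 ms per forward pass, while K = 100 leaves only 1 ms; executing T_a = 8 actions per inference raises the K = 100 budget to 8 ms per pass.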
Training is massively parallel on GPU. The K denoising steps of inference are inherently sequential, but each step maps well to GPU, and throughput scales by batching multiple agents or parallel predictions.
A lightweight DP (1D U-Net, DDIM K=10) runs on Jetson AGX Orin / Thor with <30 ms latency for a 7-DoF manipulator. INT8 quantization is required for heavier Transformer models.
Iterative denoising on CPU is too slow for real-time robot control (>200 ms latency per inference).