Architecture

Diffusion Policy

2023 · Active · Updated: 23 June 2026 · Published

Key innovation

Represents a robot's visuomotor policy as a conditional denoising (diffusion) process instead of direct action regression. This lets the policy model multimodal expert action distributions and learn long action sequences stably from demonstrations.

How it works

(1) Training: take an expert action sequence a_0:T, corrupt it with Gaussian noise at a diffusion step t drawn from the K-step forward process, and train a network epsilon_theta(a_t, t, observation) to predict the added noise. Here t indexes the diffusion step, not the control timestep.

(2) Inference: start from pure Gaussian noise a_K ~ N(0, I) and iteratively denoise it with a DDPM or DDIM scheduler, conditioned on the current observation. DDIM can use fewer denoising steps at inference than the K used in training.

(3) Execution: from the predicted sequence of T_p actions, execute only the first T_a (receding horizon), then repeat inference on a new observation.
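The training step in (1) can be sketched as follows. This is a minimal illustration in plain Python, assuming the standard DDPM closed form a_t = sqrt(alpha_bar_t) * a_0 + sqrt(1 - alpha_bar_t) * eps. The helper names add_noise and noise_prediction_loss, the alpha_bar value, and the action values are hypothetical and are not taken from the reference implementation.

```python
import math
import random

def add_noise(clean_actions, alpha_bar, rng):
    """Corrupt a flattened action chunk at one diffusion step.

    alpha_bar is the cumulative signal-retention factor of that step
    (close to 1 means almost clean, close to 0 means almost pure noise).
    """
    noise = [rng.gauss(0.0, 1.0) for _ in clean_actions]
    noisy = [math.sqrt(alpha_bar) * a + math.sqrt(1.0 - alpha_bar) * e
             for a, e in zip(clean_actions, noise)]
    return noisy, noise

def noise_prediction_loss(predicted_noise, true_noise):
    """Mean squared error between predicted and actually added noise."""
    return sum((p - t) ** 2 for p, t in zip(predicted_noise, true_noise)) / len(true_noise)

rng = random.Random(0)
clean = [0.1, -0.3, 0.5, 0.2]  # one flattened action chunk (made-up values)
noisy, noise = add_noise(clean, alpha_bar=0.5, rng=rng)

# A real network would map (noisy, step, observation) to a noise estimate.
# Two stand-in "predictions" show the range of the loss:
perfect = noise_prediction_loss(noise, noise)                 # ideal predictor
untrained = noise_prediction_loss([0.0] * len(noise), noise)  # always predicts zero
```

The loss is zero only when the network recovers exactly the noise that was added, which is what makes the objective a plain regression problem even though the resulting policy is a sampler.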

Problem solved

Diffusion Policy addresses the problem of multimodal expert demonstrations in robotics. Classical behavior cloning with MSE regression averages the different correct actions seen for the same state and outputs an in-between action that may match none of them and fail the task. Diffusion Policy instead learns to sample from the full conditional action distribution, so a single model can represent several correct strategies.
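The averaging failure can be seen in a toy case. The sketch below uses made-up one-dimensional demonstrations (half steer left, half steer right); the numbers are illustrative assumptions, not data from the paper.

```python
import statistics

# Hypothetical demonstrations for one state: half of the experts steer
# left (-1.0) around an obstacle, half steer right (+1.0).
expert_actions = [-1.0] * 50 + [1.0] * 50

def mse(prediction, targets):
    """Mean squared error of a single constant prediction."""
    return sum((prediction - t) ** 2 for t in targets) / len(targets)

# The constant that minimizes MSE is the mean of the targets.
mse_optimal_action = statistics.fmean(expert_actions)

# Distance from the regression output to the nearest demonstrated action.
gap_to_nearest_mode = min(abs(mse_optimal_action - a) for a in expert_actions)

# The in-between action has lower MSE than either demonstrated action.
averaged_beats_modes = (mse(mse_optimal_action, expert_actions)
                        < min(mse(-1.0, expert_actions), mse(1.0, expert_actions)))
```

The regression optimum lands midway between the two modes, on an action no expert ever took. A policy that samples from the distribution would instead return one of the demonstrated actions.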

Key mechanisms

Forward diffusion: gradual addition of Gaussian noise to expert action sequences over K steps

Conditional denoising network: epsilon_theta(a_t, t, observation) — 1D U-Net or Transformer

Conditioning on observations: FiLM (CNN) or cross-attention (Transformer)

Iterative sampling at inference: DDPM (K=100) or DDIM (K=10-20) for lower latency

Action chunking + receding horizon: predict T_p actions, execute T_a, re-plan

Sinusoidal positional embedding of the diffusion step, which tells the network the current noise level
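The diffusion-step embedding in the last item can be sketched as below. This is a common Transformer-style formulation, given as an assumption; the exact frequency spacing in the reference implementation may differ, and the function name and max_period default are illustrative.

```python
import math

def sinusoidal_step_embedding(step, dim, max_period=10000.0):
    """Map an integer diffusion step to a dim-sized vector (dim must be even).

    The first half holds sines and the second half cosines, evaluated at
    geometrically spaced frequencies so that nearby and distant steps
    remain distinguishable.
    """
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(step * f) for f in freqs] + [math.cos(step * f) for f in freqs]

early = sinusoidal_step_embedding(5, 8)
late = sinusoidal_step_embedding(50, 8)
```

In a typical design this vector passes through a small MLP and is combined with the observation embedding before conditioning the denoising network.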

Strengths & limitations

Strengths

✓Native modeling of multimodal actions without assumptions on the number of modes

✓Training stability (the denoising loss is simple and well-conditioned)

✓Excellent results on manipulation benchmarks (Push-T, Robomimic, RoboTwin)

✓Scales well to long horizons via action chunking + receding horizon

✓Plug-and-play architecture — works with both CNN and Transformer backbones

✓Open reference implementation from Cheng Chi (Columbia) with models and datasets

✓Became a standard baseline for newer methods like Octo, π0, RDT-1B

Limitations

✗High inference latency — K denoising steps (10-100) increase prediction time vs a single-pass MLP

✗Requires a reasonable number of demonstrations (typically 50-200 trajectories per task) — performs worse in the few-shot regime

✗Does not natively model language — requires a separate text encoder (e.g., CLIP) or hybridization with a VLM

✗Hyperparameter choice (T_p, T_a, K, scheduler) requires empirical tuning per task

✗No native support for online fine-tuning (RL fine-tuning remains an active research area)

✗Larger GPU memory footprint than a small MLP policy, especially for the Transformer variant

Components

Denoising Network

The most important component: it learns to reverse the diffusion process in action space, conditioned on observations.

Neural network epsilon_theta(a_t, t, obs) predicting the noise added to the action sequence at step t. In the original paper: 1D U-Net (CNN) with FiLM conditioning, or a Transformer with cross-attention over observations.

1D U-Net with FiLM: 1D convolutions along the action time axis; FiLM conditioning on the observation embedding.

Transformer encoder-decoder: cross-attention between the observation embedding and the action sequence being denoised. Scales better to long horizons.
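FiLM conditioning, as used in the 1D U-Net variant, applies a per-channel scale and shift derived from the conditioning vector. The sketch below is a generic FiLM layer in plain Python with hypothetical weights and shapes; it is not the reference implementation, and some implementations predict only the shift.

```python
def linear(weights, bias, x):
    """Dense layer: one output per row of weights."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def film(features, cond, weights, bias):
    """Modulate features of shape [channels][time] with a conditioning vector.

    The linear layer outputs 2 * channels values: the first half are
    per-channel scales (gamma), the second half per-channel shifts (beta).
    """
    channels = len(features)
    scale_shift = linear(weights, bias, cond)
    gamma, beta = scale_shift[:channels], scale_shift[channels:]
    return [[g * v + b for v in row] for row, g, b in zip(features, gamma, beta)]

# Toy example: 2 channels, 3 action timesteps, 2-dimensional conditioning.
features = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
cond = [2.0, -1.0]                      # stands in for the observation embedding
weights = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.5, 0.0]
modulated = film(features, cond, weights, bias)
```

Because the same scale and shift apply at every timestep of a channel, the observation influences the whole action chunk uniformly along the time axis.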

Official

Noise Scheduler

Controls the trade-off between prediction quality and inference latency.

Algorithm defining the noise trajectory in forward diffusion (beta schedule: linear, cosine) and the sampling strategy at inference (DDPM iterative, DDIM deterministic and shortened to K=10-20 steps).
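The two beta schedules named above can be compared through their cumulative signal-retention factor alpha_bar. The sketch assumes the linear endpoints from the DDPM paper (chosen there for 1000 steps) and the cosine form with offset s = 0.008; these defaults and the function names are illustrative assumptions, not settings confirmed for Diffusion Policy.

```python
import math

def linear_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """alpha_bar_k = product of (1 - beta_i) for a linearly increasing beta."""
    alpha_bars, running = [], 1.0
    for k in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * k / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def cosine_alpha_bars(num_steps, s=0.008):
    """alpha_bar_k defined directly by a squared-cosine curve."""
    def f(k):
        return math.cos((k / num_steps + s) / (1 + s) * math.pi / 2) ** 2
    return [f(k) / f(0) for k in range(1, num_steps + 1)]

linear = linear_alpha_bars(100)
cosine = cosine_alpha_bars(100)
signal_left_linear = linear[-1]   # share of clean-signal variance at the last step
signal_left_cosine = cosine[-1]
```

With only 100 steps, these linear endpoints still leave a substantial share of the clean signal at the final step, while the cosine curve drives it to essentially zero. That mismatch matters because sampling starts from pure noise.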

DDPM: iterative stochastic sampling with K=100 steps; high quality, high latency.

DDIM: deterministic sampling accelerated to K=10-20 steps without quality loss on many tasks.
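A deterministic DDIM update (eta = 0) over a subsampled set of steps can be sketched as follows. The toy alpha_bar schedule and the oracle noise predictor, which knows the clean target and stands in for a trained network, are assumptions made so the example runs without a model.

```python
import math
import random

def ddim_step(x, eps, ab_k, ab_prev):
    """One deterministic DDIM update from signal level ab_k to ab_prev."""
    x0_pred = [(xi - math.sqrt(1.0 - ab_k) * ei) / math.sqrt(ab_k)
               for xi, ei in zip(x, eps)]
    return [math.sqrt(ab_prev) * x0 + math.sqrt(1.0 - ab_prev) * ei
            for x0, ei in zip(x0_pred, eps)]

def ddim_sample(predict_noise, alpha_bars, num_inference_steps, dim, rng):
    """Denoise from Gaussian noise, visiting only every stride-th step."""
    stride = len(alpha_bars) // num_inference_steps
    steps = list(range(len(alpha_bars) - 1, -1, -stride))
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for i, k in enumerate(steps):
        # After the last visited step the sample is treated as clean.
        ab_prev = alpha_bars[steps[i + 1]] if i + 1 < len(steps) else 1.0
        x = ddim_step(x, predict_noise(x, k), alpha_bars[k], ab_prev)
    return x, len(steps)

# Toy 100-step schedule and a known clean action (hypothetical values).
alpha_bars = [1.0 - 0.99 * (k + 1) / 100 for k in range(100)]
target = [0.5, -0.25]

def oracle_noise(x, k):
    """Noise that would explain x if the clean action were `target`."""
    ab = alpha_bars[k]
    return [(xi - math.sqrt(ab) * ti) / math.sqrt(1.0 - ab) for xi, ti in zip(x, target)]

sample, num_steps = ddim_sample(oracle_noise, alpha_bars, 10, len(target), random.Random(0))
```

With a perfect noise estimate the sampler reaches the target in 10 network calls instead of 100. A learned network is imperfect, so how far the step count can be reduced is an empirical question for each task.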

Official

Observation Encoder

Provides the visual context on which action generation is conditioned.

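Vision backbone (ResNet-18/50, ViT, CLIP) that processes camera images, together with robot state, into a compact conditioning vector for the denoising network. The backbone can be pretrained or trained end-to-end with the policy; Chi et al. train a ResNet-18 end-to-end without pretraining.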

Official

Implementation

Implementation pitfalls

High inference latency (severity: High)

K denoising steps (typically 10-100) run sequentially, so prediction takes far longer than a single forward pass of a feed-forward policy. At high control frequencies (50-100 Hz) this becomes a bottleneck.

Fix: Use DDIM instead of DDPM (K drops from 100 to 10-20), consistency models, or distillation into one-step approximations. Reducing replanning frequency (larger T_a in receding horizon) also helps.
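A rough latency budget shows why both levers matter. The sketch assumes one inference must finish within the time it takes to execute T_a actions (inference overlapped with execution), and the 5 ms per denoising step is a made-up figure; measure it on the target hardware.

```python
def max_control_rate_hz(num_denoising_steps, seconds_per_step, t_a):
    """Highest control rate at which one inference fits inside T_a control periods."""
    inference_seconds = num_denoising_steps * seconds_per_step
    return t_a / inference_seconds

# Assumed 5 ms per network call, T_a = 8 executed actions per plan.
ddpm_rate = max_control_rate_hz(num_denoising_steps=100, seconds_per_step=0.005, t_a=8)
ddim_rate = max_control_rate_hz(num_denoising_steps=10, seconds_per_step=0.005, t_a=8)
```

Under these assumptions the sustainable control rate rises tenfold when K drops from 100 to 10, and it scales linearly with T_a. The cost of a larger T_a is slower reaction to new observations.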

Improper action normalization (severity: High)

The forward diffusion process drives data toward N(0, I), so training works best when every action dimension sits on a comparable, roughly unit scale. Raw robot actions (joint positions, velocities) have different scales and distributions; without normalization, large-magnitude dimensions dominate the denoising loss and training can become unstable or fail to converge.

Fix: Compute per-dimension action statistics on the training set, normalize to [-1, 1] or N(0, 1), store the statistics alongside the model and denormalize at inference. Standard preprocessing in the reference implementation.
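A minimal sketch of the min-max variant of this fix, in plain Python. The function names and the toy action values are hypothetical; the reference implementation has its own normalizer classes.

```python
def fit_min_max(actions):
    """Per-dimension minimum and maximum over a list of action vectors."""
    dims = list(zip(*actions))
    return [min(d) for d in dims], [max(d) for d in dims]

def normalize(action, lows, highs):
    """Scale each dimension to [-1, 1]; constant dimensions map to 0."""
    return [2.0 * (a - lo) / (hi - lo) - 1.0 if hi > lo else 0.0
            for a, lo, hi in zip(action, lows, highs)]

def denormalize(normed, lows, highs):
    """Inverse of normalize, applied to the policy output at inference."""
    return [(n + 1.0) / 2.0 * (hi - lo) + lo for n, lo, hi in zip(normed, lows, highs)]

# Toy training set: dimension 0 is a gripper opening, dimension 1 a position in mm.
train_actions = [[0.0, 100.0], [0.5, 300.0], [1.0, 200.0]]
lows, highs = fit_min_max(train_actions)   # store these next to the checkpoint

normed = normalize([0.5, 300.0], lows, highs)
restored = denormalize(normed, lows, highs)
```

The statistics must come from the training set only and travel with the model, since denormalizing with different values silently rescales every command sent to the robot.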

Wrong choice of prediction horizon T_p and execution horizon T_a (severity: Medium)

A T_p that is too small (e.g., 1 action) loses the temporal consistency that action chunks give Diffusion Policy and reduces it to a standard single-step policy. A T_p that is too large enlarges the output without benefit, and a T_a that is too large (replanning too rarely) makes the robot slow to react to new observations.

Fix: Standard starting point from Chi et al.: T_o=2 (observations), T_p=16 (prediction), T_a=8 (execution). For long-horizon tasks increase T_p; for high-precision tasks decrease T_a.
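How T_o, T_p and T_a interact can be sketched as a receding-horizon control loop. The policy and environment below are hypothetical stand-ins (integers instead of images and joint commands) so the loop runs on its own; only the default horizon values come from the text above.

```python
from collections import deque

def run_receding_horizon(policy, env_step, first_obs, total_steps, t_o=2, t_p=16, t_a=8):
    """Plan T_p actions from the last T_o observations, execute T_a, replan."""
    obs_history = deque([first_obs] * t_o, maxlen=t_o)
    executed, num_inferences = [], 0
    while len(executed) < total_steps:
        plan = policy(list(obs_history), t_p)   # T_p predicted actions
        num_inferences += 1
        for action in plan[:t_a]:               # execute only the first T_a
            if len(executed) == total_steps:
                break
            obs_history.append(env_step(action))
            executed.append(action)
    return executed, num_inferences

# Stand-in policy: counts upward from the latest observation.
def dummy_policy(obs_history, t_p):
    return [obs_history[-1] + i for i in range(t_p)]

# Stand-in environment: the observation is a step counter.
state = {"x": 0}
def dummy_env_step(action):
    state["x"] += 1
    return state["x"]

executed, num_inferences = run_receding_horizon(
    dummy_policy, dummy_env_step, first_obs=0, total_steps=24)
```

With T_a=8, 24 control steps need three inferences, and the last 8 actions of each 16-step plan are discarded. Lowering T_a raises the inference count but lets each plan use fresher observations.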

Insufficient demonstration quantity or quality (severity: Medium)

Diffusion Policy typically requires 50-200 trajectories per task for solid convergence. Inconsistent demonstrations or teleoperation errors are encoded as likely modes of the distribution — the model learns them faithfully.

Fix: Adversarial Data Collection (ADC, AgiBot 2025) or data filtering. Test collection-execution consistency: replay collected trajectories on the robot and reject those with deviation > epsilon.
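The replay-consistency check can be sketched as a simple filter. The data layout (a dict with commanded and replayed point lists), the Euclidean distance metric, and the threshold are assumptions for illustration; a real pipeline would define deviation in whatever space matters for the task.

```python
import math

def max_deviation(commanded, replayed):
    """Largest pointwise Euclidean distance between two equal-length trajectories."""
    return max(math.dist(c, r) for c, r in zip(commanded, replayed))

def filter_demonstrations(demos, epsilon):
    """Keep only demonstrations whose replay stays within epsilon of the recording."""
    return [d for d in demos
            if max_deviation(d["commanded"], d["replayed"]) <= epsilon]

# Hypothetical recordings: end-effector (x, y) positions in meters.
demos = [
    {"id": "demo_a",
     "commanded": [(0.0, 0.0), (1.0, 0.0)],
     "replayed": [(0.0, 0.01), (1.0, 0.02)]},
    {"id": "demo_b",
     "commanded": [(0.0, 0.0), (1.0, 0.0)],
     "replayed": [(0.0, 0.0), (1.0, 0.3)]},
]
kept = filter_demonstrations(demos, epsilon=0.05)
kept_ids = [d["id"] for d in kept]
```

Here demo_b is rejected because its replay drifts 0.3 m from the recording at the second waypoint, which is the kind of inconsistency the model would otherwise learn as a valid mode.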

Evolution

Original paper · 2023 · Robotics: Science and Systems (RSS) 2023 · Cheng Chi

Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, Shuran Song

2020

DDPM — Denoising Diffusion Probabilistic Models (Ho et al.)

The Ho, Jain, Abbeel paper defining the probabilistic diffusion framework for image generation — the mathematical foundation later used by Diffusion Policy.

2022

IBC (Implicit Behavioral Cloning, Florence et al.)

Energy-based policy trained contrastively (InfoNCE) — a direct predecessor of Diffusion Policy in the idea of modeling an implicit conditional density rather than regressing. Diffusion Policy improves stability and quality over IBC.

2023

Chi et al. publish Diffusion Policy at RSS 2023

Inflection point

The paper Diffusion Policy: Visuomotor Policy Learning via Action Diffusion (arXiv 2303.04137) introduces diffusion as a robot policy pattern. The open reference implementation (diffusion-policy.github.io) quickly becomes a standard in the robot learning community.

2024

Octo, RDT-1B, OpenVLA — Diffusion Policy at foundation-model scale

Inflection point

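Octo (UC Berkeley) pairs a transformer backbone pretrained on Open X-Embodiment data with a diffusion action head, and RDT-1B (Tsinghua) scales a diffusion transformer to roughly a billion parameters for bimanual manipulation. OpenVLA, built on Llama 2 7B, instead predicts discretized action tokens autoregressively, which shows that diffusion is one of several action-decoding choices at this scale. Diffusion Policy ceases to be a standalone method and becomes a common building block of robotics foundation models.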

2024

Physical Intelligence π0 — flow matching as successor

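Physical Intelligence releases π0, in which the diffusion action head is replaced by flow matching, a closely related objective that learns a velocity field transporting noise to actions. Sampling is still iterative (the π0 paper integrates the learned field in about 10 steps), so the gain is a simpler training target and few-step inference rather than single-step inference. This marks the start of a shift from DDPM-style denoising toward flow-based action heads.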

2025

AGIBOT GO-1, GO-1 Air — hybrids with Latent Planner + Action Diffusion

Production robotics foundation models (GO-1 for AgiBot G1/G2 humanoids) use diffusion in the action head as a standard, proven component. The Diffusion Policy architecture is now an embedded standard rather than an experimental approach.