
ReFL

2023 · Active · Updated: 12 May 2026 · Published
Key innovation
Directly backpropagates gradients from a differentiable reward model (such as ImageReward) through selected denoising steps of a diffusion model, fine-tuning it to human preferences without the cost of full RL.
Category
Training
Abstraction level
Pattern
Use cases
Text-to-image fine-tuning
Aesthetic preference alignment for diffusion models
Anatomical artifact correction
Improving prompt-image alignment
Style personalization

How it works

ReFL pipeline:
1. A pre-trained reward model (e.g. ImageReward) predicts a scalar score corresponding to human preference for a text-image pair.
2. During diffusion fine-tuning, a denoising step t is sampled from the late range (e.g. the last 10 of N steps).
3. From that step, the final clean image x̂₀ is predicted via a differentiable approximation.
4. The reward model R(prompt, x̂₀) returns a scalar.
5. The gradient ∂R/∂θ is backpropagated through the denoising path to the UNet parameters.
6. Optimization maximizes E[R] with regularization against the original model (KL-like, or leaving early steps unmodified) to prevent "reward hacking".
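A minimal PyTorch-style sketch of one such update. Here unet, vae, reward_model, and scheduler are assumed interfaces (an ε-prediction network, a latent decoder, a scalar preference scorer, and a DDIM-like scheduler exposing timesteps and ᾱ_t), not any specific library's API; the regularization from step 6 is omitted and sketched separately under Implementation pitfalls.

```python
import torch

def refl_step(unet, vae, reward_model, scheduler, cond, prompts, optimizer,
              late_range=(30, 40)):
    """One ReFL update. All model/scheduler objects are assumed interfaces:
    unet(x, t, cond) -> eps, vae.decode(latents) -> images,
    reward_model(prompts, images) -> scalar per sample,
    scheduler.timesteps / scheduler.step / scheduler.alpha_bar(t) (tensor)."""
    x = torch.randn(cond.shape[0], 4, 64, 64, device=cond.device)  # initial latents

    # (2) pick the late step at which the reward gradient will be taken
    t_idx = int(torch.randint(late_range[0], late_range[1], (1,)))

    # denoise WITHOUT gradients up to that step (no reward signal, saves memory)
    with torch.no_grad():
        for t in scheduler.timesteps[:t_idx]:
            x = scheduler.step(unet(x, t, cond), t, x)

    # (3) one differentiable step, then predict the clean image x0_hat
    t = scheduler.timesteps[t_idx]
    eps = unet(x, t, cond)
    a_bar = scheduler.alpha_bar(t)                      # cumulative ᾱ_t, as a tensor
    x0_hat = (x - (1 - a_bar).sqrt() * eps) / a_bar.sqrt()

    # (4)-(6) decode, score, and maximize E[R] by minimizing -R
    reward = reward_model(prompts, vae.decode(x0_hat)).mean()
    loss = -reward
    optimizer.zero_grad()
    loss.backward()                                     # ∂R/∂θ through this step
    optimizer.step()
    return reward.item()
```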

Problem solved

Classical RLHF for image generators is expensive (PPO requires many samples and has high variance), while Supervised Fine-Tuning on human-selected images is limited by preference dataset size. ReFL solves both by leveraging a differentiable reward model — eliminating policy sampling and allowing the diffusion model to learn directly from the preference signal.

Components

Differentiable reward model · Source of training signal

A network (e.g. ImageReward based on CLIP/BLIP) trained on human preference data, returning a scalar R(prompt, image). Must be differentiable with respect to the input image.
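As an illustration only (the real ImageReward is a BLIP-based scorer; the names and dimensions below are illustrative), a toy stand-in with the required interface could look like this:

```python
import torch.nn as nn

class PreferenceReward(nn.Module):
    """Toy stand-in for a preference reward model: a frozen vision-language
    backbone plus a small MLP head returning one scalar per (prompt, image)
    pair. `backbone` is an assumed joint text-image encoder."""
    def __init__(self, backbone, dim=768):
        super().__init__()
        self.backbone = backbone   # e.g. a CLIP/BLIP encoder, kept frozen
        self.head = nn.Sequential(nn.Linear(dim, 256), nn.GELU(), nn.Linear(256, 1))

    def forward(self, prompts, images):
        feats = self.backbone(prompts, images)   # joint text-image features
        return self.head(feats).squeeze(-1)      # scalar reward per pair
```

The key property is that gradients flow from the scalar output back to the image input; the backbone stays frozen so the reward stays a fixed training signal.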


Diffusion model (UNet) · Fine-tuned generator

Denoising network (typically UNet in Stable Diffusion or DiT in newer models) — the fine-tuning target. Its parameters (or LoRA adapters) are updated.
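A minimal, self-contained illustration of the LoRA option (real setups typically use the peft library; this class is a hypothetical sketch):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen base linear layer and train only the low-rank delta
    A @ B, so the reward gradient updates far fewer parameters."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)   # base UNet weights stay frozen
        self.A = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, base.out_features))  # zero init: starts as identity delta
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A @ self.B) * self.scale
```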

x₀ prediction · Bridge between denoising and reward model

Approximation of the final clean image from the intermediate noisy state x_t (formula depends on scheduler, e.g. DDIM). Needed so the reward model can evaluate the output.
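For an ε-prediction model, the standard estimate follows from the forward process x_t = √ᾱ_t·x₀ + √(1−ᾱ_t)·ε; a sketch, assuming tensor inputs:

```python
def predict_x0(x_t, eps_pred, alpha_bar_t):
    # From x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps,
    # solve for x0 using the network's noise estimate eps_pred.
    return (x_t - (1.0 - alpha_bar_t).sqrt() * eps_pred) / alpha_bar_t.sqrt()
```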


Late denoising step sampler · Selects gradient computation point

Samples the denoising step t from a late range (usually the last few of N), trading off gradient quality (less noise, a better x̂₀ estimate) against the memory cost of backpropagation.
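For example, with N = 40 total steps and the last k = 10 eligible (hypothetical values):

```python
import torch

# Sample the reward-gradient point from the last k of N denoising steps;
# later steps give a cleaner x0 estimate but a shorter gradient path.
N, k = 40, 10
t_reward = int(torch.randint(N - k, N, (1,)))
```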


Implementation

Implementation pitfalls
Reward hacking · Critical

Without regularization, the diffusion model quickly learns to produce artifacts that solely maximize reward, at the cost of realism and diversity.

Fix: Apply the pretraining loss on early denoising steps, KL regularization against the base model, and limit the number of training steps (see the sketch after this list).
High memory consumption · High

Backpropagation through multiple denoising steps requires storing all intermediate UNet activations — quickly exceeds VRAM even on A100/H100.

Fix: Gradient checkpointing, limit late_step_range, fine-tune via LoRA instead of full weights.
Diversity collapse (mode collapse) · High

Optimizing for a scalar reward reduces generation diversity to a narrow distribution of images highly rated by the reward model.

Fix: Mixed batches with pretraining loss, use multiple reward models, early stopping.
Reward model bias · Medium

Any biases in the human preference data on which the reward model was trained are transferred to, and amplified in, the fine-tuned diffusion model.

Fix: Audit the preference dataset, ensemble multiple reward models trained on different data.
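A sketch of the mixed objective behind the reward-hacking and diversity-collapse fixes, assuming the reward is computed on one batch and the ordinary ε-prediction loss on a separate pretraining batch; lambda_reward corresponds to the "reward loss weight" hyperparameter below, and its default here is illustrative:

```python
import torch.nn.functional as F

def mixed_loss(reward, eps_pred, eps_true, lambda_reward=1e-3):
    """Combine the negated preference reward with the standard diffusion
    pretraining loss so the model cannot drift arbitrarily far from its
    base behavior while chasing the reward."""
    reward_loss = -reward.mean()                     # push preference score up
    pretrain_loss = F.mse_loss(eps_pred, eps_true)   # anchor to base behavior
    return lambda_reward * reward_loss + pretrain_loss
```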

Evolution

Original paper · 2023 · NeurIPS 2023 · Jiazheng Xu
ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation
Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Xu, Weiyun Zhang, Jie Tang, Yuxiao Dong
2023
Introduction of ReFL in the ImageReward paper
Inflection point

Xu et al. publish ImageReward and the ReFL algorithm as the first approach using a differentiable reward model to fine-tune diffusion models.

2023
DRaFT — gradient backpropagation through full trajectory

Clark et al. publish DRaFT, extending the ReFL idea by backpropagating gradients through more denoising steps.

2023
AlignProp — stable reward gradient backpropagation through denoising

Prabhudesai et al. publish AlignProp with additional gradient stabilization techniques for long denoising chains.

Technical details

Hyperparameters (configurable axes)

Late step range · High

Range of denoising steps from which the reward gradient computation point is sampled.

Last 1–10 of 40
Last 1–5
Reward loss weight · High

Coefficient scaling the reward loss when it is combined with the regularization term (usually the pretraining loss).

Regularization strategy · Critical

Method of preventing reward hacking: pretraining loss on early steps, KL to the original model, LoRA constraints.

Batch size · Medium

Memory-limited — backpropagation through denoising is memory-hungry.

Execution paradigm

Primary mode
dense

ReFL does not modify the diffusion model's execution paradigm — it remains dense. It only modifies the training phase.

Activation pattern
all_paths_active

Parallelism

Parallelism level
partially_parallel

Training is data-parallel, but the memory needed to backpropagate through multiple denoising steps typically limits the effective per-device batch size.

Scope
training · across_devices

Hardware requirements

Primary

ReFL requires simultaneous forward and backward passes through the UNet plus the reward model — it scales best on GPUs with large memory (A100 80GB, H100).

Good fit

TPUs can run diffusion model fine-tuning, but most reference ReFL implementations are written in PyTorch and target CUDA.