ReFL
How it works
ReFL pipeline:
(1) A pre-trained reward model (e.g. ImageReward) predicts a scalar score corresponding to human preference for a text-image pair.
(2) During diffusion fine-tuning, a denoising step t is sampled from the late range (e.g. the last 10 out of N steps).
(3) From that step, the final clean image x̂₀ is predicted via a differentiable approximation.
(4) The reward model R(prompt, x̂₀) returns a scalar.
(5) The gradient ∂R/∂θ is backpropagated through the denoising path to the UNet parameters.
(6) Optimization maximizes E[R], with regularization against the original model (a KL-like penalty, or leaving early steps unmodified) to prevent "reward hacking".
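The snippet below is a minimal PyTorch sketch of a single ReFL update under stated assumptions: unet, vae, and reward_model are placeholder callables (not a specific library API), the sampler is a plain DDIM update with eta = 0, and alphas_cumprod is a length-N tensor of the cumulative noise schedule with index 0 being the least noisy step.

```python
# Minimal single-update sketch of ReFL in PyTorch. All component names
# (unet, vae, reward_model) are placeholders, not a specific library API.
import torch


def refl_step(unet, vae, reward_model, alphas_cumprod, text_emb, prompt,
              optimizer, num_steps=40, late_range=(0, 9), reward_scale=1e-3):
    """One ReFL update. alphas_cumprod is a length-num_steps tensor of the
    cumulative schedule (index 0 = least noisy); optimizer holds the UNet
    (or LoRA adapter) parameters."""
    device = alphas_cumprod.device
    latents = torch.randn(1, 4, 64, 64, device=device)  # start from pure noise

    # (2) Pick the step t at which the differentiable prediction happens,
    #     from the late (nearly clean) range of the schedule.
    t_stop = int(torch.randint(late_range[0], late_range[1] + 1, (1,)))

    # Denoise from the noisiest step down to t_stop without tracking gradients
    # (plain DDIM update, eta = 0).
    with torch.no_grad():
        for i in range(num_steps - 1, t_stop, -1):
            t = torch.tensor([i], device=device)
            eps = unet(latents, t, text_emb)
            a_t, a_prev = alphas_cumprod[i], alphas_cumprod[i - 1]
            x0 = (latents - (1 - a_t).sqrt() * eps) / a_t.sqrt()
            latents = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps

    # (3) One differentiable step: predict the clean image x0-hat directly.
    t = torch.tensor([t_stop], device=device)
    eps = unet(latents, t, text_emb)
    a_t = alphas_cumprod[t_stop]
    x0_hat = (latents - (1 - a_t).sqrt() * eps) / a_t.sqrt()

    # (4) Decode to pixel space and score with the frozen reward model.
    image = vae(x0_hat)                    # differentiable decode
    reward = reward_model(prompt, image)   # scalar R(prompt, image)

    # (5)+(6) Maximize expected reward; regularization terms are omitted here.
    loss = -reward_scale * reward.mean()
    optimizer.zero_grad()
    loss.backward()  # gradients flow through the last UNet call and the decoder
    optimizer.step()
    return float(loss)
```

Only the final UNet call, the decode, and the reward model sit in the gradient path here, which is what keeps ReFL's memory cost lower than backpropagating through the full denoising chain (as DRaFT and AlignProp do, see Evolution below).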
Problem solved
Classical RLHF for image generators is expensive (PPO requires many samples and has high variance), while supervised fine-tuning on human-selected images is limited by preference dataset size. ReFL addresses both by leveraging a differentiable reward model, avoiding high-variance policy-gradient estimation and letting the diffusion model learn directly from the preference signal.
Components
Reward model: a network (e.g. ImageReward, built on a CLIP/BLIP-style backbone) trained on human preference data, returning a scalar R(prompt, image). It must be differentiable with respect to the input image.
Denoising network (typically UNet in Stable Diffusion or DiT in newer models) — the fine-tuning target. Its parameters (or LoRA adapters) are updated.
Approximation of the final clean image x̂₀ from the intermediate noisy state x_t (the formula depends on the scheduler, e.g. DDIM; see the formula after this list). Needed so the reward model can evaluate the output.
Sampling of the denoising step t from a late range (usually the last few of N steps): a tradeoff between gradient quality (less noise, a better x̂₀) and the memory cost of backpropagation.
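Under the standard ε-prediction parameterization (DDPM/DDIM), the clean-image estimate used above is, in the notation of this article:

x̂₀ = (x_t − √(1 − ᾱ_t) · ε_θ(x_t, t, prompt)) / √(ᾱ_t)

where ᾱ_t is the cumulative noise schedule at step t and ε_θ is the UNet's noise prediction; x̂₀ is then decoded (e.g. by the VAE in latent diffusion) before the reward model scores it.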
Implementation
Without regularization, the diffusion model quickly learns to produce artifacts that maximize the reward at the expense of realism and diversity; a combined-loss sketch follows below.
Backpropagation through multiple denoising steps requires storing all intermediate UNet activations, which quickly exceeds VRAM even on A100/H100 GPUs.
Optimizing for a scalar reward reduces generation diversity to a narrow distribution of images highly rated by the reward model.
Any biases in the human preference data on which the reward model was trained are transferred and amplified in the fine-tuned diffusion model.
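A common mitigation for reward hacking, sketched below under the same placeholder component names as in the earlier snippet: weight the reward loss by a small coefficient and add the ordinary ε-prediction (pre-training) loss on regular training data, which is roughly the recipe of the original ReFL implementation.

```python
# Sketch of the usual reward-hacking mitigation: mix the reward objective with
# the standard diffusion pre-training loss. Component names (unet, alphas_cumprod,
# text_emb) follow the hypothetical refl_step sketch above, not any official code.
import torch
import torch.nn.functional as F


def combined_loss(unet, alphas_cumprod, text_emb, clean_latents,
                  reward_loss, reward_weight=1e-3):
    """reward_loss is the -R term from a ReFL-style step; the second term keeps
    the model close to its pre-training objective on ordinary data."""
    device = clean_latents.device
    t = int(torch.randint(0, len(alphas_cumprod), (1,)))
    a_t = alphas_cumprod[t]
    noise = torch.randn_like(clean_latents)
    noisy = a_t.sqrt() * clean_latents + (1 - a_t).sqrt() * noise
    eps_pred = unet(noisy, torch.tensor([t], device=device), text_emb)
    pretrain_loss = F.mse_loss(eps_pred, noise)
    return reward_weight * reward_loss + pretrain_loss
```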
Evolution
Xu et al. (2023) publish ImageReward and the ReFL algorithm as the first approach that uses a differentiable reward model to fine-tune diffusion models.
Clark et al. (2023) publish DRaFT, extending the ReFL idea by backpropagating gradients through more denoising steps.
Prabhudesai et al. (2023) publish AlignProp, with additional gradient stabilization techniques (e.g. randomized truncated backpropagation) for long denoising chains.
Technical details
Hyperparameters (configurable axes)
Range of denoising steps from which the reward gradient computation point is sampled.
Coefficient weighting the reward loss when it is combined with the regularization term (usually the pre-training loss).
Method of preventing reward hacking: pre-training loss on early steps, KL to original model, LoRA constraints.
Memory-limited — backpropagation through denoising is memory-hungry.
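A hypothetical configuration collecting these axes (field names are illustrative, not taken from any particular codebase):

```python
# Hypothetical ReFL fine-tuning configuration; field names are illustrative.
from dataclasses import dataclass


@dataclass
class ReFLConfig:
    num_inference_steps: int = 40          # N: length of the denoising schedule
    reward_step_range: tuple = (30, 40)    # take the reward gradient in the last 10 of N iterations
    reward_weight: float = 1e-3            # lambda multiplying the reward loss
    regularization: str = "pretrain_loss"  # or "kl", "lora_only"
    use_lora: bool = True                  # constrain updates to LoRA adapters
    train_batch_size: int = 2              # kept small: backprop through the UNet is memory-heavy
    gradient_checkpointing: bool = True
```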
Execution paradigm
ReFL does not modify the diffusion model's execution paradigm — it remains dense. It only modifies the training phase.
Parallelism
Training is data-parallel but requires memory sufficient for backpropagation through multiple denoising steps — typically limits effective per-device batch size.
Hardware requirements
ReFL requires a simultaneous forward and backward pass through the UNet plus the reward model, so it scales best on GPUs with large memory (A100 80 GB, H100); common memory-saving options are sketched below.
TPUs support diffusion-model fine-tuning, but most reference ReFL implementations target PyTorch/CUDA.
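A sketch of common memory-saving settings for this kind of fine-tuning, assuming a diffusers-style UNet and the peft library for LoRA; exact module names and APIs depend on library versions.

```python
# Common memory savers for reward backpropagation (illustrative; exact APIs
# depend on your diffusers/peft versions).
import torch
from diffusers import UNet2DConditionModel
from peft import LoraConfig, get_peft_model

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)
unet.enable_gradient_checkpointing()  # trade recompute for activation memory

# Train only LoRA adapters on the attention projections instead of full weights.
lora_cfg = LoraConfig(r=8, lora_alpha=8,
                      target_modules=["to_q", "to_k", "to_v", "to_out.0"])
unet = get_peft_model(unet, lora_cfg)

optimizer = torch.optim.AdamW(
    [p for p in unet.parameters() if p.requires_grad], lr=1e-5
)

# Mixed precision keeps the differentiable denoising + reward forward pass smaller.
with torch.autocast("cuda", dtype=torch.float16):
    pass  # run the ReFL step sketched earlier inside this context
```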