ReFL
How it works
ReFL pipeline:
(1) A pre-trained reward model (e.g. ImageReward) predicts a scalar score corresponding to human preference for a text-image pair.
(2) During diffusion fine-tuning, a denoising step t is sampled from the late range (e.g. the last 10 out of N steps).
(3) From that step, the final clean image x̂₀ is predicted via a differentiable approximation.
(4) The reward model R(prompt, x̂₀) returns a scalar.
(5) The gradient ∂R/∂θ is backpropagated through the denoising path to the UNet parameters.
(6) Optimization maximizes E[R], with regularization against the original model (a KL-like penalty, or leaving early steps unmodified) to prevent "reward hacking".
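The snippet below is a minimal PyTorch sketch of a single ReFL update under stated assumptions: unet, vae, and reward_model are placeholder callables (not a specific library API), the sampler is a plain DDIM update with eta = 0, and alphas_cumprod is a length-N tensor of the cumulative noise schedule with index 0 being the least noisy step.

```python
# Minimal single-update sketch of ReFL in PyTorch. All component names
# (unet, vae, reward_model) are placeholders, not a specific library API.
import torch


def refl_step(unet, vae, reward_model, alphas_cumprod, text_emb, prompt,
              optimizer, num_steps=40, late_range=(0, 9), reward_scale=1e-3):
    """One ReFL update. alphas_cumprod is a length-num_steps tensor of the
    cumulative schedule (index 0 = least noisy); optimizer holds the UNet
    (or LoRA adapter) parameters."""
    device = alphas_cumprod.device
    latents = torch.randn(1, 4, 64, 64, device=device)  # start from pure noise

    # (2) Pick the step t at which the differentiable prediction happens,
    #     from the late (nearly clean) range of the schedule.
    t_stop = int(torch.randint(late_range[0], late_range[1] + 1, (1,)))

    # Denoise from the noisiest step down to t_stop without tracking gradients
    # (plain DDIM update, eta = 0).
    with torch.no_grad():
        for i in range(num_steps - 1, t_stop, -1):
            t = torch.tensor([i], device=device)
            eps = unet(latents, t, text_emb)
            a_t, a_prev = alphas_cumprod[i], alphas_cumprod[i - 1]
            x0 = (latents - (1 - a_t).sqrt() * eps) / a_t.sqrt()
            latents = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps

    # (3) One differentiable step: predict the clean image x0-hat directly.
    t = torch.tensor([t_stop], device=device)
    eps = unet(latents, t, text_emb)
    a_t = alphas_cumprod[t_stop]
    x0_hat = (latents - (1 - a_t).sqrt() * eps) / a_t.sqrt()

    # (4) Decode to pixel space and score with the frozen reward model.
    image = vae(x0_hat)                    # differentiable decode
    reward = reward_model(prompt, image)   # scalar R(prompt, image)

    # (5)+(6) Maximize expected reward; regularization terms are omitted here.
    loss = -reward_scale * reward.mean()
    optimizer.zero_grad()
    loss.backward()  # gradients flow through the last UNet call and the decoder
    optimizer.step()
    return float(loss)
```

Only the final UNet call, the decode, and the reward model sit in the gradient path here, which is what keeps ReFL's memory cost lower than backpropagating through the full denoising chain (as DRaFT and AlignProp do, see Evolution below).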
Problem solved
Classical RLHF for image generators is expensive (PPO requires many samples and has high variance), while supervised fine-tuning on human-selected images is limited by preference dataset size. ReFL addresses both by leveraging a differentiable reward model, avoiding high-variance policy-gradient estimation and letting the diffusion model learn directly from the preference signal.
Components
Reward model: a network (e.g. ImageReward, built on a CLIP/BLIP-style backbone) trained on human preference data, returning a scalar R(prompt, image). It must be differentiable with respect to the input image.
Denoising network (typically UNet in Stable Diffusion or DiT in newer models) — the fine-tuning target. Its parameters (or LoRA adapters) are updated.
Approximation of the final clean image x̂₀ from the intermediate noisy state x_t (the formula depends on the scheduler, e.g. DDIM; see the formula after this list). Needed so the reward model can evaluate the output.
Sampling of the denoising step t from a late range (usually the last few of N steps): a tradeoff between gradient quality (less noise, a better x̂₀) and the memory cost of backpropagation.
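Under the standard ε-prediction parameterization (DDPM/DDIM), the clean-image estimate used above is, in the notation of this article:

x̂₀ = (x_t − √(1 − ᾱ_t) · ε_θ(x_t, t, prompt)) / √(ᾱ_t)

where ᾱ_t is the cumulative noise schedule at step t and ε_θ is the UNet's noise prediction; x̂₀ is then decoded (e.g. by the VAE in latent diffusion) before the reward model scores it.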
Implementation
Without regularization, the diffusion model quickly learns to produce artifacts that maximize the reward at the expense of realism and diversity; a combined-loss sketch follows below.
Backpropagation through multiple denoising steps requires storing all intermediate UNet activations, which quickly exceeds VRAM even on A100/H100 GPUs.
Optimizing for a scalar reward reduces generation diversity to a narrow distribution of images highly rated by the reward model.
Any biases in the human preference data on which the reward model was trained are transferred and amplified in the fine-tuned diffusion model.
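A common mitigation for reward hacking, sketched below under the same placeholder component names as in the earlier snippet: weight the reward loss by a small coefficient and add the ordinary ε-prediction (pre-training) loss on regular training data, which is roughly the recipe of the original ReFL implementation.

```python
# Sketch of the usual reward-hacking mitigation: mix the reward objective with
# the standard diffusion pre-training loss. Component names (unet, alphas_cumprod,
# text_emb) follow the hypothetical refl_step sketch above, not any official code.
import torch
import torch.nn.functional as F


def combined_loss(unet, alphas_cumprod, text_emb, clean_latents,
                  reward_loss, reward_weight=1e-3):
    """reward_loss is the -R term from a ReFL-style step; the second term keeps
    the model close to its pre-training objective on ordinary data."""
    device = clean_latents.device
    t = int(torch.randint(0, len(alphas_cumprod), (1,)))
    a_t = alphas_cumprod[t]
    noise = torch.randn_like(clean_latents)
    noisy = a_t.sqrt() * clean_latents + (1 - a_t).sqrt() * noise
    eps_pred = unet(noisy, torch.tensor([t], device=device), text_emb)
    pretrain_loss = F.mse_loss(eps_pred, noise)
    return reward_weight * reward_loss + pretrain_loss
```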
Evolution
Xu et al. (2023) publish ImageReward and the ReFL algorithm as the first approach that uses a differentiable reward model to fine-tune diffusion models.
Clark et al. (2023) publish DRaFT, extending the ReFL idea by backpropagating gradients through more denoising steps.
Prabhudesai et al. (2023) publish AlignProp, with additional gradient stabilization techniques (e.g. randomized truncated backpropagation) for long denoising chains.
Technical details
Hyperparameters (configurable axes)
Range of denoising steps from which the reward gradient computation point is sampled.
Coefficient weighting the reward loss when it is combined with the regularization term (usually the pre-training loss).
Method of preventing reward hacking: pre-training loss on early steps, KL to original model, LoRA constraints.
Memory-limited — backpropagation through denoising is memory-hungry.
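A hypothetical configuration collecting these axes (field names are illustrative, not taken from any particular codebase):

```python
# Hypothetical ReFL fine-tuning configuration; field names are illustrative.
from dataclasses import dataclass


@dataclass
class ReFLConfig:
    num_inference_steps: int = 40          # N: length of the denoising schedule
    reward_step_range: tuple = (30, 40)    # take the reward gradient in the last 10 of N iterations
    reward_weight: float = 1e-3            # lambda multiplying the reward loss
    regularization: str = "pretrain_loss"  # or "kl", "lora_only"
    use_lora: bool = True                  # constrain updates to LoRA adapters
    train_batch_size: int = 2              # kept small: backprop through the UNet is memory-heavy
    gradient_checkpointing: bool = True
```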
Execution paradigm
ReFL does not modify the diffusion model's execution paradigm — it remains dense. It only modifies the training phase.
Parallelism
Training is data-parallel but requires memory sufficient for backpropagation through multiple denoising steps — typically limits effective per-device batch size.
Hardware requirements
ReFL requires a simultaneous forward and backward pass through the UNet plus the reward model, so it scales best on GPUs with large memory (A100 80 GB, H100); common memory-saving options are sketched below.
TPUs support diffusion-model fine-tuning, but most reference ReFL implementations target PyTorch/CUDA.
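A sketch of common memory-saving settings for this kind of fine-tuning, assuming a diffusers-style UNet and the peft library for LoRA; exact module names and APIs depend on library versions.

```python
# Common memory savers for reward backpropagation (illustrative; exact APIs
# depend on your diffusers/peft versions).
import torch
from diffusers import UNet2DConditionModel
from peft import LoraConfig, get_peft_model

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)
unet.enable_gradient_checkpointing()  # trade recompute for activation memory

# Train only LoRA adapters on the attention projections instead of full weights.
lora_cfg = LoraConfig(r=8, lora_alpha=8,
                      target_modules=["to_q", "to_k", "to_v", "to_out.0"])
unet = get_peft_model(unet, lora_cfg)

optimizer = torch.optim.AdamW(
    [p for p in unet.parameters() if p.requires_grad], lr=1e-5
)

# Mixed precision keeps the differentiable denoising + reward forward pass smaller.
with torch.autocast("cuda", dtype=torch.float16):
    pass  # run the ReFL step sketched earlier inside this context
```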