The model is run on a set of domain-specific tasks. Each response is evaluated by an objective scorer. A policy-gradient update (e.g., with PPO or GRPO) is computed from the reward signal and applied to the model's weights. The process iterates until convergence.
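The loop above can be sketched on a toy problem. This is a minimal illustration, not PPO or GRPO: it uses plain REINFORCE with a mean-reward baseline, and the "model" is assumed to be a softmax policy over three hypothetical candidate answers. The names `CANDIDATES`, `score`, and `train` are invented for the example.

```python
import math
import random

# Hypothetical toy setup: the "model" is a softmax policy over three candidate
# answers, and the objective scorer rewards only the correct one.
CANDIDATES = ["41", "42", "43"]
CORRECT = "42"

def score(response: str) -> float:
    """Objective scorer: 1.0 for the correct answer, 0.0 otherwise."""
    return 1.0 if response == CORRECT else 0.0

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train(steps=500, batch_size=16, lr=0.5, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * len(CANDIDATES)  # stands in for the model's weights
    for _ in range(steps):
        probs = softmax(logits)
        # 1. Run the model on the task: sample a batch of responses.
        samples = rng.choices(range(len(CANDIDATES)), weights=probs, k=batch_size)
        # 2. Score each response with the objective scorer.
        rewards = [score(CANDIDATES[i]) for i in samples]
        baseline = sum(rewards) / batch_size  # mean reward, reduces variance
        # 3. REINFORCE estimate: (reward - baseline) * d log pi(a) / d logits.
        grad = [0.0] * len(logits)
        for a, r in zip(samples, rewards):
            for j in range(len(logits)):
                dlogp = (1.0 if j == a else 0.0) - probs[j]
                grad[j] += (r - baseline) * dlogp / batch_size
        # 4. Update the weights by gradient ascent on expected reward.
        logits = [w + lr * g for w, g in zip(logits, grad)]
    return softmax(logits)

final_probs = train()
```

After training, `final_probs` should place most of its mass on the correct answer. A real RFT run replaces the three logits with a language model and the fixed step count with a convergence check.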
General RLHF models excel at following instructions but are not optimized for specific tasks with measurable outcomes. RFT addresses the gap between general helpfulness and task-specific accuracy.
The model may find behaviors that yield high reward without satisfying the designer's intent (e.g., generating long, confident-sounding but wrong answers). Mitigating this requires careful reward function design.
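A small sketch of how a loosely specified scorer can be exploited. Both scorers and the `Answer: <number>` output convention are hypothetical, chosen only to make the failure mode concrete.

```python
import re

def naive_score(response: str, answer: str) -> float:
    """Flawed scorer: rewards any response that mentions the answer anywhere."""
    return 1.0 if answer in response else 0.0

def strict_score(response: str, answer: str) -> float:
    """Stricter scorer: only the value on a final 'Answer: <number>' line counts."""
    match = re.search(r"Answer:\s*(-?\d+)\s*$", response.strip())
    return 1.0 if match and match.group(1) == answer else 0.0

honest = "6 * 7 = 42\nAnswer: 42"
# A response that lists many guesses and commits to a wrong one.
hedging = "It could be 40, 41, 42, 43, or 44.\nAnswer: 40"

naive_results = (naive_score(honest, "42"), naive_score(hedging, "42"))
strict_results = (strict_score(honest, "42"), strict_score(hedging, "42"))
```

Under `naive_score` the guess-listing response earns the same reward as the honest one, so a policy optimized against it has no reason to prefer correct answers. The stricter scorer closes this particular loophole, though it may well leave others open.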
If the KL penalty coefficient β is too small, the model drifts far from the SFT baseline and loses linguistic coherence. If β is too large, the model barely learns from the rewards. The optimal β depends on the task and data.
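One common way to apply the penalty is to subtract β times a KL estimate from the task reward. The sketch below assumes a per-response, single-sample KL estimate (the summed difference of token log-probabilities); the function name and the numbers are illustrative, not taken from any particular library.

```python
def kl_shaped_reward(task_reward, policy_logprobs, ref_logprobs, beta):
    """Return task_reward minus beta times a sampled KL estimate.

    policy_logprobs / ref_logprobs: per-token log-probabilities of the sampled
    response under the current policy and the frozen SFT reference.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    # Single-sample estimate of KL(policy || reference) over the response.
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return task_reward - beta * kl_estimate

# Hypothetical numbers: the policy has drifted and now assigns this response
# much more probability than the reference does.
policy_lp = [-0.1, -0.2, -0.1]
ref_lp = [-1.0, -1.5, -0.9]

weak = kl_shaped_reward(1.0, policy_lp, ref_lp, beta=0.01)
strong = kl_shaped_reward(1.0, policy_lp, ref_lp, beta=1.0)
```

With β = 0.01 the drift costs almost nothing (shaped reward near 0.97), so the policy is free to keep moving away from the reference. With β = 1.0 the penalty outweighs the task reward (shaped reward near -2.0), so a correct answer is still discouraged.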
For hard tasks (e.g., complex math problems), the model rarely receives a nonzero reward, so gradient estimates have high variance and training becomes unstable or slow to converge.
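The effect can be seen with group-relative advantages of the kind GRPO uses, where rewards are normalized within a group of responses to the same prompt. This is a simplified sketch under that assumption; the group size and reward values are made up.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# No sampled response solved the task: every reward is zero.
hard_group = [0.0] * 8
# One success in eight samples.
mixed_group = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

hard_adv = group_relative_advantages(hard_group)
mixed_adv = group_relative_advantages(mixed_group)
```

When every response in a group fails, all advantages are zero and the prompt contributes no learning signal. When a single response succeeds, it receives a large positive advantage while the rest receive small negative ones, so updates are driven by rare, high-magnitude events.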
PPO-style RFT requires keeping the actor, critic, and reference (SFT baseline) models in memory simultaneously, typically on 4-8× A100/H100 GPUs for 7-70B models. GRPO omits the critic, which reduces this footprint.