Training

RFT (Reinforcement Fine-Tuning)

2023 · Active · Published: 3 May 2026 · Updated: 3 May 2026

Key innovation

Fine-tunes a pre-trained model on domain-specific tasks using reinforcement learning rewards, targeting task accuracy rather than general RLHF-style preference alignment.

How it works

The model is run on a set of domain-specific tasks, and each response is evaluated by an objective scorer. A policy-gradient method (e.g., PPO or GRPO) turns the resulting reward signal into an update to the model's weights. The process iterates until convergence.
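
The loop described here can be illustrated with a toy sketch. It is not PPO or GRPO: it uses a plain REINFORCE update with a batch-mean baseline, and the "model" is just a vector of logits over a fixed list of candidate answers. The names `score`, `train`, and the candidate list are hypothetical and chosen only for illustration.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def score(answer, target):
    """Hypothetical objective scorer: 1.0 for an exact match, else 0.0."""
    return 1.0 if answer == target else 0.0

def train(candidates, target, steps=200, samples_per_step=8, lr=0.5, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * len(candidates)  # toy stand-in for model weights
    for _ in range(steps):
        probs = softmax(logits)
        # 1. Run the "model": sample a batch of responses from the policy.
        idxs = rng.choices(range(len(candidates)), weights=probs, k=samples_per_step)
        # 2. Evaluate each response with the objective scorer.
        rewards = [score(candidates[i], target) for i in idxs]
        baseline = sum(rewards) / len(rewards)
        # 3. Policy gradient: sum of advantage * grad log pi(sample).
        grad = [0.0] * len(logits)
        for i, r in zip(idxs, rewards):
            advantage = r - baseline
            for j in range(len(logits)):
                # d log softmax(i) / d logit_j = 1[j == i] - probs[j]
                grad[j] += advantage * ((1.0 if j == i else 0.0) - probs[j])
        # 4. Update the weights and iterate.
        logits = [l + lr * g / samples_per_step for l, g in zip(logits, grad)]
    return softmax(logits)

final_probs = train(["41", "42", "43", "44"], target="42")
```

In this sketch, probability mass shifts toward the candidate the scorer rewards. A real RFT setup replaces the logit vector with a language model and the update rule with a clipped or group-normalized variant.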

Problem solved

General RLHF models excel at following instructions but are not optimized for specific tasks with measurable outcomes. RFT addresses the gap between general helpfulness and task-specific accuracy.

Implementation

Implementation pitfalls

Reward hacking: model optimizes proxy instead of true objective · Medium

The model may find behaviors that yield high reward without satisfying the designer's intent (e.g. generating long, confident-sounding but wrong answers). Mitigating this requires careful reward function design.
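
To make the failure mode concrete, the sketch below contrasts a deliberately flawed proxy reward with a reward that checks only the final answer. Both functions, the `Answer:` output convention, and the example responses are assumptions made for illustration, not part of any particular RFT API.

```python
import re

def proxy_reward(response: str) -> float:
    """Hypothetical flawed proxy: pays for length and confident wording."""
    words = response.lower()
    confident = sum(words.count(w) for w in ("clearly", "certainly", "obviously"))
    return min(1.0, 0.01 * len(response.split()) + 0.2 * confident)

def verifiable_reward(response: str, expected: str) -> float:
    """Pays only when the response ends with 'Answer: <expected>'."""
    match = re.search(r"Answer:\s*(\S+)\s*$", response.strip())
    return 1.0 if match and match.group(1) == expected else 0.0

wrong = "Clearly, the total is certainly and obviously forty-one. Answer: 41"
right = "Answer: 42"

# The proxy prefers the wrong answer; the verifiable reward does not.
proxy_prefers_wrong = proxy_reward(wrong) > proxy_reward(right)
wrong_score = verifiable_reward(wrong, "42")
right_score = verifiable_reward(right, "42")
```

A policy trained against `proxy_reward` would be pushed toward the long, confident, wrong response, which is the behavior described above.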

KL penalty must be carefully calibrated · Medium

If the KL penalty is too small, the model drifts far from the SFT baseline and loses linguistic coherence. If it is too large, the model barely learns from the rewards. The optimal β depends on the task and data.
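
One common way to apply the penalty is to subtract β times a per-sample KL estimate from the task reward. The sketch below assumes that formulation; the function name `kl_shaped_reward` and the log-probability values are hypothetical.

```python
def kl_shaped_reward(task_reward, logprobs_policy, logprobs_ref, beta):
    """Task reward minus beta times a sampled KL estimate.

    The estimate sums, over the tokens of one sampled response,
    log pi(token) - log pi_ref(token), where pi_ref is the SFT baseline.
    """
    kl_estimate = sum(lp - lr for lp, lr in zip(logprobs_policy, logprobs_ref))
    return task_reward - beta * kl_estimate

# Hypothetical per-token log-probabilities for one correct response
# that the current policy likes much more than the SFT baseline does.
policy_lp = [-0.1, -0.2]
ref_lp = [-1.0, -1.5]

no_penalty = kl_shaped_reward(1.0, policy_lp, ref_lp, beta=0.0)
mild_penalty = kl_shaped_reward(1.0, policy_lp, ref_lp, beta=0.1)
heavy_penalty = kl_shaped_reward(1.0, policy_lp, ref_lp, beta=1.0)
```

With β = 0 the drift from the baseline costs nothing; with β = 1 the same correct response ends up with a negative shaped reward, so the policy is discouraged from moving toward it at all.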

Training instability with sparse positive rewards · Medium

For hard tasks (e.g. complex math problems) the model rarely receives a positive reward, so gradient estimates have high variance and training becomes unstable or slow to converge.
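
The effect can be seen with group-relative advantages of the kind GRPO uses, where each reward is normalized by the mean and standard deviation of its sampling group. The sketch below is a simplified illustration; the function names, group size, and success rate are assumptions.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def prob_group_has_success(success_rate, group_size):
    """Chance that at least one of group_size samples earns a reward."""
    return 1.0 - (1.0 - success_rate) ** group_size

# A group where every sample fails carries no learning signal at all.
all_fail = group_relative_advantages([0.0] * 8)

# A single success produces one large positive advantage and
# several small negative ones.
one_success = group_relative_advantages([1.0] + [0.0] * 7)

# With a 1% success rate and groups of 8, most groups are all-fail.
p_signal = prob_group_has_success(0.01, 8)
```

Under these assumed numbers, fewer than one group in ten contains any success, so most sampled batches contribute nothing and the rare informative ones contribute large updates.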

Evolution

Original paper · 2024 · arXiv 2024 · Aviral Kumar

Training Language Models to Self-Correct via Reinforcement Learning

Aviral Kumar, Vincent Zhuang, Rishabh Agarwal

Sources

Reinforcement Fine-Tuning | OpenAI API

Documentation

OpenAI