Alignment

RLVR

2024 · Active · Published: 10 June 2026 · Updated: 10 June 2026

Key innovation

Formalises the family of RL training methods that rely solely on deterministic, verifiable reward functions (math correctness, code execution, IFEval format constraints) instead of a learned reward model: a 0/1 signal at the end of a rollout, no learned reward model for the policy to exploit, and declarative control over behaviour.

How it works

RLVR rests on three pillars.

(1) A prompt set with verifiable answers: math with ground truth (e.g. AIME, MATH), code with unit tests (HumanEval, LiveCodeBench), and format-strict instructions (IFEval: does the response contain a bullet list, valid JSON, exactly 5 sentences?).

(2) A reward function R(x, y) → {0, 1} (or a composition R = α·R_correct + β·R_format) implemented programmatically, e.g. SymPy for comparing math expressions, a sandbox for `pytest` execution, regex for format checks.

(3) A policy-gradient algorithm: Tülu 3 used PPO; DeepSeek-R1 used GRPO; other implementations use REINFORCE++.

Training: the model π_θ generates rollouts on verifier-equipped prompts, each rollout receives a 0/1 reward, and the policy is updated with a KL penalty that keeps it close to a reference policy π_ref (usually the SFT checkpoint). Tülu 3 shows RLVR works not only on reasoning (math/code) but also on strict instruction-following, a domain where DPO and RLHF often struggle ("count exactly 5 words", "answer only YES or NO").

Key difference vs Reasoning RL: RLVR is a middle-tier abstraction. It describes a family of algorithms, not a specific implementation (Reasoning RL = paradigm for reasoning, with RLVR as the mechanism; GRPO = a specific algorithm within RLVR).
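The training step described above combines a verifier reward with a KL penalty towards π_ref. A minimal sketch of one common way to do that at sequence level follows; the function name `kl_penalised_reward` and the default `beta` are assumptions for illustration, and real pipelines differ in where the penalty is applied (per token or per sequence, in the reward or in the loss).

```python
from typing import Sequence


def kl_penalised_reward(
    verifier_reward: float,
    policy_logprobs: Sequence[float],
    ref_logprobs: Sequence[float],
    beta: float = 0.05,  # assumed value; tuned per run in practice
) -> float:
    """Verifier reward minus a scaled KL estimate for one rollout.

    policy_logprobs / ref_logprobs hold log pi_theta(token) and
    log pi_ref(token) for each sampled token of the rollout.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    # Single-sample estimate of KL(pi_theta || pi_ref) on the rollout:
    # sum over tokens of log pi_theta(token) - log pi_ref(token).
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return verifier_reward - beta * kl_estimate


# A correct rollout (verifier reward 1.0) that drifted from the reference
# by a total of 1.0 nats is scored 1.0 - 0.5 * 1.0.
shaped = kl_penalised_reward(1.0, [-0.5, -1.0], [-1.0, -1.5], beta=0.5)
```

The larger `beta` is, the more a rollout must stay close to the reference policy to keep its verifier reward.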

Problem solved

Classical RLHF requires a learned reward model, which is expensive to train, prone to overfitting and reward hacking, and dependent on preference-pair quality. Yet for many tasks (math, code, strict format requirements) a natural verifier exists (`==`, `pytest`, regex) that produces a correctness signal without human labels. RLVR systematises this area: it defines reward functions as a first-class pipeline component, allows any policy-gradient algorithm (PPO, GRPO, REINFORCE++), and shows that for verifiable tasks it yields a cleaner signal, carries less reward-hacking risk, and is significantly cheaper than RLHF.

Components

Verifier (rule-based reward function) · Reward-model-free training signal

A programmatic function evaluating rollout y's correctness for prompt x. No learned parameters. Defines the whole of RLVR — everything else (algorithm, model, sampler) is interchangeable.

IN: Prompt + full model response.

OUT: Scalar reward, typically binary or a weighted sum of binary components.

Math equality (SymPy): Compare formula with ground truth.

Code execution (pytest): Run unit tests in a sandbox.

Format regex (IFEval): Check strict format requirements.

Symbolic solver (Lean): Formal proof verification.

Compositional R: Weighted sum α·R_correct + β·R_format.

Official
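A compositional verifier of the form α·R_correct + β·R_format can be sketched in a few lines. This is an illustrative sketch, not the Tülu 3 code: the helper names, the `\boxed{}` answer convention, and the weights are assumptions, and the standard-library `Fraction` stands in for a SymPy comparison so that the example has no dependencies.

```python
import re
from fractions import Fraction
from typing import Optional

# Assumed convention: the final answer is wrapped in \boxed{...}.
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_boxed(response: str) -> Optional[str]:
    """Return the content of the last boxed answer, or None if absent."""
    matches = BOXED.findall(response)
    return matches[-1].strip() if matches else None


def r_format(response: str) -> float:
    """1.0 if the response contains a boxed answer, else 0.0."""
    return 1.0 if extract_boxed(response) is not None else 0.0


def r_correct(response: str, ground_truth: str) -> float:
    """1.0 if the boxed answer equals the ground truth, else 0.0."""
    answer = extract_boxed(response)
    if answer is None:
        return 0.0
    try:
        # Numeric comparison, so "1/2" and "0.5" count as equal.
        return 1.0 if Fraction(answer) == Fraction(ground_truth) else 0.0
    except (ValueError, ZeroDivisionError):
        # Non-numeric answers fall back to exact string match.
        return 1.0 if answer == ground_truth.strip() else 0.0


def reward(response: str, ground_truth: str,
           alpha: float = 1.0, beta: float = 0.1) -> float:
    """Weighted sum alpha * R_correct + beta * R_format."""
    return alpha * r_correct(response, ground_truth) + beta * r_format(response)
```

With these weights a correctly formatted but wrong answer earns only `beta`, so format compliance alone cannot outscore a correct answer.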

Verifier-equipped prompt set · External signal that defines RLVR's skill scope

Datasets with (prompt, ground truth) or (prompt, verifier function) pairs. Tülu 3 publishes its own RLVR mix: math (GSM8K, MATH) and instruction-following (prompts with IFEval-style constraints).
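One way to represent such (prompt, verifier function) pairs in code is sketched below. The `VerifiablePrompt` record and the two verifier factories are hypothetical names chosen for this example, not part of any published dataset schema.

```python
from dataclasses import dataclass
from typing import Callable

Verifier = Callable[[str], float]


@dataclass(frozen=True)
class VerifiablePrompt:
    prompt: str
    domain: str          # used later for per-domain metrics
    verifier: Verifier   # maps a model response to a reward


def exact_match(ground_truth: str) -> Verifier:
    """(prompt, ground truth) case: reward 1.0 on an exact match."""
    def verify(response: str) -> float:
        return 1.0 if response.strip() == ground_truth else 0.0
    return verify


def word_count_is(n: int) -> Verifier:
    """(prompt, verifier function) case: an IFEval-style length constraint."""
    def verify(response: str) -> float:
        return 1.0 if len(response.split()) == n else 0.0
    return verify


prompts = [
    VerifiablePrompt("What is 7 * 6? Reply with the number only.",
                     "math", exact_match("42")),
    VerifiablePrompt("Describe the sea in exactly 5 words.",
                     "instruction_following", word_count_is(5)),
]
```

Storing the verifier next to the prompt keeps the training loop generic: it only ever calls `item.verifier(response)`.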

Policy-gradient algorithm · Pluggable policy-update implementation

The mechanism that updates the policy based on verifier rewards. Any on-policy algorithm fits: PPO (Tülu 3), GRPO (DeepSeek), REINFORCE++ (OpenRLHF).

Official

Sandbox / execution environment · Secure isolation of LLM code execution

Critical infrastructure for code verifiers: subprocess with timeout, cgroups, network isolation, bans on unsafe imports. Verifier security is often overlooked, yet data leaks and exploits through an unsandboxed verifier are a real risk.

Official
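The sketch below shows only the innermost layer of such an environment: a child process with a timeout, a throwaway working directory, and a stripped environment. It is not a sandbox on its own, since it provides no cgroups, network isolation, or import restrictions; those would have to come from a container or similar layer. The names `run_untrusted` and `code_reward` are hypothetical.

```python
import os
import subprocess
import sys
import tempfile


def run_untrusted(code: str, timeout_s: float = 10.0) -> dict:
    """Run Python source in a child process with a wall-clock timeout."""
    with tempfile.TemporaryDirectory() as workdir:
        try:
            proc = subprocess.run(
                # -I: isolated mode, ignores PYTHON* variables and user site.
                [sys.executable, "-I", "-c", code],
                cwd=workdir,                                # throwaway directory
                env={"PATH": os.environ.get("PATH", "")},   # no inherited secrets
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child before raising.
            return {"ok": False, "timed_out": True, "stdout": "", "stderr": ""}
    return {
        "ok": proc.returncode == 0,
        "timed_out": False,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }


def code_reward(solution: str, test_code: str, timeout_s: float = 10.0) -> float:
    """Binary reward: 1.0 if the tests run to completion without error."""
    result = run_untrusted(solution + "\n\n" + test_code, timeout_s)
    return 1.0 if result["ok"] else 0.0
```

A failing assertion, an exception, or a timeout all produce a reward of 0.0, so an infinite loop in generated code costs at most `timeout_s` seconds.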

Implementation

Reference implementations

allenai/open-instruct (Tülu 3 RLVR pipeline)

Python (PyTorch) · Allen Institute for AI

Official

huggingface/open-r1 (open RLVR reproduction)

Python · Hugging Face

volcengine/verl (GRPO/PPO with verifiers)

Python (Ray) · ByteDance Volcengine

Hugging Face TRL (GRPOTrainer + reward functions)

Python (PyTorch) · Hugging Face

Implementation pitfalls

Reward hacking: model exploits a verifier hole · Critical

The most serious pitfall. Classic holes: hard-coding answers from tests (`assert answer == 42`), formatting answers to fool a regex (e.g. always emitting `\boxed{}`), generating short noisy answers that land by chance. Training reward keeps rising while the model's actual quality degrades.

Fix: Compose multiple R components; audit rollouts manually in early iterations; evaluate on a held-out benchmark that differs from the training set; deny the code sandbox read access to the tests.

Unsafe sandbox for code verifier · Critical

The verifier executes LLM-generated code. Without isolation (timeout, network ban, cgroups, no filesystem writes) the model can affect training infrastructure, exfiltrate data, or destroy other tasks' rollouts.

Fix: Subprocess sandbox with 10–60s timeout; cgroups on memory and CPU; no network access; whitelist of standard libraries; isolated filesystem.

Narrow-domain verifier → narrow model · High

Training only on math yields a strong mathematician that fails on everything else. A domain mix (math + code + IFEval + general QA) counters this.

Fix: Mix domains in mini-batches; supplement RLVR with a small RLHF/DPO fraction for general quality; measure per-domain separately.
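Domain mixing and per-domain measurement can be sketched as follows. The function names, the uniform default weights, and the pool layout (a dict from domain name to a list of prompts) are assumptions made for this example.

```python
import random
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def mixed_batch(
    pools: Dict[str, List[str]],
    batch_size: int,
    weights: Optional[Dict[str, float]] = None,
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Draw a mini-batch of (domain, prompt) pairs across all domains."""
    rng = random.Random(seed)
    domains = sorted(pools)
    # Uniform mixing unless explicit per-domain weights are given.
    w = [weights[d] if weights else 1.0 for d in domains]
    picks = rng.choices(domains, weights=w, k=batch_size)
    return [(d, rng.choice(pools[d])) for d in picks]


def per_domain_mean(results: List[Tuple[str, float]]) -> Dict[str, float]:
    """Average reward per domain, so one domain cannot hide another's regressions."""
    by_domain: Dict[str, List[float]] = defaultdict(list)
    for domain, reward in results:
        by_domain[domain].append(reward)
    return {d: sum(v) / len(v) for d, v in by_domain.items()}
```

Reporting `per_domain_mean` alongside the overall mean makes it visible when gains in one domain come at the expense of another.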

No signal on zero reward variance · Medium

When all rollouts for a given prompt get the same reward (all correct or all wrong), a group-relative advantage (as in GRPO) is 0 for every rollout and the prompt contributes no gradient. DAPO filters such prompts (dynamic sampling); raising the sampling temperature is another way to increase variance.

Fix: Dynamic sampling, i.e. drop zero-variance prompts from the batch; raise the temperature; difficulty curriculum.
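The zero-variance problem is easy to see with a group-normalised advantage in the style of GRPO. The sketch below assumes a mean-and-standard-deviation normalisation and a simple filter; the function names and the `eps` value are illustrative.

```python
from statistics import mean, pstdev
from typing import Dict, List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Advantage of each rollout relative to its own prompt's group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every reward is identical.
    return [(r - mu) / (sigma + eps) for r in rewards]


def drop_zero_variance(groups: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Dynamic-sampling filter: keep prompts whose rollouts disagree."""
    return {p: rs for p, rs in groups.items() if len(set(rs)) > 1}


groups = {
    "too_easy": [1.0, 1.0, 1.0, 1.0],   # all correct: advantages are all 0
    "too_hard": [0.0, 0.0, 0.0, 0.0],   # all wrong: advantages are all 0
    "useful":   [1.0, 0.0, 0.0, 0.0],   # mixed: non-zero advantages
}
kept = drop_zero_variance(groups)
```

Only the `useful` prompt survives the filter; in a real pipeline the batch would then be refilled with further sampled prompts so its size stays constant.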

Evolution

Original paper · 2024 · arXiv:2411.15124 (Allen Institute for AI, 2024) · Nathan Lambert

Tülu 3: Pushing Frontiers in Open Language Model Post-Training

Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V. Miranda, Yizhong Wang, Allen AI Tülu 3 team

2022

InstructGPT / RLHF — learned reward model

OpenAI establishes the RLHF standard with a learned reward model on preference pairs. RLVR will emerge as a counterpoint: reward as rule, not model.

RLHF (concept)

2024

DeepSeekMath and GRPO

DeepSeek introduces GRPO with rule-based rewards for mathematics. A practical pre-implementation of the RLVR idea, without the name.

GRPO (concept)

2024

Tülu 3 — formal introduction of the "RLVR" name

Inflection point

Allen AI publishes Tülu 3 (arXiv:2411.15124, November 2024) and gives the paradigm its name: Reinforcement Learning with Verifiable Rewards. They show RLVR works not only on reasoning but also on precise instruction following (IFEval) — where DPO/RLHF fail.

Tülu 3: Pushing Frontiers in Open Language Model Post-Training (paper)

2025

DeepSeek-R1 — RLVR at extreme scale

January 2025: DeepSeek-R1 applies RLVR (via GRPO) on a 671B MoE and triggers a wave of reproductions. The "RLVR" term becomes broadly adopted in the open-source community.

Reasoning RL (concept)

2025

Llama 3 Tülu, OpenR1, SimpleRL: idea diffusion

RLVR becomes the standard post-training path for open-source LLMs with strong reasoning + instruction following. OLMoE-instruct, Llama 3 Tülu, OpenR1, and SimpleRL reproduce the pipeline at various scales.

2025

Emergence of RLVR-aware datasets and benchmarks

Dedicated datasets emerge: Tülu 3 SFT mix, IFEval-Hard, RewardBench v2, each with its own verifier. Verifier-as-data-pipeline becomes its own discipline.
