RLVR rests on three pillars. (1) A prompt set with verifiable answers — math with ground truth (e.g. AIME, MATH), code with unit tests (HumanEval, LiveCodeBench), format-strict instructions (IFEval — does the response contain a bullet list, JSON, exactly 5 sentences). (2) A reward function R(x, y) → {0, 1} (or a composition R = α·R_correct + β·R_format) implemented programmatically — e.g. SymPy for math expression comparison, a sandbox for `pytest` execution, regex for format checks. (3) A policy-gradient algorithm — Tülu 3 used PPO; DeepSeek-R1 used GRPO; other implementations use REINFORCE++. Training: model π_θ generates rollouts on verifier-equipped prompts, receives 0/1 reward, the policy is updated with a KL penalty to π_ref (usually SFT). Tülu 3 shows RLVR works not only on reasoning (math/code) but also on strict instruction-following — a domain where DPO and RLHF typically fail ("count exactly 5 words", "answer only YES or NO"). Key difference vs Reasoning RL: RLVR is a MIDDLE-tier abstraction — it describes a family of algorithms, not a specific implementation (Reasoning RL = paradigm for reasoning + RLVR as the mechanism; GRPO = specific algorithm within RLVR).
Classical RLHF requires a learned reward model — expensive to train, prone to overfitting and reward hacking, dependent on preference-pair quality. Yet for many tasks — math, code, strict format requirements — a natural verifier exists (`==`, `pytest`, regex) that produces a correctness signal without human labels. RLVR systematises this area: it defines reward functions as a first-class pipeline component, allows any policy-gradient algorithm (PPO, GRPO, REINFORCE++), and shows that for verifiable tasks RLVR yields a cleaner signal, less reward hacking risk, and is significantly cheaper than RLHF.
A programmatic function evaluating rollout y's correctness for prompt x. No learned parameters. Defines the whole of RLVR — everything else (algorithm, model, sampler) is interchangeable.
Official
Datasets with (prompt, ground truth) or (prompt, verifier function) pairs. Tülu 3 reveals its own mix: math (NuminaMath, MATH), code (LiveCodeBench), instruction-following (IFEval prompts).
Mechanism updating the policy based on verifier rewards. Any on-policy algorithm — PPO (Tülu 3), GRPO (DeepSeek), REINFORCE++ (Kimi).
Official
Critical infrastructure for code verifiers: subprocess with timeout, cgroups, network isolation, bans on unsafe imports. Verifier security is often overlooked, yet leaks/exploits from here are real.
Official
The most serious pitfall. Classic holes: hard-coding answers from tests (`assert answer == 42`), formatting answers to fool regex (e.g. always `\boxed{}`), generating short noisy answers that land by chance. Training "grows" in reward but the model gets worse.
The verifier executes LLM-generated code. Without isolation (timeout, network ban, cgroups, no filesystem writes) the model can affect training infrastructure, exfiltrate data, or destroy other tasks' rollouts.
Training only on math yields a great mathematician who fails on everything else. Tülu 3 explicitly shows that a domain mix (math + code + IFEval + general QA) is essential.
When all rollouts for a given prompt get the same reward (all correct or all wrong), advantage = 0 and the prompt contributes no gradient. Tülu 3 and DAPO filter such prompts (dynamic sampling) or raise the temperature to increase variance.
OpenAI establishes the RLHF standard with a learned reward model on preference pairs. RLVR will emerge as a counterpoint: reward as rule, not model.
DeepSeek introduces GRPO with rule-based rewards for mathematics. A practical pre-implementation of the RLVR idea, without the name.
Allen AI publishes Tülu 3 (arXiv:2411.15124, November 2024) and gives the paradigm its name: Reinforcement Learning with Verifiable Rewards. They show RLVR works not only on reasoning but also on precise instruction following (IFEval) — where DPO/RLHF fail.
January 2025: DeepSeek-R1 applies RLVR (via GRPO) on a 671B MoE and triggers a wave of reproductions. The "RLVR" term becomes broadly adopted in the open-source community.
RLVR becomes the standard post-training path for open-source LLMs with strong reasoning + instruction following. OLMoE-instruct, Llama 3 Tülu, OpenR1, and SimpleRL reproduce the pipeline at various scales.
Dedicated datasets emerge: TÜLU 3 SFT mix, IFEval-Hard, RewardBench v2 — each with its own verifier. Verifier-as-data-pipeline becomes its own discipline.
Time complexity: O(N · L · |θ|) sampling + O(N · V) verifier + O(N · L · |θ|) update. Space complexity: O(2 · |θ|) (policy + reference) + verifier RAM (mała).
The bottleneck depends on domain. Math/regex: LLM sampling dominates. Code: the verifier dominates (parallel in CPU sandboxes). For mixed domains the production pipeline has a separate GPU cluster for the LLM and a separate CPU cluster for verifier sandboxes, connected by an async queue.
The most important decision in RLVR. Composition: R = α·R_correct + β·R_format + γ·R_constraints. A weak/noisy verifier → task-level reward hacking (e.g. "write everything in boxed{}" instead of reasoning).
Choice of policy-gradient algorithm. PPO (Tülu 3), GRPO (DeepSeek), REINFORCE++ (Kimi k1.5). RLVR is algorithm-agnostic — it defines the reward signal, not the update rule.
Task domain with a verifier. Tülu 3 introduces a shadow domain — instruction-following — alongside classic math/code. A domain mix in training yields more universal models.
How to combine several reward components. Weighted sum, discrete composition (binary AND), or hierarchical (format first, then correctness).
For code verifiers: per-test timeout (typically 10–60s), network isolation, memory limits (cgroups), bans on unsafe imports. A secure sandbox is essential — the verifier executes LLM-generated code.
RLVR is a reward-signal mechanism, not a model structure. The policy and reference model are standard (dense or MoE). It can be combined with any architecture.
Production pipelines (Tülu 3, DeepSeek-R1) scale the verifier separately: a cluster of Python sandboxes executes pytest on rollouts asynchronously to the vLLM inference cluster.
RLVR is standard RL for LLMs — sampling and training dominate GPUs. The verifier runs on CPU alongside the GPU cluster (Python sandboxes).
Verifiers (pytest, SymPy, regex) are CPU-bound. Verifier clusters scale independently of training GPUs.
The method itself is agnostic — any RL framework + any verifier. Hardware specifics come from LLM + RL, not from RLVR per se.