Test-time scaling (TTS) comes in three broad flavours. (1) Parallel scaling: the model generates N independent samples and a verifier or aggregation rule picks the best: majority voting / self-consistency, best-of-N with a reward model, or re-ranking. (2) Sequential scaling: the model produces long explicit or hidden chains of thought, critiques its own answer, and iteratively revises it (self-refinement, revisions). (3) Search-based scaling: beam search or MCTS over a tree of partial solutions, guided by a Process Reward Model (PRM) that scores each reasoning step. Snell et al. (2024) further introduced a "compute-optimal" strategy that allocates the test-time budget adaptively according to prompt difficulty. Frontier reasoning models such as OpenAI o1/o3 and DeepSeek R1 internalise this paradigm: instead of running an external search procedure, they are trained with RL to produce very long reasoning chains within a single answer.
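The two simplest parallel strategies can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the sampled answers and the `toy_score` function are hypothetical stand-ins for real model outputs and a real reward model.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Self-consistency: return the most frequent final answer."""
    # most_common orders equal counts by first occurrence.
    return Counter(answers).most_common(1)[0][0]


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Best-of-N: return the candidate the scorer rates highest."""
    return max(candidates, key=score)


# Hypothetical final answers extracted from N = 5 sampled reasoning chains.
samples = ["42", "41", "42", "42", "17"]


def toy_score(answer: str) -> float:
    """Toy stand-in for a reward model (assumption): prefers values near 40."""
    return -abs(int(answer) - 40)


voted = majority_vote(samples)         # picks the most common answer
best = best_of_n(samples, toy_score)   # picks the highest-scoring answer
```

Note that the two rules can disagree on the same samples: voting depends only on answer frequency, while best-of-N depends entirely on the scorer, so its quality is bounded by the quality of the reward model.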
Classical scaling laws (Kaplan, Chinchilla) assumed that model quality grows mainly with parameters and training data. That kind of scaling is increasingly expensive and shows diminishing returns. Test-time scaling asks how to substantially improve answer quality after training is finished: by spending extra inference compute, ideally concentrated on hard prompts, instead of training a larger model.
Process Reward Models (PRMs) can be exploited: a policy can learn to produce reasoning steps that score highly under the PRM without actually leading to correct answers (reward hacking).
Best-of-N and CoT-length curves flatten out; without compute-optimal allocation it is easy to burn the budget without quality gains.
TTS shifts cost from training onto every single inference, making it ill-suited for low-latency or high-throughput applications.
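The flattening of best-of-N curves can be illustrated with a simple idealised model. The sketch below assumes, purely for illustration, that each sample is independently correct with probability `p` and that a perfect verifier always recognises a correct sample; real curves flatten earlier because samples are correlated and verifiers are imperfect.

```python
def coverage(p: float, n: int) -> float:
    """Probability that at least one of n independent samples is correct."""
    return 1.0 - (1.0 - p) ** n


def marginal_gain(p: float, n: int) -> float:
    """Extra coverage from drawing sample n + 1; equals p * (1 - p) ** n."""
    return coverage(p, n + 1) - coverage(p, n)


p = 0.2  # assumed per-sample success rate (hypothetical)
for n in (1, 2, 4, 8, 16, 32, 64):
    # Cost grows linearly in n, while the gain per extra sample
    # shrinks geometrically.
    print(f"N={n:>2}  coverage={coverage(p, n):.4f}  "
          f"next-sample gain={marginal_gain(p, n):.5f}")
```

Under these assumptions, going from 1 to 2 samples adds 0.16 to coverage, while doubling from 16 to 32 samples adds less than 0.03 at sixteen times the extra cost.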
Showed that explicit reasoning steps in the prompt substantially improve performance on math and logic — an early form of sequential test-time scaling.
Sampling many reasoning chains and majority-voting the final answer — the canonical instance of parallel test-time scaling.
Training verifiers that score the correctness of every reasoning step, a key building block of test-time search.
Formalised test-time scaling as a distinct scaling axis; showed that adaptive allocation can outperform a 14× larger model at matched FLOPs.
Release of o1, whose performance scales both with RL training compute and with test-time "thinking" budget. Brought test-time scaling into consumer products.
One of the first widely available open-weights reasoning models with long RL-trained chain-of-thought, replicating the o1 effect in the open ecosystem.
Number of independent samples drawn for best-of-N / self-consistency. Compute cost grows linearly with N, while quality gains typically diminish as N increases.
Number of tokens spent on the internal chain-of-thought before the final answer. The main dial in o1/o3-style reasoning models ("thinking time").
Quality of the Process Reward Model or Outcome Reward Model used to score candidates or reasoning steps.
Number of active branches when searching over the reasoning tree (beam search, MCTS).
Test-time compute is allocated adaptively to prompt difficulty (compute-optimal scaling).
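One simple way to make the sample count adaptive is to split a fixed budget across prompts in proportion to an estimated difficulty score. The sketch below is a hypothetical proportional heuristic, not the allocation procedure of Snell et al.; the `difficulties` values, the function name, and the `min_samples` floor are all assumptions made for illustration.

```python
from typing import List, Sequence


def allocate_samples(
    difficulties: Sequence[float], total_budget: int, min_samples: int = 1
) -> List[int]:
    """Split total_budget samples across prompts, proportional to difficulty."""
    n = len(difficulties)
    if total_budget < n * min_samples:
        raise ValueError("budget too small to give every prompt min_samples")

    # Every prompt gets the floor; the rest is shared by difficulty.
    spare = total_budget - n * min_samples
    total_difficulty = sum(difficulties)
    if total_difficulty == 0:
        shares = [spare / n] * n
    else:
        shares = [spare * d / total_difficulty for d in difficulties]

    alloc = [min_samples + int(s) for s in shares]

    # Rounding down may leave a few samples unassigned; give them to the
    # prompts with the largest fractional remainders.
    leftover = total_budget - sum(alloc)
    by_remainder = sorted(
        range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True
    )
    for i in by_remainder[:leftover]:
        alloc[i] += 1
    return alloc


# Hypothetical difficulty scores for three prompts (higher = harder).
allocation = allocate_samples([1.0, 2.0, 5.0], total_budget=11)
```

With these inputs the hardest prompt receives 6 of the 11 samples and the easiest receives 2, whereas a uniform split would give each prompt 3 or 4 regardless of difficulty.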
Best-of-N and self-consistency are fully parallel across samples; beam search over reasoning steps is sequential within a trajectory but multiple branches can be explored in parallel.
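The sequential-versus-parallel structure of step-level beam search is visible in a small sketch. Here `expand` and `score` are hypothetical callables standing in for the language model (proposing next reasoning steps) and a PRM (scoring partial solutions); the toy task of reaching a target sum exists only to make the example runnable.

```python
from typing import Callable, List, Sequence, Tuple

Path = Tuple[str, ...]  # a partial solution: the reasoning steps so far


def beam_search(
    expand: Callable[[Path], Sequence[str]],
    score: Callable[[Path], float],
    beam_width: int,
    max_steps: int,
) -> Path:
    """Keep the beam_width best partial solutions after every step."""
    beams: List[Path] = [()]
    for _ in range(max_steps):  # sequential: step t depends on step t - 1
        # Expanding and scoring different beams is independent work,
        # so in a real system it could be batched in parallel.
        candidates = [path + (step,) for path in beams for step in expand(path)]
        if not candidates:
            break
        candidates.sort(key=score, reverse=True)
        beams = candidates[:beam_width]
    return max(beams, key=score)


TARGET = 7  # toy goal: choose three digits that sum to 7


def toy_expand(path: Path) -> Sequence[str]:
    """Stand-in for the LLM proposing next steps (assumption)."""
    return ["1", "2", "3"]


def toy_prm(path: Path) -> float:
    """Stand-in for a PRM: higher when the running sum is nearer the target."""
    return -abs(TARGET - sum(int(s) for s in path))


best_path = beam_search(toy_expand, toy_prm, beam_width=2, max_steps=3)
```

The beam width is the knob that trades compute for coverage: a width of 1 reduces this to greedy step selection, while a larger width keeps more alternatives alive at each depth.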
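TTS is dominated by LLM decoding, which is typically memory-bandwidth-bound and therefore benefits from fast GPUs with high memory bandwidth; tensor cores help most when many samples are decoded together in a batch.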
TPUs handle batched decoding of many parallel samples well, which makes them a good match for best-of-N strategies.