Inference

Test-Time Scaling (TTS)

2024 · Active · Published: 30 May 2026 · Updated: 30 May 2026

Key innovation

Scaling the amount of compute used at inference time — rather than at pretraining — as an alternative axis for improving model quality.

How it works

TTS comes in three broad flavours:

(1) Parallel scaling: the model generates N independent samples and a verifier or aggregation rule picks the best, for example majority voting (self-consistency), best-of-N with a reward model, or re-ranking.

(2) Sequential scaling: the model produces a long chain of thought, explicit or hidden, then critiques its own answer and iteratively revises it (self-refinement).

(3) Search-based scaling: beam search or Monte Carlo tree search (MCTS) over a tree of partial solutions, guided by a Process Reward Model (PRM) that scores each reasoning step.

Snell et al. (2024) further introduced a "compute-optimal" strategy that allocates the test-time budget adaptively according to prompt difficulty. Frontier reasoning models such as OpenAI o1/o3 and DeepSeek R1 internalise this paradigm: instead of running an external search procedure, they are trained with RL to produce very long reasoning chains within a single response.
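The parallel flavour is the easiest to illustrate. The following is a minimal sketch, not any particular system's implementation: `toy_sampler` and `toy_verifier` are hypothetical stand-ins for a sampled model answer and a reward model, and only `majority_vote` and `best_of_n` correspond to the aggregation rules named above.

```python
import random
from collections import Counter
from typing import Callable, List


def majority_vote(answers: List[str]) -> str:
    """Self-consistency: return the most frequent final answer."""
    return Counter(answers).most_common(1)[0][0]


def best_of_n(candidates: List[str], score: Callable[[str], float]) -> str:
    """Best-of-N: return the candidate that the verifier scores highest."""
    return max(candidates, key=score)


def toy_sampler(rng: random.Random) -> str:
    """Hypothetical stand-in for one sampled answer (correct about 60% of the time)."""
    return "42" if rng.random() < 0.6 else rng.choice(["41", "43", "44"])


def toy_verifier(answer: str) -> float:
    """Hypothetical reward model: answers closer to 42 score higher."""
    return -abs(int(answer) - 42)


rng = random.Random(0)
samples = [toy_sampler(rng) for _ in range(16)]  # N = 16 independent samples
print("majority vote:", majority_vote(samples))
print("best-of-N:   ", best_of_n(samples, toy_verifier))
```

Majority voting needs no verifier but only works when final answers can be compared for equality; best-of-N works for free-form outputs but is only as good as the scoring function.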

Problem solved

Classical scaling laws (Kaplan et al., 2020; Chinchilla, Hoffmann et al., 2022) tie model quality mainly to parameter count and training data. That kind of scaling is increasingly expensive and shows diminishing returns. Test-time scaling asks how to substantially improve answer quality after training is finished: spend extra compute at inference, ideally only on hard prompts, instead of training a larger model.

Implementation

Implementation pitfalls

Reward hacking in PRMs · High

Process Reward Models can be exploited by policies that produce text scoring high under the PRM but not actually leading to correct answers.

Fix: Combine PRMs with outcome reward models (ORMs), apply KL regularisation, and evaluate on hard held-out sets.
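One possible way to combine the two signals is a weighted blend of a process score and an outcome score. The sketch below is an illustration under stated assumptions: the function name `combined_score`, the `weight` parameter, and the choice of the minimum step score as the process aggregate are all hypothetical, not a method prescribed by the sources cited here.

```python
from typing import List


def combined_score(step_scores: List[float], outcome_score: float,
                   weight: float = 0.5) -> float:
    """Blend a PRM aggregate with an ORM score (both assumed to lie in [0, 1]).

    The process score is the weakest step (an assumption), so one bad step
    drags the whole chain down; the ORM term penalises chains whose steps
    look plausible but whose final answer is judged wrong.
    """
    process = min(step_scores) if step_scores else 0.0
    return weight * process + (1.0 - weight) * outcome_score


# A chain that games the PRM: every step scores high, the outcome does not.
hacked = combined_score([0.95, 0.90, 0.92], outcome_score=0.1)
# A less polished chain that actually reaches a correct answer.
honest = combined_score([0.80, 0.70, 0.85], outcome_score=0.9)
print(hacked, honest)
```

With the PRM alone, the first chain would rank higher; with the blend, the second one does.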

Diminishing returns beyond some N / CoT length · Medium

Best-of-N and CoT-length curves flatten out; without compute-optimal allocation it is easy to burn the budget without quality gains.

Fix: Allocate compute adaptively by prompt difficulty, and stop early once verifier confidence is reached.
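Early stopping can be sketched as sampling until the answers agree strongly enough. In this minimal example the vote share of the leading answer stands in for verifier confidence, which is an assumption; `adaptive_vote`, `min_samples`, and `threshold` are hypothetical names chosen for illustration.

```python
from collections import Counter
from typing import Callable, Tuple


def adaptive_vote(sample: Callable[[], str], max_samples: int = 32,
                  min_samples: int = 4, threshold: float = 0.75) -> Tuple[str, int]:
    """Draw samples one at a time; stop once the leading answer's vote share
    reaches `threshold`. Returns (answer, number of samples actually used)."""
    if max_samples < 1:
        raise ValueError("max_samples must be at least 1")
    counts: Counter = Counter()
    answer = ""
    for n in range(1, max_samples + 1):
        counts[sample()] += 1
        answer, votes = counts.most_common(1)[0]
        if n >= min_samples and votes / n >= threshold:
            return answer, n  # confident enough: stop spending compute
    return answer, max_samples  # budget exhausted without reaching the threshold


# An "easy" prompt where every sample agrees stops at min_samples.
print(adaptive_vote(lambda: "7"))  # -> ('7', 4)
```

Easy prompts exit after a handful of samples, while prompts with scattered answers consume the full budget, which is the difficulty-aware behaviour described above.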

High inference latency and cost · Medium

TTS shifts cost from training onto every single inference, making it ill-suited for low-latency or high-throughput applications.

Fix: Route easy queries to a cheaper model and apply TTS only to hard prompts (cascade / mixture-of-deciders).
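A cascade of this kind can be sketched in a few lines. The example assumes each model returns an answer together with a confidence in [0, 1]; that interface, the `cascade` function, and the two toy models are hypothetical and only illustrate the routing logic.

```python
from typing import Callable, Tuple

# Assumed interface: a model maps a prompt to (answer, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]


def cascade(prompt: str, cheap: Model, expensive: Model,
            min_confidence: float = 0.8) -> Tuple[str, str]:
    """Try the cheap model first; escalate only when it is not confident.
    Returns (answer, name of the route that produced it)."""
    answer, confidence = cheap(prompt)
    if confidence >= min_confidence:
        return answer, "cheap"
    answer, _ = expensive(prompt)  # e.g. a model run with a large TTS budget
    return answer, "expensive"


def toy_cheap(prompt: str) -> Tuple[str, float]:
    """Hypothetical fast model: confident only on one memorised prompt."""
    return ("4", 0.95) if prompt == "2+2" else ("?", 0.2)


def toy_expensive(prompt: str) -> Tuple[str, float]:
    """Hypothetical slow model that applies test-time scaling."""
    return ("carefully reasoned answer", 0.9)


print(cascade("2+2", toy_cheap, toy_expensive))
print(cascade("prove the lemma", toy_cheap, toy_expensive))
```

The quality of the routing depends entirely on how well calibrated the cheap model's confidence is; a poorly calibrated signal either escalates too often or lets wrong answers through.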

Evolution

Original paper · 2024 · arXiv:2408.03314 (2024) · Charlie Snell

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

Charlie Snell, Jaehoon Lee, Kelvin Xu, Aviral Kumar

2022

Chain-of-Thought prompting (Wei et al.)

Showed that explicit reasoning steps in the prompt substantially improve performance on math and logic — an early form of sequential test-time scaling.

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (paper)

2022

Self-Consistency (Wang et al.)

Sampling many reasoning chains and majority-voting the final answer — the canonical instance of parallel test-time scaling.

Self-Consistency Improves Chain of Thought Reasoning in Language Models (paper)

2023

Process Reward Models (Lightman et al., "Let's Verify Step by Step")

Training verifiers that score the correctness of every reasoning step, a key building block of test-time search.

Let's Verify Step by Step (paper)

2024

Snell et al. — compute-optimal test-time scaling

Inflection point

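Formalised test-time scaling as a distinct scaling axis; showed that, on problems where a smaller base model already attains non-trivial success rates, compute-optimal allocation of test-time compute can outperform a 14× larger model in a FLOPs-matched evaluation.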

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters (paper)

2024

OpenAI o1 — internalised long-chain reasoning

Inflection point

Release of o1, whose performance scales both with RL training compute and with test-time "thinking" budget. Brought test-time scaling into consumer products.

Learning to reason with LLMs (OpenAI blog post)

2025

DeepSeek R1 — open-weights reasoning via RL

First widely available open-weights reasoning model with long RL-trained chain-of-thought, replicating the o1 effect in the open ecosystem.

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (paper)