The model is run N times in parallel (or sequentially with different seeds or temperatures). The outputs are then scored by a reward model or verifier, or aggregated by majority voting. The highest-scoring (or most frequent) response is returned to the user.
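The selection loop described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `best_of_n`, `toy_generate`, and `toy_score` are hypothetical names, and the toy generator and scorer merely stand in for a real model call and a real reward model or verifier.

```python
import random
from typing import Callable, Tuple


def best_of_n(
    prompt: str,
    generate: Callable[[str, int], str],
    score: Callable[[str, str], float],
    n: int = 4,
) -> Tuple[str, float]:
    """Draw n candidates (one per seed) and return the best one with its score."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(prompt, seed) for seed in range(n)]
    scored = [(c, score(prompt, c)) for c in candidates]
    return max(scored, key=lambda pair: pair[1])


# Toy stand-ins (assumptions): a real system would call a model and a scorer here.
def toy_generate(prompt: str, seed: int) -> str:
    rng = random.Random(seed)  # a different seed yields a different candidate
    return f"{prompt} -> {rng.randint(0, 100)}"


def toy_score(prompt: str, response: str) -> float:
    # Pretend the trailing number is the quality of the response.
    return float(response.rsplit(" ", 1)[-1])


if __name__ == "__main__":
    answer, quality = best_of_n("2+2?", toy_generate, toy_score, n=8)
    print(answer, quality)
```

Because each candidate depends only on the prompt and its seed, the list comprehension that builds `candidates` is the part that could be distributed across workers.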
Standard next-token sampling produces a single response of limited quality. Parallel test-time compute (TTC) trades additional time and compute cost for higher accuracy.
Majority voting works for categorical answers, but for open-ended generation (essays, code) there is no simple aggregation method. Best-of-N requires a strong verifier or reward model.
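For the categorical case, majority voting reduces to counting normalized answers. The sketch below assumes each candidate has already been reduced to a short final-answer string; `majority_vote` is a hypothetical helper name, and the normalization (strip and lowercase) is only one possible choice.

```python
from collections import Counter
from typing import List, Tuple


def majority_vote(answers: List[str]) -> Tuple[str, float]:
    """Return the most frequent normalized answer and its share of the votes."""
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(a.strip().lower() for a in answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


if __name__ == "__main__":
    print(majority_vote(["42", "41", "42 ", "42"]))
```

The same counting does not carry over to essays or code, where two correct outputs rarely match as strings. This is why those cases fall back on a scorer.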
N parallel samples = N× inference cost. At high N the quality gain saturates while cost grows, so the optimal cost/quality ratio point must be empirically determined.
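A simplified model makes the saturation concrete. Assume, purely for illustration, that each sample is independently correct with probability p and that the verifier always recognizes a correct sample. The chance that at least one of N samples is correct is then 1 - (1 - p)^N, while cost grows linearly in N. Real systems violate both assumptions, so the numbers are indicative only. The function names are hypothetical.

```python
def coverage(p: float, n: int) -> float:
    """P(at least one of n independent samples is correct), per-sample accuracy p."""
    return 1.0 - (1.0 - p) ** n


def marginal_gain(p: float, n: int) -> float:
    """Extra coverage bought by the n-th sample (its cost is one more inference)."""
    return coverage(p, n) - coverage(p, n - 1)


def smallest_n_for(p: float, target: float, max_n: int = 1024) -> int:
    """Smallest n whose coverage reaches the target, or -1 if max_n is not enough."""
    for n in range(1, max_n + 1):
        if coverage(p, n) >= target:
            return n
    return -1


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16, 32):
        print(n, round(coverage(0.3, n), 4), round(marginal_gain(0.3, n), 4))
```

Under this model the gain from each extra sample shrinks geometrically, by a factor of (1 - p) per sample, while every sample costs the same. This is the shape of the trade-off described above.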
The number of candidates (N) can be fixed or scale adaptively with query difficulty.
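One possible way to scale N with difficulty is to sample incrementally and stop once the answers agree strongly enough, treating agreement as a proxy for an easy query. The sketch below is an assumption-laden illustration, not a prescribed method: `adaptive_sample`, its thresholds, and the `sample` callback are all hypothetical.

```python
from collections import Counter
from typing import Callable, List, Tuple


def adaptive_sample(
    sample: Callable[[int], str],
    min_n: int = 4,
    max_n: int = 32,
    agreement: float = 0.75,
) -> Tuple[str, int]:
    """Draw samples one at a time; stop early once the top answer's share
    reaches `agreement`. Returns (top answer, number of samples used)."""
    answers: List[str] = []
    for i in range(max_n):
        answers.append(sample(i))
        if len(answers) >= min_n:
            top, votes = Counter(answers).most_common(1)[0]
            if votes / len(answers) >= agreement:
                return top, len(answers)
    # Budget exhausted: fall back to the plurality answer.
    top, _ = Counter(answers).most_common(1)[0]
    return top, len(answers)


if __name__ == "__main__":
    print(adaptive_sample(lambda i: "A"))                      # unanimous answers
    print(adaptive_sample(lambda i: ["A", "B", "C", "D"][i % 4]))  # no consensus
```

Sampling one at a time gives up some parallelism. Drawing in small batches instead would keep most of the savings while still using several devices at once.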
Each candidate can be generated independently on a separate GPU/TPU.
Each candidate can be generated on a dedicated GPU; NVLink-based architectures (e.g., NVIDIA GB200 NVL72) enable efficient parallelization.
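Since candidates are independent, dispatching them is a plain fan-out. The sketch below uses threads from the Python standard library as a stand-in for per-device workers. `generate_on_device` is a hypothetical placeholder; a real deployment would send the request to a model replica pinned to that device.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def generate_on_device(prompt: str, device_id: int) -> str:
    # Hypothetical placeholder for a call to the model replica on `device_id`.
    return f"[device {device_id}] response to: {prompt}"


def fan_out(prompt: str, n_devices: int) -> List[str]:
    """Request one candidate per device concurrently; results keep device order."""
    with ThreadPoolExecutor(max_workers=n_devices) as pool:
        futures = [
            pool.submit(generate_on_device, prompt, d) for d in range(n_devices)
        ]
        return [f.result() for f in futures]


if __name__ == "__main__":
    for candidate in fan_out("Explain NVLink briefly.", 4):
        print(candidate)
```

No candidate waits on another, so wall-clock time is roughly that of the slowest single generation, plus the scoring step.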