The algorithm starts by sampling chain-of-thought paths exactly as Self-Consistency does. After each new sample it updates the count and relative frequency of every unique answer, then computes the posterior probability that the current majority answer is the true population majority. The posterior is a Dirichlet over the multinomial distribution of answers (a Beta in the two-answer case), with a non-informative prior such as uniform or Jeffreys. If the posterior mass on this event exceeds a confidence threshold (e.g. 0.95), sampling stops and the majority answer is returned. Otherwise sampling continues until the cap of K samples is reached, at which point the majority answer is returned as in standard Self-Consistency. The implementation amounts to a few lines of code on top of the standard Self-Consistency loop.
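The loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the Beta form of the criterion applied to the leading answer versus the runner-up, with a uniform prior, and the names `majority_confidence`, `adaptive_consistency`, `sample_answer` and `min_samples` are hypothetical.

```python
from collections import Counter
from math import comb
from typing import Callable, Hashable, Tuple


def majority_confidence(top: int, runner_up: int) -> float:
    """P(p > 0.5) for p ~ Beta(top + 1, runner_up + 1), i.e. a uniform prior.

    For integer parameters the Beta tail equals an exact binomial sum,
    so no numerical integration is needed.
    """
    n = top + runner_up + 1
    return sum(comb(n, j) for j in range(top + 1)) / 2 ** n


def adaptive_consistency(
    sample_answer: Callable[[], Hashable],  # assumed: draws one CoT path, returns its answer
    max_samples: int,                       # the cap K
    threshold: float = 0.95,
    min_samples: int = 1,
) -> Tuple[Hashable, int]:
    """Return (majority answer, number of samples actually drawn)."""
    if max_samples < 1:
        raise ValueError("max_samples must be at least 1")
    counts: Counter = Counter()
    drawn = 0
    while drawn < max_samples:
        counts[sample_answer()] += 1
        drawn += 1
        ranked = counts.most_common(2)
        top = ranked[0][1]
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        # Stop as soon as the posterior mass on "leader is the true majority"
        # exceeds the threshold (and the warm-up has passed).
        if drawn >= min_samples and majority_confidence(top, runner_up) > threshold:
            break
    return counts.most_common(1)[0][0], drawn
```

Under these assumptions, a stream of identical answers stops after four samples at a threshold of 0.95, because the posterior mass is then 31/32, while three unanimous samples give only 15/16.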
Self-Consistency uses a fixed number of samples K regardless of instance difficulty. For easy instances consensus is reached after only a few samples, so the remaining samples are wasted compute; for hard ones K may be too small. Adaptive-Consistency allocates the sampling budget dynamically per instance, stopping once the current majority answer is established with sufficient statistical confidence.
A low threshold (e.g. 0.7) causes premature stopping on an apparent consensus, which is especially harmful when the base LLM strongly prefers a wrong answer.
If mathematically equivalent answers have different string representations (e.g. "1/2" vs "0.5"), they are counted as different answers. This adds noise to the answer distribution and makes the threshold harder to reach.
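One way to reduce this effect is to map answers to a canonical key before counting. The sketch below is an assumption about how that could be done for numeric answers, using Python's standard `fractions.Fraction`; the helper name `canonical_answer` is hypothetical, and non-numeric answers are passed through with only whitespace removed.

```python
from collections import Counter
from fractions import Fraction


def canonical_answer(raw: str) -> str:
    """Map numerically equal answers to one key; leave other text unchanged."""
    text = raw.strip()
    try:
        # Fraction parses both "1/2" and "0.5" and reduces to lowest terms.
        return str(Fraction(text))
    except (ValueError, ZeroDivisionError):
        # Not a plain number (or a zero denominator): keep the cleaned string.
        return text


answers = ["1/2", "0.5", " 0.50 ", "2/3"]
counts = Counter(canonical_answer(a) for a in answers)
```

Here the three spellings of one half collapse into a single key, so they reinforce each other in the vote instead of splitting it. Answers with units, LaTeX markup or symbolic expressions would need a task-specific normalizer.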
Checking the threshold after only 1–2 samples can yield an over-confident posterior and premature stopping; how severe this is depends on the choice of prior.
Wang et al. show that sampling many CoT paths and taking a majority vote over their answers markedly improves LLM reasoning. This is the direct starting point for Adaptive-Consistency.
Aggarwal et al. introduce an adaptive stopping criterion based on a Beta posterior over the majority answer frequency and demonstrate ~3× sample reduction at comparable accuracy.
Upper bound on the number of CoT paths generated per instance (plays the role of K in Self-Consistency).
Posterior mass threshold on the event "the current majority answer is the true majority", above which sampling is stopped.
Choice of prior for the Beta/Dirichlet distribution (e.g. uniform, Jeffreys). Affects behavior with few samples.
Number of samples drawn before the algorithm starts checking the stopping criterion — guards against premature stopping on very small samples.
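To see how the prior and the warm-up length interact, the sketch below evaluates the stopping statistic after a few unanimous samples under a uniform and a Jeffreys prior. It assumes the two-answer Beta model with a symmetric prior pseudo-count; `majority_confidence`, `prior` and `steps` are hypothetical names, and the integral is computed with a plain midpoint rule from the standard library rather than any particular statistics package.

```python
from math import cos, exp, lgamma, pi, sin


def majority_confidence(top: int, runner_up: int, prior: float = 1.0,
                        steps: int = 2000) -> float:
    """P(p > 0.5) for p ~ Beta(top + prior, runner_up + prior), prior >= 0.5.

    Substituting p = sin(t)**2 turns the tail integral into
    2 * sin(t)**(2a-1) * cos(t)**(2b-1) over t in [pi/4, pi/2], which stays
    bounded for a, b >= 0.5, so a midpoint rule is adequate here.
    """
    a, b = top + prior, runner_up + prior
    log_norm = lgamma(a + b) - lgamma(a) - lgamma(b)  # log of 1 / B(a, b)
    width = (pi / 4) / steps
    total = 0.0
    for i in range(steps):
        t = pi / 4 + (i + 0.5) * width
        total += 2.0 * sin(t) ** (2 * a - 1) * cos(t) ** (2 * b - 1)
    return total * width * exp(log_norm)


# Posterior mass after n unanimous samples: prior=1.0 is uniform, 0.5 is Jeffreys.
for n in range(1, 5):
    uniform = majority_confidence(n, 0, prior=1.0)
    jeffreys = majority_confidence(n, 0, prior=0.5)
    print(f"{n} unanimous: uniform={uniform:.3f} jeffreys={jeffreys:.3f}")
```

In this model the uniform prior gives 1 - 2^-(n+1) after n unanimous samples, so a 0.95 threshold is first exceeded at four samples, whereas the Jeffreys prior already exceeds it at three. A threshold of 0.7 is exceeded after a single sample under either prior, which is the situation a minimum-sample warm-up guards against.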
The number of paths generated depends on the current state of the answer distribution: generation is conditional and the budget is dynamic.
Samples are only logically sequential — nothing prevents sampling in small batches and evaluating the stopping criterion after each batch, recovering parallelism at the cost of a small efficiency loss.
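A batched variant of the loop might look like the sketch below. It is an illustration under assumptions: `sample_batch` is a hypothetical callable that returns exactly the requested number of answers (for example from one batched generation call), and `confidence` is whatever stopping statistic is in use, taking the counts of the leading answer and the runner-up.

```python
from collections import Counter
from typing import Callable, Hashable, List, Tuple


def batched_adaptive_consistency(
    sample_batch: Callable[[int], List[Hashable]],  # assumed: returns exactly k answers
    confidence: Callable[[int, int], float],        # assumed: (top, runner_up) -> posterior mass
    max_samples: int,
    batch_size: int = 4,
    threshold: float = 0.95,
) -> Tuple[Hashable, int]:
    """Return (majority answer, samples drawn), checking the criterion once per batch."""
    if max_samples < 1 or batch_size < 1:
        raise ValueError("max_samples and batch_size must be at least 1")
    counts: Counter = Counter()
    drawn = 0
    while drawn < max_samples:
        size = min(batch_size, max_samples - drawn)  # never exceed the cap
        counts.update(sample_batch(size))
        drawn += size
        ranked = counts.most_common(2)
        top = ranked[0][1]
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        if confidence(top, runner_up) > threshold:
            break
    return counts.most_common(1)[0][0], drawn
```

The efficiency loss comes from overshoot: the criterion may already be satisfied partway through a batch, but the remaining samples in that batch have been generated anyway. Smaller batches reduce the overshoot at the cost of more sequential rounds.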
CoT samples can be batched on GPUs; evaluating the stopping criterion is negligibly cheap compared to generation cost.