Inference proceeds in rounds, capped at _MAX_OUTER outer steps. Phase 0: the controller opens n_init parallel reasoning branches. In each round it:
(1) computes pool statistics over completed answers (winner, top1, top2, Beta-majority confidence);
(2) updates the EMA of pool confidence with coefficient ema_alpha;
(3) classifies active branches as aligned/deviant/neutral once warm_up has elapsed, incrementing disagree_rounds for deviants;
(4) abandons branches whose disagree_rounds ≥ abandon_patience while keeping at least 2 active;
(5) allocates probe_budget across active branches sorted by probe_count descending (most-invested first), with a burst_aligned multiplier for aligned branches;
(6) updates the EMA again;
(7) computes ema_delta = ema_history[-1] − ema_history[0] over a T_ema window;
(8) evaluates the gate, gate_fires := warm_enough ∧ n_complete ≥ min_complete ∧ ema_conf ≥ conf_thresh ∧ ema_delta ≥ −delta_slack, and returns pool_winner if it is satisfied;
(9) if not stopped, checks widening: when ema_delta ≤ trend_thresh (flat/negative trend) and ema_conf < conf_thresh, it spawns widen_burst new branches, up to the max_branch_use cap.
The loop terminates when the gate fires, when all branches are resolved, or when outer_step reaches _MAX_OUTER (in which case it returns the majority over the branches' latest_ans values).
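The EMA update, the momentum gate and the widening check described above can be sketched as follows. This is a minimal illustration, not the controller's actual code: the class and function names (MomentumState, gate_fires, n_to_widen) are hypothetical, warm_enough is assumed to mean outer_step ≥ warm_up, and ema_alpha is assumed to weight the newest observation.

```python
from collections import deque
from typing import Optional


class MomentumState:
    """EMA of pool confidence plus a window of the last t_ema EMA values."""

    def __init__(self, ema_alpha: float, t_ema: int):
        self.ema_alpha = ema_alpha
        self.ema_conf: Optional[float] = None
        self.ema_history = deque(maxlen=t_ema)  # oldest value drops out automatically

    def update(self, pool_conf: float) -> float:
        # Assumed convention: ema_alpha is the weight of the newest observation.
        if self.ema_conf is None:
            self.ema_conf = pool_conf
        else:
            self.ema_conf = (self.ema_alpha * pool_conf
                             + (1.0 - self.ema_alpha) * self.ema_conf)
        self.ema_history.append(self.ema_conf)
        return self.ema_conf

    @property
    def ema_delta(self) -> float:
        # ema_history[-1] - ema_history[0] over the window; 0.0 before any update.
        if not self.ema_history:
            return 0.0
        return self.ema_history[-1] - self.ema_history[0]


def gate_fires(state, outer_step, n_complete, *,
               warm_up, min_complete, conf_thresh, delta_slack) -> bool:
    """Momentum gate: high EMA level and a trend no worse than -delta_slack."""
    if state.ema_conf is None:
        return False
    warm_enough = outer_step >= warm_up  # assumption about how warm_enough is defined
    return (warm_enough
            and n_complete >= min_complete
            and state.ema_conf >= conf_thresh
            and state.ema_delta >= -delta_slack)


def n_to_widen(state, n_spawned, *,
               conf_thresh, trend_thresh, widen_burst, max_branch_use) -> int:
    """Number of new branches to spawn this round (0 means no widening)."""
    if state.ema_conf is None:
        return 0
    if state.ema_delta <= trend_thresh and state.ema_conf < conf_thresh:
        return max(0, min(widen_burst, max_branch_use - n_spawned))
    return 0
```

In this sketch the caller is responsible for calling update at both EMA steps of a round, then reading ema_delta before evaluating the gate and, if it does not fire, the widening check.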
Earlier adaptive TTS controllers (ASC, ESC, Parallel-Probe, and the early proposals IBC/SCR/DGCC) rely on instantaneous pool confidence: a lucky early cluster of identical answers can fire the stopping gate before the answer distribution has stabilised, causing premature stops. Furthermore, width decisions (how many branches to spawn) and depth decisions (how much to probe) are decoupled, so budget allocation does not react to actual progress in evidence quality. CMC addresses both: the momentum gate requires a high EMA level and, simultaneously, an EMA trend that has not fallen by more than delta_slack, and widening is coupled to the same trend signal.
A high ema_alpha (≈0.7) makes the EMA degenerate to near-instantaneous confidence, killing the anti-spike effect: the gate may fire on a single random answer cluster.
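A small numerical illustration of this pitfall, under stated assumptions: the confidence sequence is invented for the example, ema_alpha is taken to weight the newest observation, and only the EMA-level condition is compared against a threshold of 0.85 (the other gate conditions are ignored here).

```python
def ema_trace(confidences, ema_alpha):
    """Return the EMA value after each observation (first value seeds the EMA)."""
    ema = confidences[0]
    trace = [ema]
    for c in confidences[1:]:
        ema = ema_alpha * c + (1.0 - ema_alpha) * ema
        trace.append(ema)
    return trace


# Hypothetical pool confidence: two moderate rounds, then a one-round spike.
pool_conf = [0.60, 0.60, 1.00]
conf_thresh = 0.85

fast = ema_trace(pool_conf, ema_alpha=0.7)  # close to instantaneous confidence
slow = ema_trace(pool_conf, ema_alpha=0.3)  # more inertia

fast_crosses = fast[-1] >= conf_thresh  # spike alone lifts the EMA past the threshold
slow_crosses = slow[-1] >= conf_thresh  # spike is damped and stays below it
```

With the high coefficient the single spike is enough to satisfy the level condition, while the lower coefficient needs the high confidence to persist for several rounds.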
A naive port of branch abandonment may cut all but one branch; the controller then has no answer pool from which to pick a winner and degenerates to single-path decoding.
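One way to guard against this is to enforce a floor on the number of active branches while abandoning. The sketch below is illustrative only: the dictionary layout of a branch and the function name abandon_deviants are hypothetical, the floor of 2 is the one stated in the round description, and abandoning the most persistent deviants first is an assumed tie-break.

```python
def abandon_deviants(branches, abandon_patience, min_active=2):
    """Deactivate long-standing deviants but never drop below min_active branches.

    branches: list of dicts with keys 'id', 'active', 'disagree_rounds'
    (a hypothetical layout chosen for this example).
    Returns the ids of the branches that were abandoned.
    """
    active = [b for b in branches if b["active"]]
    # Assumed ordering: the most persistent deviants are abandoned first.
    candidates = sorted(
        (b for b in active if b["disagree_rounds"] >= abandon_patience),
        key=lambda b: b["disagree_rounds"],
        reverse=True,
    )
    n_active = len(active)
    abandoned = []
    for b in candidates:
        if n_active <= min_active:
            break  # floor reached: keep the remaining deviants alive
        b["active"] = False
        n_active -= 1
        abandoned.append(b["id"])
    return abandoned


pool = [
    {"id": 0, "active": True, "disagree_rounds": 5},
    {"id": 1, "active": True, "disagree_rounds": 4},
    {"id": 2, "active": True, "disagree_rounds": 3},
    {"id": 3, "active": True, "disagree_rounds": 0},
]
dropped = abandon_deviants(pool, abandon_patience=3)
```

Here three branches exceed the patience, but only two are abandoned, so two branches remain to form an answer pool.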
The β schedule was discovered on Qwen3 + AIME24. Different backbones or task families may need conf_thresh / ema_alpha shifts to avoid regression.
Wang et al. introduce multi-path CoT sampling with majority voting — the baseline benchmark CMC ultimately surpasses on the Pareto frontier.
Aggarwal et al. introduce instantaneous Beta-majority confidence with a stopping threshold — the first adaptive baseline that CMC supersedes with a momentum gate.
Zheng et al. introduce an explicit width-and-depth axis with consensus-based pruning — a direct structural ancestor of CMC.
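For reference, the instantaneous Beta-majority confidence attributed above to Aggarwal et al. can be sketched as the posterior probability that the leading answer holds a true majority over the runner-up. The exact form used by CMC is not given here, so the following is an assumption: a Beta(top1 + 1, top2 + 1) posterior over the two most frequent answers, evaluated in closed form through the binomial-sum identity for integer parameters. The function name is hypothetical.

```python
from collections import Counter
from math import comb


def beta_majority_confidence(answers):
    """Return (winner, top1, top2, conf) for a list of completed answers.

    conf = P(p > 0.5) for p ~ Beta(top1 + 1, top2 + 1), where top1 and top2 are
    the counts of the two most frequent answers (assumed form of the statistic).
    """
    if not answers:
        return None, 0, 0, 0.0
    ranked = Counter(answers).most_common(2)
    winner, top1 = ranked[0]
    top2 = ranked[1][1] if len(ranked) > 1 else 0
    # For integer parameters the Beta tail equals a binomial CDF:
    # P(p > 0.5) = sum_{j=0}^{top1} C(n, j) / 2^n with n = top1 + top2 + 1.
    n = top1 + top2 + 1
    conf = sum(comb(n, j) for j in range(top1 + 1)) / 2 ** n
    return winner, top1, top2, conf


winner, top1, top2, conf = beta_majority_confidence(["42", "42", "42", "7"])
```

A single completed answer gives a confidence of 0.75 under this form, which illustrates why a small lucky cluster can look confident when the statistic is used without smoothing.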
A coding agent in a replay environment iteratively discovers CMC: the first TTS controller with an EMA momentum gate and coupled width–depth control, establishing a new Pareto frontier on AIME25/HMMT25.
β: scalar in [0, 1] that drives every CMC hyperparameter monotonically (0 = budget-conservative, 1 = accuracy-first).
n_init: number of reasoning branches spawned in phase 0; increases with β (round(2 + 6·β)).
max_branch_use: hard upper cap on total branches spawned; increases with β (round(4 + 60·β), up to 64).
ema_alpha: EMA coefficient on pool confidence; decreases with β (0.70 − 0.40·β), so higher β means more inertia.
conf_thresh: EMA level required to stop; increases with β (0.85 + 0.12·β).
delta_slack: tolerated negative EMA slope at stopping time; decreases with β (0.04 − 0.03·β).
burst_aligned: probe steps per round for aligned branches; increases with β (max(1, round(1 + 2·β))).
widen_burst: how many new branches to spawn when the EMA trend is weak; increases with β (max(1, round(1 + 3·β))).
trend_thresh: ema_delta threshold at or below which widening fires; decreases with β (0.04 − 0.03·β).
abandon_patience: number of deviant rounds before a branch is abandoned; increases with β (max(3, round(3 + 9·β))).
warm_up: rounds before gate evaluation and branch classification; increases with β (max(2, round(2 + 8·β))).
T_ema: window length for computing ema_delta; increases with β (max(2, round(2 + 6·β))).
min_complete: minimum number of completed answers before the gate may fire; increases with β (max(2, round(2 + 3·β))).
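The schedule above can be collected into a single function of β. This is a sketch under assumptions: the field names are taken from the round description, and pairing each formula with a name is inferred from the descriptions rather than stated explicitly. Python's round uses round-half-to-even, which may differ from the rounding used in the original implementation at exact .5 values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CMCParams:
    n_init: int
    max_branch_use: int
    ema_alpha: float
    conf_thresh: float
    delta_slack: float
    burst_aligned: int
    widen_burst: int
    trend_thresh: float
    abandon_patience: int
    warm_up: int
    t_ema: int
    min_complete: int


def cmc_schedule(beta: float) -> CMCParams:
    """Map the scalar beta in [0, 1] to the full hyperparameter set."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return CMCParams(
        n_init=round(2 + 6 * beta),
        max_branch_use=round(4 + 60 * beta),
        ema_alpha=0.70 - 0.40 * beta,      # lower coefficient means more inertia
        conf_thresh=0.85 + 0.12 * beta,
        delta_slack=0.04 - 0.03 * beta,
        burst_aligned=max(1, round(1 + 2 * beta)),
        widen_burst=max(1, round(1 + 3 * beta)),
        trend_thresh=0.04 - 0.03 * beta,
        abandon_patience=max(3, round(3 + 9 * beta)),
        warm_up=max(2, round(2 + 8 * beta)),
        t_ema=max(2, round(2 + 6 * beta)),
        min_complete=max(2, round(2 + 3 * beta)),
    )


conservative = cmc_schedule(0.0)  # budget-conservative end of the schedule
accurate = cmc_schedule(1.0)      # accuracy-first end of the schedule
```

Sweeping β over a grid and running the controller at each value is one way to trace an accuracy versus budget curve from a single knob.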
Conditional / dynamic mode: the number of active branches and probe allocation depend on the current EMA state and branch classification. All thresholds and multipliers are a deterministic function of the scalar β.
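The dependence of probe allocation on branch classification can be sketched as below. The precise allocation rule is not specified here, so several details are assumptions made for illustration: each non-aligned active branch requests one probe step, each aligned branch requests burst_aligned steps, and requests are served most-invested first until probe_budget is exhausted. The branch layout and function name are hypothetical.

```python
def allocate_probes(branches, probe_budget, burst_aligned):
    """Return {branch_id: probe_steps} for this round.

    branches: list of dicts with keys 'id', 'active', 'probe_count', 'label',
    where label is 'aligned', 'deviant' or 'neutral' (hypothetical layout).
    """
    # Most-invested branches are served first (probe_count descending).
    order = sorted(
        (b for b in branches if b["active"]),
        key=lambda b: b["probe_count"],
        reverse=True,
    )
    allocation = {}
    remaining = probe_budget
    for b in order:
        if remaining <= 0:
            break
        # Assumed rule: aligned branches get a burst, all others a single step.
        want = burst_aligned if b["label"] == "aligned" else 1
        steps = min(want, remaining)
        allocation[b["id"]] = steps
        remaining -= steps
    return allocation


branches = [
    {"id": "a", "active": True, "probe_count": 5, "label": "aligned"},
    {"id": "b", "active": True, "probe_count": 3, "label": "neutral"},
    {"id": "c", "active": True, "probe_count": 7, "label": "deviant"},
    {"id": "d", "active": True, "probe_count": 1, "label": "aligned"},
    {"id": "e", "active": False, "probe_count": 9, "label": "deviant"},
]
plan = allocate_probes(branches, probe_budget=5, burst_aligned=3)
```

In this example the budget runs out before the least-invested branch is reached, which is the intended effect of serving the most-invested branches first.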
Probing and branch spawning are independent and map onto batched LLM execution; synchronization points (pool stats, EMA, gate evaluation) are negligible relative to decoding cost.
Parallel reasoning branches batch naturally on GPUs; EMA, classification and gate computations are negligible.