AutoTTS Cuts LLM Token Usage by 69.5% with AI-Designed Reasoning Strategies

Researchers from Meta, Google, and several universities have published AutoTTS, a framework that automatically discovers test-time scaling strategies for language models. In experiments on Qwen3 and DeepSeek models, AutoTTS reduced token consumption by 69.5% relative to Self-Consistency with 64 reasoning paths while maintaining the same average accuracy across the Qwen3 models. The entire strategy discovery process cost $39.90 and took 160 minutes.

Key takeaways

AutoTTS cuts token usage by 69.5% vs Self-Consistency with 64 reasoning paths
Discovering an optimal strategy costs $39.90 and takes 160 minutes
On GPQA-Diamond: 510K tokens → 151K tokens with a slight accuracy improvement
Framework tested on Qwen3 (0.6B–8B) and DeepSeek-R1 (8B distill)
Code and Confidence Momentum Controller available on GitHub

The problem: hand-crafted reasoning strategies

Test-time scaling (TTS) improves language model quality by allocating extra compute during inference (the stage when a trained model generates a response, as opposed to training). Rather than generating an answer in a single pass, the model can explore multiple reasoning paths, evaluate intermediate steps, and select the best final result. This is the core mechanism behind reasoning models and LLMs with Chain-of-Thought, enabling small models to rival much larger counterparts on difficult math and logic tasks.
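To make the idea concrete, here is a minimal sketch of the simplest TTS strategy, Self-Consistency: sample several reasoning paths and take a majority vote over their final answers. The function name and the sample answers are illustrative assumptions, not code from AutoTTS.

```python
from collections import Counter

def self_consistency(answers):
    """Return the most common final answer and its share of the votes."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(answers)

# Hypothetical final answers from 8 independently sampled reasoning paths.
sampled = ["42", "42", "17", "42", "36", "42", "17", "42"]
answer, share = self_consistency(sampled)
print(answer, share)
```

Every extra path costs tokens, which is why the choice of how many paths to sample, and when to stop, dominates the cost of this approach.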

The problem is that all TTS strategies have historically been hand-crafted. An engineer had to intuitively decide when a model should branch into new reasoning paths, when to deepen a current path, when to prune unpromising branches, and when to stop computing altogether. Width (number of parallel paths) and depth (how far each develops) were parameters tuned by humans — leaving a massive space of potential approaches unexplored.

Existing algorithms like Self-Consistency (SC), Adaptive-Consistency (ASC), and Parallel-Probe are sound but constrained by human intuition. AutoTTS attacks this limitation directly.

AutoTTS: strategy search as an algorithmic problem

Instead of asking an engineer to design a strategy, AutoTTS reframes the task as a search problem for an AI agent. The human's role becomes defining the discovery environment: the state-action space, the optimization objective (balancing accuracy against cost), and the feedback mechanism. From there, an autonomous agent — a language model acting as an 'explorer' — iteratively proposes, tests, and refines compute management strategies.

The key to affordability is an offline replay environment. Rather than running the base model every time the agent tests a new strategy, AutoTTS operates on thousands of reasoning trajectories collected offline in advance. Each trajectory contains 'probe signals' — intermediate answers that help evaluate reasoning progress without generating new tokens. This makes the full discovery cycle cheap: $39.90 per run.
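The replay idea can be sketched as follows, under loud assumptions: the data layout (a list of token-count and intermediate-answer pairs per trajectory), the fixed width/depth strategy, and the accuracy-minus-token-penalty objective are all hypothetical stand-ins, not the format or objective AutoTTS actually uses. The point is only that a candidate strategy can be scored from recorded data without calling the model.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Trajectory:
    # probes[i] = (tokens generated so far, intermediate answer at that point)
    probes: list

@dataclass
class Problem:
    truth: str
    trajectories: list

def replay(problem, width, max_probes):
    """Score a fixed (width, depth) strategy from recorded probes only.

    Assumes max_probes >= 1 and that every trajectory has at least one probe.
    """
    answers, tokens = [], 0
    for traj in problem.trajectories[:width]:
        cut = traj.probes[:max_probes]
        tokens += cut[-1][0]        # tokens spent up to the last probe used
        answers.append(cut[-1][1])  # answer the path held at that point
    voted = Counter(answers).most_common(1)[0][0]
    return voted == problem.truth, tokens

def score(problems, width, max_probes, token_penalty=1e-5):
    """Hypothetical objective: accuracy minus a penalty per average token."""
    results = [replay(p, width, max_probes) for p in problems]
    accuracy = sum(ok for ok, _ in results) / len(results)
    avg_tokens = sum(t for _, t in results) / len(results)
    return accuracy - token_penalty * avg_tokens, accuracy, avg_tokens

# Made-up recorded data for a single problem whose correct answer is "B".
problem = Problem(truth="B", trajectories=[
    Trajectory([(100, "A"), (200, "B"), (300, "B")]),
    Trajectory([(120, "B"), (240, "B"), (360, "B")]),
    Trajectory([(90, "A"), (180, "A"), (270, "B")]),
])
for depth in (1, 2, 3):
    print(depth, replay(problem, width=3, max_probes=depth))
```

In this toy data, stopping at the first probe gives the wrong vote, while stopping at the second is already correct and cheaper than running every path to the end. That is the kind of trade-off a search agent can explore thousands of times at no inference cost.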

A controller a human wouldn't design

The most interesting result of AutoTTS is not the benchmark score, but the nature of the discovered strategy. The agent proposed a controller called the Confidence Momentum Controller (CMC), which combines three mechanisms rarely found together in hand-crafted algorithms.

First: stop based on the trend, not a single reading. While reasoning, the model continuously produces a confidence estimate for its current answer. Hand-crafted strategies stop the model the moment that confidence crosses a threshold, a bit like discharging a patient after one good thermometer reading. The problem is that confidence can spike briefly without any real reason, so the model can stop early with the wrong answer. Instead of looking at a single reading, CMC tracks a smoothed average of recent confidence values (formally: an exponential moving average, EMA, which weights the most recent steps most heavily) and stops the model only once that average is both high and holding steady.
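A minimal sketch of the difference between the two stopping rules, assuming a per-step confidence value is available. The parameter names and values (alpha, high, max_drift, patience) are illustrative choices, not the settings used by CMC.

```python
def naive_stop_step(confidences, threshold=0.8):
    """Stop at the first step whose confidence crosses the threshold."""
    for step, c in enumerate(confidences):
        if c >= threshold:
            return step
    return None

def ema_stop_step(confidences, alpha=0.5, high=0.8, max_drift=0.05, patience=2):
    """Stop once the EMA is >= high and has moved by at most max_drift
    for `patience` consecutive steps. Returns the step index or None."""
    ema = None
    stable = 0
    for step, c in enumerate(confidences):
        prev = ema
        # Standard EMA update: blend the new reading with the running value.
        ema = c if ema is None else alpha * c + (1 - alpha) * ema
        if prev is not None and ema >= high and abs(ema - prev) <= max_drift:
            stable += 1
        else:
            stable = 0
        if stable >= patience:
            return step
    return None

# Made-up confidence trace with a spurious spike at step 1.
trace = [0.3, 0.95, 0.3, 0.4, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9]
print(naive_stop_step(trace), ema_stop_step(trace))
```

On this made-up trace the single-reading rule stops at step 1, on the spike, while the EMA rule waits until step 8, after confidence has stayed high for several steps in a row.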

Second: width and depth controlled together, not separately. While reasoning, the model has two levers: it can explore sideways (try many different approaches in parallel — that is "width") or dig deeper into one approach (that is "depth"). Traditional algorithms set both parameters independently and up front, before reasoning starts. CMC works dynamically: when it sees current paths stalling — their confidence flat or declining — it spawns new branches on its own, without waiting for an external signal. A bit like a brainstorming session where the facilitator adds fresh ideas exactly when the existing ones get stuck.
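One way to picture the stall-triggered branching is the sketch below. It assumes each path keeps a history of confidence values. The stall test, the window size, and the one-new-branch-per-stalled-path rule are hypothetical simplifications, not the rule CMC actually learned.

```python
def is_stalled(history, window=3, min_gain=0.02):
    """A path counts as stalled if confidence gained less than min_gain
    over the last `window` steps (flat or declining)."""
    if len(history) < window + 1:
        return False  # not enough readings to judge a trend yet
    return history[-1] - history[-1 - window] < min_gain

def branches_to_spawn(paths, max_width=8, window=3, min_gain=0.02):
    """Spawn one new branch per stalled path, capped by the width budget."""
    stalled = sum(is_stalled(h, window, min_gain) for h in paths.values())
    room = max(0, max_width - len(paths))
    return min(stalled, room)

# Made-up confidence histories for three active paths.
paths = {
    "a": [0.4, 0.5, 0.5, 0.5, 0.5],    # flat
    "b": [0.3, 0.4, 0.5, 0.6, 0.7],    # still improving
    "c": [0.6, 0.6, 0.55, 0.5, 0.45],  # declining
}
print(branches_to_spawn(paths))
```

Here width is not fixed up front: it grows only when existing paths stop making progress, and only while the width budget has room.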

Third: extra compute where consensus is forming. When several parallel reasoning paths start converging on the same answer, CMC stops splitting resources evenly. It identifies the branches backing the emerging consensus and gives them priority access to additional compute. The remaining branches keep working in the background, but it is the consensus that gets verified first. A bit like a discussion moderator who notices that four out of ten participants are independently reaching the same conclusion, and gives those four extra time to flesh out their reasoning.
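The consensus-weighted allocation can be sketched as a weighted split of a token budget. The bonus factor and the rule that at least two branches must agree before a consensus counts are assumptions made for illustration. The source does not specify how CMC weights branches.

```python
from collections import Counter

def allocate_budget(current_answers, budget, consensus_bonus=2.0):
    """Split a token budget across branches, giving extra weight to
    branches whose current answer matches the leading answer."""
    counts = Counter(current_answers.values())
    leader, votes = counts.most_common(1)[0]
    has_consensus = votes > 1  # a single branch is not a consensus
    weights = {
        branch: consensus_bonus if has_consensus and ans == leader else 1.0
        for branch, ans in current_answers.items()
    }
    total = sum(weights.values())
    return {b: int(budget * w / total) for b, w in weights.items()}

# Made-up intermediate answers from four parallel branches.
answers = {"p1": "B", "p2": "B", "p3": "A", "p4": "C"}
print(allocate_budget(answers, budget=6000))
```

The branches outside the consensus still receive compute, matching the description above. They are simply deprioritized rather than pruned.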

Results: fewer tokens, same or higher accuracy

The research team ran experiments on Qwen3 models (0.6B–8B) and on an 8B distillation of DeepSeek-R1. The strategy was discovered on the AIME24 benchmark and tested on AIME25, HMMT25, and GPQA-Diamond.

In cost-conscious mode, AutoTTS cut token consumption by 69.5% compared to SC@64 while maintaining the same average accuracy across four Qwen3 models. When compute budget was increased, AutoTTS improved peak accuracy above all hand-crafted baselines in five of eight test cases. On GPQA-Diamond, token usage dropped from 510K to 151K with a slight accuracy improvement. On the DeepSeek model, AutoTTS achieved the highest overall accuracy on HMMT25 while cutting token spend nearly in half.
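As a quick sanity check on the GPQA-Diamond figures, the relative saving can be computed directly. This assumes the 510K and 151K figures are token counts measured on the same benchmark and model, as the sentence above implies. It is a separate number from the 69.5% average quoted across models.

```python
def token_reduction(before, after):
    """Fraction of tokens saved when going from `before` to `after`."""
    return (before - after) / before

saving = token_reduction(510_000, 151_000)
ratio = 510_000 / 151_000
print(f"{saving:.1%} fewer tokens, about {ratio:.1f}x less compute")
```

The GPQA-Diamond saving works out to roughly 70%, in line with the headline average.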

Why this matters

Inference costs are one of the primary bottlenecks in deploying reasoning models at production scale. Every response requiring tens of thousands of tokens has a direct impact on AI service margins. AutoTTS shows that optimizing these costs does not have to be a manual, time-consuming process — it can be automated for a few dozen dollars.

Equally significant is the shift in the engineer's role: instead of designing detailed heuristics, they define the environment and the success criteria, and the agent does the rest. This approach could transfer to other areas of ML optimization where the solution space is too large for manual exploration.

The democratization aspect is also notable: small teams without a dedicated research budget can now build inference strategies tailored to their own models and tasks in a single afternoon. The barrier to advanced TTS optimization has just dropped dramatically.

What's next?

The AutoTTS framework and CMC are available on GitHub as open source, and production implementations may appear within weeks.
Researchers point to potential extensions of AutoTTS to multi-model strategies and tasks beyond mathematics (e.g., code generation, legal reasoning).
Companies deploying reasoning models can start testing AutoTTS on their own models and internal benchmarks now — the framework requires no changes to the base model.

Sources

VentureBeat — Researchers automated LLM reasoning strategy design and cut token usage by 69.5%
arXiv — AutoTTS: Automated Test-Time Scaling for Large Language Models
GitHub — AutoTTS repository (zhengkid/AutoTTS)