The LLM searched for
a better strategy.
Researchers at UMD, UVA, UNC, Google, and Meta built a framework in which a coding agent discovers how to guide LLM reasoning at inference time. Instead of relying on hand-tuned heuristics for when to branch, probe, or stop a reasoning search, the agent proposes controller programs and tests them against a frozen cache of pre-collected traces, with no live model inference during search. The discovered strategy cuts inference tokens by roughly 70% at matched accuracy relative to 64-sample self-consistency. The full discovery run cost $39.90.
First surfaced in Tandemly Briefing — 2026-05-20.
Hand-crafted heuristics
are the bottleneck now.
Test-time scaling works. The hard part is knowing how to scale. Every structural decision about when to branch or stop is hand-designed by researchers with no way to know how far from optimal their choices are.
Test-time scaling is the idea that you can get better LLM answers by spending more compute at inference time rather than at training time. The simplest version: run the same prompt through the model 64 times in parallel and take the majority vote. Better versions add structure. The model branches when it is uncertain, probes a hypothesis by sampling a few continuations before committing, and stops when it has converged on an answer with enough confidence.
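The simplest version described above, sampling many answers and taking the majority vote, fits in a few lines. This is a minimal sketch: the sample answers are made up for illustration, and a real pipeline would first extract each final answer from a full reasoning trace.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    return Counter(answers).most_common(1)[0][0]

# Hypothetical final answers from 8 sampled traces (SC@64 would use 64).
samples = ["204", "204", "113", "204", "96", "204", "113", "204"]
print(majority_vote(samples))  # prints 204
```

Every sample here runs to completion, which is why the cost grows linearly with the number of samples. The structured variants try to spend those tokens more selectively.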
These structural decisions are all hand-designed. A researcher decides what counts as "confident enough" to stop. An engineer picks the branching threshold. A team iterates on stopping criteria by running experiments, observing what seems to improve accuracy, and adjusting. These are reasonable engineering choices. They also have a fundamental problem: nobody knows how close they are to optimal.
The space of possible controller programs is enormous. Human intuition explores a small corner of it. There is no principled way to tell whether a hand-tuned threshold represents a near-optimal operating point or just the first parameter value that happened to work well enough to stop the search. The same gap appears when conditions change: a stopping criterion tuned for one model and benchmark often requires manual retuning when either changes, because the intuition that produced it does not transfer automatically.
What if you treated test-time scaling strategy design as a search problem rather than an engineering problem? Could an LLM coding agent search over controller programs more effectively than humans hand-tune heuristics, and could the search be made cheap enough to run routinely?
An agent searches for
the controller instead.
AutoTTS reformulates strategy design as program synthesis. A coding agent proposes controllers, a frozen replay cache evaluates them without any new model calls, and a single meta-parameter maps the full accuracy-efficiency tradeoff.
A controller is a piece of code. It looks at the current state of a reasoning search: how many active branches there are, what their confidence scores look like, how confidence has trended over recent steps. Based on that state, it decides what to do next: branch, continue, probe, prune, or stop. The question AutoTTS answers is: what is the best such program for a given base model and task distribution?
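To make "a controller is a piece of code" concrete, here is a deliberately naive sketch of the interface. The names `SearchState`, `Action`, and `toy_controller`, along with every threshold value, are assumptions made for illustration and are not taken from the paper. It is the kind of hand-written rule set the search is meant to replace.

```python
from dataclasses import dataclass
from enum import Enum

class Action(Enum):
    BRANCH = "branch"
    CONTINUE = "continue"
    PROBE = "probe"
    PRUNE = "prune"
    STOP = "stop"

@dataclass
class SearchState:
    branch_confidences: list[float]  # one score per active branch
    confidence_history: list[float]  # pool confidence over recent steps

def toy_controller(state, stop_at=0.9, branch_below=0.5, prune_below=0.2):
    """Map the current search state to the next action (hypothetical rules)."""
    scores = state.branch_confidences
    if not scores:
        return Action.BRANCH  # nothing active yet, so start a branch
    pool = sum(scores) / len(scores)
    if pool >= stop_at:
        return Action.STOP
    if len(scores) > 1 and min(scores) < prune_below:
        return Action.PRUNE
    if pool < branch_below:
        return Action.BRANCH
    return Action.CONTINUE
```

For example, `toy_controller(SearchState([0.95, 0.92], [0.8, 0.9]))` returns `Action.STOP`, while a pool at 0.65 average confidence returns `Action.CONTINUE`. This toy never probes and ignores the confidence history; a discovered controller is free to use both.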
The key engineering insight is the replay environment. Before running any controller search, the researchers collected reasoning traces from the base model on AIME24 problems and stored them in a structured cache, partitioned into segments with probe responses materialized at branch points. A candidate controller can be evaluated against this frozen cache without calling the LLM at all. A controller that branches too aggressively wastes simulated tokens. One that stops too early misses solutions visible in the trace data. Both signals are available instantly, without any new model inference.
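A heavily simplified sketch of replay evaluation follows. The cache layout here is an assumption: each problem stores only a gold answer and whole traces as (final answer, token count) pairs, whereas the cache described above is partitioned into segments with probe responses. The data is invented, and the controller is reduced to a single "stop now?" decision.

```python
from collections import Counter

# Hypothetical frozen cache: gold answer plus pre-collected traces,
# each stored as (final_answer, token_count).
CACHE = [
    {"gold": "204", "traces": [("204", 900), ("204", 1100), ("113", 800), ("204", 1000)]},
    {"gold": "17", "traces": [("5", 700), ("17", 1200), ("17", 900), ("17", 1000)]},
]

def replay(controller, cache):
    """Score a controller on cached traces; no model is ever called."""
    correct, tokens = 0, 0
    for problem in cache:
        seen = []
        for answer, cost in problem["traces"]:
            seen.append(answer)
            tokens += cost          # simulated spend, read from the cache
            if controller(seen):    # controller decides whether to stop
                break
        vote = Counter(seen).most_common(1)[0][0]
        correct += vote == problem["gold"]
    return correct / len(cache), tokens

def stop_after_one(seen):
    return True

def stop_when_two_agree(seen):
    return Counter(seen).most_common(1)[0][1] >= 2

print(replay(stop_after_one, CACHE))       # prints (0.5, 1600)
print(replay(stop_when_two_agree, CACHE))  # prints (1.0, 4800)
print(replay(lambda seen: False, CACHE))   # prints (1.0, 7600)
```

The three runs show both failure signals on toy data: stopping too early loses accuracy, and never stopping spends 7600 simulated tokens for the accuracy that 4800 already buys.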
With this setup, a coding agent runs the search. It proposes a controller as a program, evaluates it on the replay cache, observes accuracy and token usage at the candidate's operating point, and revises the code. This loop repeats until convergence. Only the coding agent's own API calls cost money during the search; the base model is never queried. One complete discovery run takes 160 minutes and $39.90.
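The shape of that propose, evaluate, revise loop can be sketched as below. Everything here is a stand-in: in the paper a coding agent writes and revises whole controller programs, whereas this toy `propose` just enumerates one integer parameter, and the replay results and scoring rule are invented for illustration.

```python
def search(propose, evaluate, score, rounds):
    """Propose-evaluate-revise loop that keeps the best candidate seen."""
    best, best_metrics = None, None
    history = []
    for _ in range(rounds):
        candidate = propose(history)   # an agent would read history here
        metrics = evaluate(candidate)  # replay only, no base-model calls
        history.append((candidate, metrics))
        if best is None or score(metrics) > score(best_metrics):
            best, best_metrics = candidate, metrics
    return best, best_metrics, history

# Hypothetical replay results keyed by a parameter k
# ("stop once k cached traces agree"): (accuracy, tokens used).
TOY_REPLAY = {1: (0.5, 1600), 2: (1.0, 4800), 3: (1.0, 6700)}

def evaluate(k):
    return TOY_REPLAY[k]

def propose(history):
    return len(history) + 1  # stand-in for the agent: try k = 1, 2, 3

def score(metrics):
    accuracy, tokens = metrics
    return accuracy - tokens / 10_000  # assumed accuracy/cost tradeoff

best, best_metrics, history = search(propose, evaluate, score, rounds=3)
print(best, best_metrics)  # prints 2 (1.0, 4800)
```

Because `evaluate` only reads cached results, each round is cheap, which is what lets the loop run many iterations on a small budget.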
All internal thresholds in the discovered controller map deterministically from a single meta-parameter β ∈ [0,1]. At β=0, the controller optimizes for minimum token usage. At β=1, it pushes for maximum accuracy. This collapses the full accuracy-efficiency tradeoff into a single knob that deployment teams can set without understanding the controller's internals. Both operating points come from the same discovered program, not separate systems.
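One way such a deterministic mapping could look is linear interpolation between a token-saving setting and an accuracy-seeking one. The threshold names and numeric ranges below are assumptions chosen for illustration; the paper's actual mapping is not reproduced here.

```python
def thresholds_from_beta(beta):
    """Derive all internal thresholds from a single knob beta in [0, 1]."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")

    def lerp(at_zero, at_one):
        return at_zero + beta * (at_one - at_zero)

    return {
        # Higher beta: demand more confidence before stopping.
        "stop_confidence": lerp(0.70, 0.98),
        # Higher beta: allow a wider search.
        "max_branches": round(lerp(2, 16)),
        # Higher beta: prune less eagerly.
        "prune_confidence": lerp(0.40, 0.10),
    }

for beta in (0.0, 0.5, 1.0):
    print(beta, thresholds_from_beta(beta))
```

Since every threshold is a pure function of β, two deployments with the same β run identical controllers, and sweeping β from 0 to 1 traces the tradeoff curve without touching the controller code.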
The discovered controller
beats everything handcrafted.
The agent converged on a controller called CMC that combines four mechanisms no prior baseline had used together. It generalized from AIME24 training to held-out AIME25 and HMMT25 benchmarks across four model scales.
The Confidence Momentum Controller combines four mechanisms the agent arrived at through iteration. Trend-based stopping: CMC maintains an exponential moving average of the pool's confidence and stops only when confidence is both high and trending upward, not just when it crosses a threshold. Coupled width-depth control: it widens the search when uncertain and deepens when converging. Alignment-aware depth allocation: it invests more depth in traces that agree with the emerging consensus. Conservative branch abandonment: it prunes only when a branch's confidence has clearly fallen and shows no recovery trend.
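The first mechanism, trend-based stopping, is the easiest to illustrate. The sketch below is an assumption-laden toy and not the discovered controller: the smoothing factor `alpha`, the `high` threshold, and the confidence sequences are all invented.

```python
def ema_stop(confidences, alpha=0.3, high=0.8, min_rise=0.0):
    """Return the first step where the EMA is both high and rising, else None."""
    ema = None
    for step, c in enumerate(confidences):
        prev = ema
        ema = c if ema is None else alpha * c + (1 - alpha) * ema
        if prev is not None and ema >= high and ema - prev > min_rise:
            return step
    return None

rising = [0.5, 0.7, 0.9, 0.95, 0.97]
falling = [0.95, 0.9, 0.85, 0.82]
print(ema_stop(rising))   # prints 4
print(ema_stop(falling))  # prints None
```

A plain threshold at 0.8 on the raw scores would stop the rising sequence at step 2 and the falling one at step 0. The trend-aware rule waits until the smoothed confidence has caught up, and it refuses to stop on a pool whose confidence is high but eroding.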
None of these mechanisms is individually novel. The notable result is that the agent arrived at this specific combination, which no prior handcrafted baseline had assembled, and that the combination matched or beat the manually designed strategies in most of the settings tested.
At β=0.5, CMC reduces aggregate token usage by roughly 69.5% compared to SC@64 (self-consistency with 64 parallel samples) while maintaining matched average accuracy across four Qwen3 model scales: 45.3 versus 45.2. At β=1.0, CMC exceeds all handcrafted baselines on peak accuracy in 5 of 8 cases across model-benchmark pairs.
SC@64 represents the strong baseline these results are measured against. Running 64 parallel samples and taking the majority vote is already a well-established, costly approach. CMC at β=0.5 matches that accuracy at roughly 30% of the token cost.
CMC was discovered using only AIME24 problems. When tested on held-out benchmarks (AIME25 and HMMT25) across four Qwen3 scales, it generalized: it outperformed every handcrafted baseline on average accuracy in 3 of 4 model scales. On Qwen3-8B it remained competitive: 62.7 versus 62.8 for SC@64. The controller did not need retraining or retuning to transfer across held-out problems of the same general type.
All benchmark results are on math reasoning (AIME24, AIME25, HMMT25). Math is well-suited to this approach because correct answers have clear verification signals and difficulty has known structure. How well the approach generalizes to open-ended tasks, long-horizon agents, or retrieval-augmented workflows is not demonstrated by this paper.
The replay cache is also model- and task-specific. A controller discovered for one Qwen3 model on AIME24 is not guaranteed to transfer to a different model family or a radically different task type without re-running discovery. At $39.90 per run, re-running is practical, but the step cannot be skipped when the base model or task distribution changes significantly.
What this means
for production inference.
The discovery cost is low enough that strategy search is now a routine operation, not a research project. What changes is how you think about the work of building an efficient reasoning pipeline.
Where to go
from here.
Concrete steps to engage with the paper and try the approach yourself.