Test-Time Scaling · Agentic Discovery

The LLM searched for
a better strategy.

Researchers at UMD, UVA, UNC, Google, and Meta built a framework in which a coding agent discovers how to guide LLM reasoning at inference time. Instead of relying on hand-tuned heuristics for when to branch, probe, or stop a reasoning search, the agent proposes controller programs and tests them against a frozen cache of pre-collected traces, with no live model inference during search. The discovered strategy cuts inference tokens by roughly 70% at matched accuracy. The full discovery run cost $39.90.

Core concept
Controller synthesis: an LLM coding agent searches for the program that governs when to branch, continue, probe, or stop a reasoning search, by testing candidates against a frozen replay cache rather than running live inference. The result is a discovered controller that can be parameterized along the accuracy-efficiency tradeoff with a single scalar.

First surfaced in Tandemly Briefing — 2026-05-20.


Hand-crafted heuristics
are the bottleneck now.

Test-time scaling works. The hard part is knowing how to scale. Every structural decision about when to branch or stop is hand-designed by researchers with no way to know how far from optimal their choices are.

Test-time scaling is the idea that you can get better LLM answers by spending more compute at inference time rather than at training time. The simplest version: run the same prompt through the model 64 times in parallel and take the majority vote. Better versions add structure. The model branches when it is uncertain, probes a hypothesis by sampling a few continuations before committing, and stops when it has converged on an answer with enough confidence.
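The simplest version described above, self-consistency with majority voting, can be sketched in a few lines. This is a minimal illustration, not code from the paper: `sample_fn` is a hypothetical stand-in for whatever call returns one final answer from your model.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    return Counter(answers).most_common(1)[0][0]


def self_consistency(sample_fn, prompt, n=64):
    """Sample n answers for the same prompt and majority-vote them.

    sample_fn is a hypothetical callable: prompt -> final answer string.
    """
    return majority_vote([sample_fn(prompt) for _ in range(n)])
```

Every sample runs to completion regardless of how quickly the answers agree, which is the cost that more structured strategies try to avoid.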

These structural decisions are all hand-designed. A researcher decides what counts as "confident enough" to stop. An engineer picks the branching threshold. A team iterates on stopping criteria by running experiments, observing what seems to improve accuracy, and adjusting. These are reasonable engineering choices. They also have a fundamental problem: nobody knows how close they are to optimal.

The space of possible controller programs is enormous. Human intuition explores a small corner of it. There is no principled way to tell whether a hand-tuned threshold represents a near-optimal operating point or just the first parameter value that happened to work well enough to stop the search. The same gap shows up whenever the setup changes: a stopping criterion tuned for one model and benchmark often requires manual retuning when either changes, because the intuition that produced it does not transfer automatically.

The question this paper asks

What if you treated test-time scaling strategy design as a search problem rather than an engineering problem? Could an LLM coding agent search over controller programs more effectively than humans hand-tune heuristics, and could the search be made cheap enough to run routinely?

An agent searches for
the controller instead.

AutoTTS reformulates strategy design as program synthesis. A coding agent proposes controllers, a frozen replay cache evaluates them without any new model calls, and a single meta-parameter maps the full accuracy-efficiency tradeoff.

A controller is a piece of code. It looks at the current state of a reasoning search: how many active branches there are, what their confidence scores look like, how confidence has trended over recent steps. Based on that state, it decides what to do next: branch, continue, probe, prune, or stop. The question AutoTTS answers is: what is the best such program for a given base model and task distribution?
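To make "a controller is a piece of code" concrete, here is a minimal sketch of the interface such a program might have. The `SearchState` fields, the `Action` names, and every threshold value are assumptions chosen for illustration; the paper's actual state representation may differ. The example deliberately shows a simple hand-tuned controller, the kind of baseline the search is meant to improve on.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Action(Enum):
    BRANCH = "branch"
    CONTINUE = "continue"
    PROBE = "probe"
    PRUNE = "prune"
    STOP = "stop"


@dataclass
class SearchState:
    # Hypothetical fields: one confidence score per active branch, plus
    # the pool-level confidence observed over recent steps.
    branch_confidences: List[float]
    confidence_history: List[float]


def threshold_controller(state, stop_at=0.9, branch_below=0.5, max_branches=8):
    """Hand-tuned baseline: fixed thresholds, no trend information."""
    best = max(state.branch_confidences, default=0.0)
    if best >= stop_at:
        return Action.STOP
    if best < branch_below and len(state.branch_confidences) < max_branches:
        return Action.BRANCH
    return Action.CONTINUE
```

The three keyword arguments are exactly the kind of hand-picked values the article refers to: each one is a guess with no indication of how far it sits from the best choice.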

The key engineering insight is the replay environment. Before running any controller search, the researchers collected reasoning traces from the base model on AIME24 problems and stored them in a structured cache, partitioned into segments with probe responses materialized at branch points. A candidate controller can be evaluated against this frozen cache without calling the LLM at all. A controller that branches too aggressively wastes simulated tokens. One that stops too early misses solutions visible in the trace data. Both signals are available instantly, without any new model inference.
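A heavily simplified sketch of replay evaluation follows. It assumes a cache in which each problem stores a ground-truth answer and a list of completed traces with token counts, and it only models the stop decision. The real cache described above is richer (segments, materialized probe responses), so treat the record layout and function names as hypothetical.

```python
from collections import Counter


def replay_evaluate(should_stop, cache):
    """Score a stopping rule against cached traces; no model is called.

    cache: list of {"truth": str, "traces": [{"answer": str, "tokens": int}]}
           (assumed layout; every problem needs at least one trace)
    should_stop: callable taking the answers seen so far, returning bool
    Returns (accuracy, total_tokens).
    """
    correct, total_tokens = 0, 0
    for problem in cache:
        answers = []
        for trace in problem["traces"]:
            answers.append(trace["answer"])
            total_tokens += trace["tokens"]  # simulated spend, read from cache
            if should_stop(answers):
                break
        vote = Counter(answers).most_common(1)[0][0]
        correct += vote == problem["truth"]
    return correct / len(cache), total_tokens
```

Because both numbers come from lookups, a candidate can be scored in well under a second, which is what makes an iterative search over programs affordable.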

With this setup, a coding agent runs the search. It proposes a controller as a program, evaluates it on the replay cache, observes accuracy and token usage at the candidate's operating point, and revises the code. This loop repeats until convergence. Only the coding agent's own API calls cost money during the search; the base model is never queried. One complete discovery run takes 160 minutes and $39.90.
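The propose, evaluate, revise loop can be sketched as below. `propose` stands in for the coding agent and `evaluate` for the replay environment; both are hypothetical callables. The scalar score that trades accuracy against tokens is an assumption made so the sketch can pick a winner, not the paper's selection criterion.

```python
def discover(propose, evaluate, rounds=10, token_weight=1e-4):
    """Run a propose-evaluate-revise loop and return the best candidate.

    propose(history) -> candidate controller (the agent sees past results)
    evaluate(candidate) -> (accuracy, tokens), measured on the replay cache
    token_weight is an illustrative penalty per token, not a paper value.
    """
    history = []
    best, best_score = None, float("-inf")
    for _ in range(rounds):
        candidate = propose(history)
        accuracy, tokens = evaluate(candidate)
        history.append((candidate, accuracy, tokens))
        score = accuracy - token_weight * tokens
        if score > best_score:
            best, best_score = candidate, score
    return best, history
```

Passing the full history back to `propose` is the design choice that matters here: the agent revises code in light of measured accuracy and token usage rather than guessing blind.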

Before: manual strategy design
Researchers hand-tune thresholds. Try a stopping criterion, run a benchmark, adjust based on results. Requires live model inference at every evaluation step. Intuition-driven, slow to transfer across models, no bound on how far from optimal the result is.
After: AutoTTS discovery loop
A coding agent proposes and tests controller programs. Evaluation runs against a frozen trace cache with zero new LLM calls. The agent iterates in code, not experiments. One run: 160 minutes, $39.90, a controller that outperforms every handcrafted baseline.
Discovery cost: $39.90 (one complete run)
Wall-clock time: 160 minutes per run
Token reduction: 69.5% (vs SC@64 at matched accuracy)
LLM calls during eval: 0 (replay cache only)
What the β parameter does

All internal thresholds in the discovered controller map deterministically from a single meta-parameter β ∈ [0,1]. At β=0, the controller optimizes for minimum token usage. At β=1, it pushes for maximum accuracy. This collapses the full accuracy-efficiency tradeoff into a single knob that deployment teams can set without understanding the controller's internals. Both operating points come from the same discovered program, not separate systems.
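One way to picture a deterministic mapping from β to internal thresholds is linear interpolation between an efficiency-oriented setting and an accuracy-oriented one. The threshold names and endpoint values below are invented for illustration; the discovered controller's actual mapping is defined in the paper and its code, not here.

```python
def thresholds_from_beta(beta):
    """Map a single knob beta in [0, 1] to a full set of thresholds.

    beta=0 favors low token usage, beta=1 favors accuracy.
    All names and endpoint values are hypothetical.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")

    def lerp(at_zero, at_one):
        return (1.0 - beta) * at_zero + beta * at_one

    return {
        "stop_confidence": lerp(0.70, 0.95),  # demand more certainty as beta rises
        "max_branches": round(lerp(4, 64)),   # allow a wider search as beta rises
        "ema_alpha": lerp(0.5, 0.2),          # smooth more heavily as beta rises
    }
```

A deployment team would then call `thresholds_from_beta(0.5)` and never touch the individual values.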

The discovered controller
beats everything handcrafted.

The agent converged on a controller called CMC that combines four mechanisms no prior baseline had used together. Discovered on AIME24 alone, it generalized to the held-out AIME25 and HMMT25 benchmarks across four model scales.

What CMC does

The Confidence Momentum Controller combines four mechanisms the agent arrived at through iteration. Trend-based stopping: CMC maintains an exponential moving average of the pool's confidence and stops only when confidence is both high and trending upward, not just when it crosses a threshold. Coupled width-depth control: it widens the search when uncertain and deepens when converging. Alignment-aware depth allocation: it invests more depth in traces that agree with the emerging consensus. Conservative branch abandonment: it prunes only when a branch's confidence has clearly fallen and shows no recovery trend.
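The first mechanism, trend-based stopping, is easy to sketch. The version below keeps an exponential moving average of pool confidence and stops only when the average is both above a level and still rising. The smoothing factor, the level, and the "rising" test (latest average greater than the previous one) are assumptions; CMC's exact rule may differ.

```python
def ema_series(values, alpha=0.3):
    """Exponential moving average of values, seeded with the first value."""
    smoothed, current = [], None
    for value in values:
        current = value if current is None else alpha * value + (1 - alpha) * current
        smoothed.append(current)
    return smoothed


def should_stop(confidences, level=0.8, alpha=0.3):
    """Stop only when smoothed confidence is high AND trending upward."""
    if len(confidences) < 2:
        return False  # no trend can be measured from a single point
    ema = ema_series(confidences, alpha)
    return ema[-1] >= level and ema[-1] > ema[-2]
```

For a pool whose confidence reads 0.99, 0.95, 0.90, a plain threshold at 0.8 would stop immediately, while this rule keeps going because the smoothed confidence is falling.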

None of these mechanisms are individually novel. The notable result is that the agent arrived at this specific combination, which no prior handcrafted baseline had assembled, and that it outperformed every manually designed strategy tested.

Accuracy and efficiency results

At β=0.5, CMC reduces aggregate token usage by 69.5% compared to SC@64 (self-consistency with 64 parallel samples) while maintaining matched average accuracy across four Qwen3 model scales: 45.3 versus 45.2. At β=1.0, CMC exceeds all handcrafted baselines on peak accuracy in 5 of 8 model-benchmark pairs.

SC@64 represents the strong baseline these results are measured against. Running 64 parallel samples and taking the majority vote is already a well-established, costly approach. CMC at β=0.5 matches that accuracy at roughly 30% of the token cost.
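The "roughly 30%" figure follows directly from the reported 69.5% reduction. Writing T for aggregate token usage (notation introduced here, not taken from the paper):

```latex
\frac{T_{\mathrm{CMC}}}{T_{\mathrm{SC@64}}} = 1 - 0.695 = 0.305,
\qquad
\frac{T_{\mathrm{SC@64}}}{T_{\mathrm{CMC}}} = \frac{1}{0.305} \approx 3.28
```

Put differently, SC@64 spends a little over three times as many tokens to reach the same average accuracy.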

Generalization to held-out benchmarks

CMC was discovered using only AIME24 problems. When tested on held-out benchmarks (AIME25 and HMMT25) across four Qwen3 scales, it generalized: it outperformed every handcrafted baseline on average accuracy in 3 of 4 model scales. On Qwen3-8B it remained competitive: 62.7 versus 62.8 for SC@64. The controller did not need retraining or retuning to transfer across held-out problems of the same general type.

Scope and limitations

All benchmark results are on math reasoning (AIME24, AIME25, HMMT25). Math is well-suited to this approach because correct answers have clear verification signals and difficulty has known structure. How well the approach generalizes to open-ended tasks, long-horizon agents, or retrieval-augmented workflows is not demonstrated by this paper.

The replay cache is also model- and task-specific. A controller discovered for one Qwen3 model on AIME24 is not guaranteed to transfer to a different model family or a radically different task type without re-running discovery. At $39.90 per run, re-running is practical, but the step cannot be skipped when the base model or task distribution changes significantly.

What this means
for production inference.

The discovery cost is low enough that strategy search is now a routine operation, not a research project. What changes is how you think about the work of building an efficient reasoning pipeline.

1
For engineers running LLM inference at scale
If your current test-time scaling strategy is hand-tuned self-consistency or a manually designed search heuristic, AutoTTS provides a credible path to cutting inference tokens by 60 to 70% without degrading accuracy. The approach is model-agnostic in principle. What changes is the trace cache you build for your specific model and problem distribution.
2
For teams managing multiple model scales or deployment targets
The β-parameterization pattern is worth adopting independent of the specific controller. Collapsing a multi-threshold configuration into one knob makes it far easier to hand the operating-point decision to infrastructure or product teams. They need to know what β satisfies their cost budget, not what an exponential moving average threshold does.
3
For researchers developing reasoning strategies
The replay environment pattern separates controller evaluation from model inference cleanly. Any iterative strategy development that currently requires live LLM calls for evaluation is slower than it needs to be, if the task distribution is stable enough to pre-collect traces. The discovery cost savings come directly from this architectural separation, not from the specific controller that was found.
4
A note on scope
The evidence here is strong for structured math reasoning. The pattern is plausible for other verifiable tasks like code generation or formal proofs, where correct-answer signals exist. Treat the results as good evidence for those settings and look for additional validation before applying to tasks without clear correctness criteria.

Where to go
from here.

Concrete steps to engage with the paper and try the approach yourself.

1
Read the paper
Zheng, T., Liu, H., Huang, C., Bao, H., Zhang, S., Liu, R., Dai, R., Chen, R., Liu, C., Xiong, T., Wu, X., Zhang, H., & Huang, H. (2026). LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling. University of Maryland, University of Virginia, UNC & collaborators. arXiv:2605.08083.
2
Access the code
The official AutoTTS repository is at github.com/zhengkid/AutoTTS. It includes the replay environment construction pipeline, the CMC controller implementation, and instructions for running the discovery loop on your own trace cache.
3
Build a trace cache for your model and task
Collect 50 to 100 representative problems from your target task distribution. Run your base model on each with high N (64 or more parallel samples). Store the traces in the format the replay environment expects. This is the one step that requires live model inference; everything downstream is cache-only.
4
Run controller discovery at β=0.5 first
Start with the middle operating point (β=0.5) as a baseline. Evaluate the discovered controller on a held-out problem set from the same distribution and compare token usage and accuracy against SC@64 and any existing strategy you use. β=0.5 is the operating point at which the paper reports its matched-accuracy token savings; move toward β=1.0 when peak accuracy is the primary constraint.
5
Validate on out-of-distribution problems before deploying
Following the paper's methodology, test the discovered controller on problem sets drawn from different distributions than the training cache (different benchmark, different difficulty tier). Generalization on AIME25 and HMMT25 was strong in the paper's results, but your task domain may have different transfer properties. Confirm before committing to the discovered controller in production.
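For step 3, the sketch below shows one way to collect and persist traces as JSON Lines. The record layout, function names, and `sample_fn` signature are assumptions for illustration only; the AutoTTS repository defines the format its replay environment actually expects, so check that before building a real cache.

```python
import io
import json


def collect_traces(problem_id, prompt, truth, sample_fn, n=64):
    """Sample n traces for one problem.

    sample_fn is a hypothetical callable: prompt -> (answer, token_count).
    This is the only place a live model would be called.
    """
    traces = []
    for _ in range(n):
        answer, tokens = sample_fn(prompt)
        traces.append({"answer": answer, "tokens": tokens})
    return {"id": problem_id, "truth": truth, "traces": traces}


def write_cache(records, fp):
    """Write one JSON object per line to an open text file."""
    for record in records:
        fp.write(json.dumps(record) + "\n")


def read_cache(fp):
    """Read the records back, skipping blank lines."""
    return [json.loads(line) for line in fp if line.strip()]
```

Keeping a separate file per distribution (discovery set, held-out set, out-of-distribution set) makes the validation in steps 4 and 5 a matter of pointing the same evaluation at a different cache.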