First surfaced in Tandemly Briefing — 2026-05-18.

Agent Architecture · Inference-Time Scaling

When you can check the answer,
weak models catch up.

Researchers at MIT and Texas A&M connected a familiar engineering pattern, running many AI attempts and picking the best one, to the classical theory of boosting. The result is a formal account of when and why agent committees work, and a clear explanation of when they don't.

Core concept
The local verifier condition: an inference-time committee of weak models can match frontier-model accuracy when the task provides a way to evaluate candidate answers without access to ground truth, for example by running tests, checking proofs, or validating against constraints.

More agents don't
automatically help.

There's a common belief in AI engineering: if one model call might be wrong, run several and pick the best. That belief is sometimes right, but nobody had formally explained when it holds or why it fails.

Inference-time orchestration has become a standard tool for AI builders. Run the same task through multiple model calls, then use a critic or comparator to select the best output. It's the logic behind best-of-N sampling, self-consistency decoding, and agent committee designs. Teams apply it across code generation, mathematical reasoning, and information extraction, often without a clear theory of when it earns its overhead.

The problem is that the intuition breaks down in specific, predictable ways. Sometimes a pool of eight model calls produces one correct answer and the system cannot identify it. Sometimes more diversity in the pool means more varied wrong answers rather than better coverage of the solution space. Machine learning has had names for the underlying failure modes for decades: poor coverage, weak local identifiability, slow progress, and low diversity. They had not been formally connected to the inference-time orchestration patterns that practitioners were already building.

The question Sunkaraneni, Beneventano, Neumarker, Poggio, and Galanti set out to answer: under what formal conditions can a committee of weaker models reliably match or beat a single stronger model? The answer turned out to hinge on a property of the task, not just the models.

The question this paper asks

When does running multiple weak models in a committee genuinely outperform running one strong model? And when it fails, which of four identifiable conditions is responsible?

Boosting theory, applied
to agent pools.

The team connected inference-time committees to classical boosting, a technique from 1990s machine learning that combines many weak classifiers into a strong one. They formalized what had been a collection of ad hoc engineering practices.

Classical boosting, developed through AdaBoost and related algorithms, shows that you can reach high classification accuracy by running many imperfect models in sequence, each compensating for the errors of those before it. The key guarantee: if each model is consistently slightly better than random guessing, you can combine them into something much stronger. The question the researchers asked was whether the same framework could describe what happens when you run multiple LLM calls and pick the best answer.
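For readers who want the guarantee in symbols, the standard AdaBoost training-error bound from Freund and Schapire (1997) is a useful reference point. It comes from the classical supervised setting, not from the paper discussed here. The notation is an assumption introduced for this illustration: \epsilon_t is the weighted error of the t-th weak classifier, \gamma_t is its edge over random guessing, and H is the combined classifier after T rounds.

```latex
\epsilon_t = \tfrac{1}{2} - \gamma_t,
\qquad
\widehat{\mathrm{err}}(H)
  \;\le\; \prod_{t=1}^{T} 2\sqrt{\epsilon_t\,(1 - \epsilon_t)}
  \;=\; \prod_{t=1}^{T} \sqrt{1 - 4\gamma_t^{2}}
  \;\le\; \exp\!\Bigl(-2 \sum_{t=1}^{T} \gamma_t^{2}\Bigr)
```

Even a small but consistent edge drives the training error down exponentially in the number of rounds, which is the sense in which "slightly better than random" compounds into "much stronger."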

To formalize this, they decomposed committee behavior into four properties. Each describes a condition the committee must satisfy for boosting to work, and each points to a different failure mode and a different fix.

1
Coverage
Does at least one model in the pool produce a correct answer? Coverage can be amplified by repeated sampling and diverse prompting strategies. It is the necessary condition, not the sufficient one.
2
Local identifiability
Can the system recognize which candidate answer is correct, without access to the ground-truth label? This is the hard part. It requires the task to supply a local soundness signal: execution results, proof checker output, type system feedback, test suite pass rates, or constraint solver verdicts.
3
Progress
Does each proposal attempt meaningfully advance toward the solution? Poor progress means the committee is generating varied attempts that don't build toward a correct answer, often a symptom of a weak proposer or a poorly structured task decomposition.
4
Diversity
Do different models in the pool cover different parts of the solution space? Without diversity, adding more proposers to the committee provides diminishing returns. The committee produces varied output, but the variation is in style rather than in what's attempted.
The critical distinction: coverage vs identifiability

Coverage is the easy part: with enough calls, at least one will often be correct. Identifiability is the hard part: how does the system know which one? In classical supervised boosting, labels are provided during training. In inference-time orchestration, there is no label. Something in the task environment must serve as the signal. The paper proves that without a local soundness signal, you can amplify coverage but you cannot build a reliable critic or comparator.
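The asymmetry between coverage and identifiability can be made concrete with a small calculation. The sketch below assumes each call is independently correct with probability p, which is a simplification made for this illustration and not a claim from the paper; real samples from the same model are correlated. The function names and the value p = 0.3 are hypothetical.

```python
def coverage_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts is correct."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if k < 1:
        raise ValueError("k must be at least 1")
    return 1.0 - (1.0 - p) ** k


def blind_pick_accuracy(p: float, k: int) -> float:
    """Accuracy when one of the k candidates is chosen uniformly at random.

    With no verifier, every candidate is equally likely to be picked,
    so the expected accuracy stays at p no matter how large k gets.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return p


p = 0.3  # hypothetical per-call success probability
for k in (1, 2, 4, 8):
    print(k, round(coverage_at_k(p, k), 3), blind_pick_accuracy(p, k))
```

Under these assumptions coverage rises to about 0.94 at k = 8 while blind selection stays at 0.3. The whole difference has to be recovered by something that can tell candidates apart, which is the role of the local soundness signal.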

67% solo, 76.4% with
a committee. Same nano model.

On SWE-bench Verified, a benchmark of real software engineering tasks from open-source GitHub issues, the empirical results matched the theory cleanly.

Single nano model
67.0%
SWE-bench Verified baseline
k=8 committee, same nano model
76.4%
Matches Gemini 3 Pro, Claude Opus 4.5 Thinking
Oracle best-of-8 ceiling
79.0%
Upper bound if verifier is perfect

Running the same nano-scale model eight times, with a critic-comparator orchestration layer using the test suite as the local verifier, brought accuracy from 67.0% to 76.4%. That result matches the standalone performance of Gemini 3 Pro and Claude Opus 4.5 Thinking, both orders of magnitude larger. The key enabler: SWE-bench tasks come with test suites. Those test suites provide exactly the local soundness signal the theory requires for identifiability.

The gap between the orchestrated committee (76.4%) and the oracle ceiling (79.0%) represents identifiability failures: cases where a correct answer was in the pool but the critic-comparator could not reliably identify it. The 2.6-point gap is a measurement of how often local soundness signals are imperfect, even in domains with test suites. Not every test suite catches every class of error.
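The three reported figures can be combined to see how much of the available headroom the committee captured. This is plain arithmetic on the numbers quoted above; the variable names are illustrative.

```python
solo = 67.0        # single nano model, percent
committee = 76.4   # k=8 committee, percent
oracle = 79.0      # oracle best-of-8 ceiling, percent

gain = committee - solo      # points gained by orchestration
headroom = oracle - solo     # points available with a perfect verifier
missed = oracle - committee  # points lost to identifiability failures
captured = gain / headroom   # share of the headroom actually realized

print(f"gain: {gain:.1f} points")          # gain: 9.4 points
print(f"headroom: {headroom:.1f} points")  # headroom: 12.0 points
print(f"missed: {missed:.1f} points")      # missed: 2.6 points
print(f"captured: {captured:.1%}")         # captured: 78.3%
```

Read this way, the test-suite verifier recovers roughly four fifths of what a perfect selector would, and the remaining fifth is the room left for a better critic or richer tests.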

Common assumption
Bigger model, better results. When accuracy needs to improve, scale up to a larger model. Agent committees are nice-to-have, not a primary cost lever.
What the paper shows
Verifier quality is the binding constraint. For tasks with local verifiers, a pool of small models can match frontier accuracy. The committee is a cost lever, not just an accuracy booster.
The diagnostic value of the framework

When a committee underperforms expectations, the four properties tell you where to look. Low coverage: add more diverse proposers or change prompting strategies. Identifiability bottleneck: improve the critic or invest in a richer local verifier. Slow progress: restructure how proposers approach the task. Low diversity: vary the prompting, temperature, or model variants in the pool. Each diagnosis points to a different intervention.
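One way to turn this diagnostic into numbers is to log a small audit set and compute a rate per property. The sketch below is an illustration, not the paper's measurement protocol. The `TaskRecord` fields, the exact-match diversity proxy, and the 0.7 floor are all assumptions. Progress is left out because it has no obvious single-number proxy.

```python
from dataclasses import dataclass


@dataclass
class TaskRecord:
    candidates: list[str]  # outputs produced by the pool for one task
    correct: list[bool]    # offline audit label per candidate (hypothetical)
    chosen: int            # index the critic selected


def audit(records: list[TaskRecord]) -> dict[str, float]:
    if not records:
        raise ValueError("need at least one record")
    n = len(records)
    # Coverage: share of tasks where any candidate is correct.
    covered = [r for r in records if any(r.correct)]
    coverage = len(covered) / n
    # Identifiability: among covered tasks, how often the critic picked a correct one.
    picked_right = sum(1 for r in covered if r.correct[r.chosen])
    identifiability = picked_right / len(covered) if covered else 0.0
    # Diversity proxy: average share of distinct candidate strings per task.
    diversity = sum(len(set(r.candidates)) / len(r.candidates) for r in records) / n
    return {"coverage": coverage, "identifiability": identifiability, "diversity": diversity}


INTERVENTIONS = {
    "coverage": "add more diverse proposers or change prompting strategies",
    "identifiability": "improve the critic or invest in a richer local verifier",
    "diversity": "vary the prompting, temperature, or model variants in the pool",
}


def suggest(metrics: dict[str, float], floor: float = 0.7) -> list[str]:
    """Return the intervention for every metric below a hypothetical floor."""
    return [INTERVENTIONS[name] for name, value in metrics.items() if value < floor]


records = [
    TaskRecord(["a", "b", "c", "d"], [True, False, False, False], 0),
    TaskRecord(["a", "a", "b", "b"], [False, False, True, True], 0),
    TaskRecord(["x", "x", "x", "x"], [False, False, False, False], 0),
    TaskRecord(["p", "q", "r", "s"], [False, True, False, False], 1),
]
metrics = audit(records)
print(metrics)
print(suggest(metrics))
```

Exact string matching is a crude diversity measure. For code, comparing which tests each candidate passes would likely say more about whether the pool is attempting different things.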

Scope and limitations

The framework applies only to verifiable tasks: code with test suites, proofs with proof checkers, synthesis problems with validators, constraint satisfaction with solvers. Open-ended tasks without local feedback mechanisms, such as prose generation, strategic advice, or creative work, do not meet the local identifiability condition. The theory does not claim those tasks are unsolvable, only that the boosting analysis does not apply to them. The empirical results are on code repair (SWE-bench Verified); theorem proving and program synthesis are discussed theoretically.

What this means
for builders.

This paper formalizes a pattern many production teams already use informally. The contribution is less "here's a new technique" and more "here's when the technique you're using actually works." That distinction matters as much for predicting failures as for replicating successes.

1
Check for a local verifier before building the committee
Before designing a multi-model orchestration layer, determine whether your task has a local soundness signal. Tests for code. Proof checkers for formal reasoning. Validators for synthesis. Constraint solvers for planning. If you have one, a committee design can meaningfully improve on single-call accuracy. Without one, more calls are unlikely to reliably help, because the system has no way to identify which answer is correct.
2
Use the four properties as a diagnostic, not just a design guide
When an existing committee underperforms, run 20 to 30 tasks and ask each question in sequence. How often does any model in the pool produce a correct answer? How often does the system pick the wrong answer when a right one exists? Do successive attempts move closer to a solution? How different are the model outputs from each other? The answers point to different interventions: proposer diversity, critic quality, local verifier richness, or task decomposition structure.
3
Reconsider your cost-accuracy calculation for verifiable tasks
The SWE-bench results imply a qualitatively different cost calculation for tasks with test suites or proof checkers. A pool of small-model calls plus a critic layer may match frontier-model accuracy at substantially lower per-task cost. This requires measuring the actual cost and accuracy of both approaches on your workload, not extrapolating from provider benchmarks that don't include orchestration overhead.
4
For theorem proving and formal verification use cases
Proof assistants like Lean, Isabelle, and Coq are natural fits for this framework. The proof checker is the local verifier. A committee of smaller reasoning models, each proposing proof steps checked by the assistant, should benefit from the same boosting dynamic observed on code repair. This is an area where the framework has theoretical support but the empirical results are not yet in the paper; testing it empirically would be a high-value experiment.
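For the cost comparison in item 3, a back-of-the-envelope model helps keep orchestration overhead in the picture. Every price and accuracy below is a hypothetical placeholder to be replaced with measurements from your own workload. The cost model itself (k proposer calls, one verifier run per candidate, one critic call) is an assumption about how a committee might be billed, not a description of the paper's setup.

```python
def committee_cost(k: int, proposer_cost: float, verifier_cost: float,
                   critic_cost: float) -> float:
    """Per-task cost: k proposals, each verified once, plus one critic call."""
    return k * (proposer_cost + verifier_cost) + critic_cost


def cost_per_solved_task(cost_per_task: float, accuracy: float) -> float:
    """Expected spend per correctly solved task."""
    if not 0.0 < accuracy <= 1.0:
        raise ValueError("accuracy must be in (0, 1]")
    return cost_per_task / accuracy


# Hypothetical prices in dollars per task; substitute measured values.
committee_per_task = committee_cost(k=8, proposer_cost=0.01,
                                    verifier_cost=0.005, critic_cost=0.02)
frontier_per_task = 0.50

# Hypothetical accuracies; measure both on your own benchmark.
committee_accuracy = 0.76
frontier_accuracy = 0.76

print(round(cost_per_solved_task(committee_per_task, committee_accuracy), 3))
print(round(cost_per_solved_task(frontier_per_task, frontier_accuracy), 3))
```

Cost per solved task is one convenient way to put both designs on the same axis. Latency, and the engineering cost of maintaining the verifier, are not captured here and may matter just as much.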

Where to go
from here.

Next steps if you want to go deeper or test the ideas yourself.

1
Read the paper
Sunkaraneni, V., Beneventano, P., Neumarker, R., Poggio, T., & Galanti, T. (2026). Agentic Systems as Boosting Weak Reasoning Models. MIT & Texas A&M University. arXiv:2605.14163.
2
Run the pattern on a code task you already have
Take any coding task with a test suite. Run the same model four to eight times with varied temperatures or system prompts. Use test pass rate as the local verifier to select among outputs. Measure accuracy versus a single-call baseline. This is the minimum viable version of the committee pattern, and it tells you immediately whether your task has sufficient coverage and identifiability.
3
Audit your existing committee for the four properties
If you already run best-of-N sampling or a critic layer, measure each property directly. Sample 30 tasks and track: how often does any model in the pool produce a correct answer? How often does the critic pick the wrong one when a right one exists? How much do the pool's outputs actually differ? The answers tell you where your marginal investment belongs.
4
Compare cost profiles against frontier single-call baselines
Run the same benchmark with your committee design and with the frontier model you'd otherwise use. Calculate per-task cost and accuracy for both, including orchestration overhead in the committee cost. The committee may offer a better position on the cost-accuracy curve for verifiable tasks. For non-verifiable tasks, the frontier model is likely the cleaner choice.
5
Read the classical boosting literature for theoretical grounding
Schapire (1990) "The Strength of Weak Learnability" and Freund and Schapire (1997) "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting" are the theoretical predecessors the paper builds on. Understanding why boosting works in the classical supervised setting clarifies exactly what the inference-time version is borrowing and what conditions it must reconstruct without labels.
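The minimum viable version described in item 2 can be sketched in a few lines. To stay runnable without a model API, the candidates below are canned strings standing in for sampled model outputs. The `solve` entry point, the `TESTS` list, and the helper names are all hypothetical. Executing model-written code with `exec` is only acceptable inside a sandbox, and it is used here purely for illustration.

```python
# Hypothetical test suite: (arguments, expected result) pairs for solve(a, b).
TESTS = [((2, 3), 5), ((0, 0), 0), ((-1, 1), 0)]

# Stand-ins for outputs sampled at varied temperatures or system prompts.
CANDIDATES = [
    "def solve(a, b):\n    return a * b\n",
    "def solve(a, b):\n    return a - b\n",
    "def solve(a, b):\n    return a + b\n",
    "def solve(a, b) return a + b\n",  # deliberately malformed
]


def pass_rate(source: str, tests) -> float:
    """Local verifier: fraction of tests the candidate passes."""
    namespace = {}
    try:
        exec(source, namespace)  # sandbox this when running real model output
        fn = namespace["solve"]
    except Exception:
        return 0.0  # code that fails to load passes nothing
    passed = 0
    for args, expected in tests:
        try:
            if fn(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing test counts as a failure
    return passed / len(tests)


def best_of_n(candidates, tests):
    """Pick the candidate with the highest pass rate; ties go to the earliest."""
    scored = [(pass_rate(c, tests), -i) for i, c in enumerate(candidates)]
    score, neg_index = max(scored)
    return candidates[-neg_index], score


winner, score = best_of_n(CANDIDATES, TESTS)
print(score)
print(winner)
```

Swapping the canned list for real samples and comparing the selected answer's accuracy against a single call gives the baseline measurement. If the best pass rate is frequently below 1.0, coverage is the likely bottleneck; if many candidates tie at 1.0 yet some are still wrong, the test suite is too weak a verifier.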