One agent thinking,
not five.
First surfaced in Tandemly Briefing — 2026-05-06.
Two researchers at Stanford ran a clean experiment. They gave a single language model and several multi-agent setups the exact same compute budget, then asked them to do hard reasoning. The crowd of agents did not pull ahead. In most settings the lone agent won, sometimes by a lot.
Compute, not coordination,
was doing the work.
A short version of the finding before we get into how they showed it.
Recent papers have reported that orchestrating several language models together (multi-agent systems, or MAS) outperforms putting one model on the job (single-agent systems, or SAS). Stanford researchers Dat Tran and Douwe Kiela noticed that those comparisons rarely controlled for one obvious thing. Multi-agent systems usually burn more thinking tokens. They run more model calls, generate more intermediate text, and consume more compute. Of course they do better. They are bigger.
So Tran and Kiela held the thinking-token budget constant and ran the experiment again. They tested three model families (Qwen3, DeepSeek, and Gemini 2.5), five multi-agent architectures (sequential, debate, ensemble, parallel-roles, subtask-parallel), and two multi-hop reasoning benchmarks (FRAMES and MuSiQue). At meaningful budgets, single-agent reasoning matched or beat every multi-agent variant. The advantage held across model sizes and benchmark choice.
This matters because the default architecture for serious agentic work has quietly drifted toward "use more agents." That drift was paid for partly by real coordination gains, and partly by more compute that no one was counting.
The benchmark didn't
pay for itself.
Multi-agent benchmarks have been winning. The catch is that a system with five agents in a debate loop is not running the same workload as a single agent answering once. Counting wins without counting compute makes the comparison meaningless.
The agentic-AI literature has spent the past two years describing how to chain language models together. There are planners that decompose tasks. Debaters that argue. Ensembles that vote. Critic models that catch their teammates' mistakes. Each paper reports that its multi-agent setup beats a single model on some task. The implicit framing has been clear: more agents, more reasoning, better answers.
What was rarely controlled for was the cost. A two-debater system with a critic and an aggregator runs four model calls before producing an answer. Each call generates its own chain of thought. The total number of thinking tokens spent is not comparable to a single model that gets one shot. So when MAS wins, you cannot tell whether the architecture is smarter or just more expensive.
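To make the asymmetry concrete, here is a back-of-the-envelope sketch. The per-call numbers are invented for illustration, not taken from the paper.

```python
# Invented per-call thinking-token counts for a debate-style MAS versus
# a single agent answering the same question once.
mas_calls = {"debater_1": 1500, "debater_2": 1500, "critic": 1200, "aggregator": 800}
sas_call = {"single_agent": 1500}

mas_total = sum(mas_calls.values())  # 5000 thinking tokens
sas_total = sum(sas_call.values())   # 1500 thinking tokens

print(f"MAS thinking tokens: {mas_total}")
print(f"SAS thinking tokens: {sas_total}")
# An accuracy comparison that ignores this gap is really comparing
# 5000 tokens of reasoning against 1500.
```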
Anthropic's own write-up on its multi-agent research system noted as much last year. So did several follow-up studies that began to control for tokens. But the field still lacked a clean theoretical case for why the comparison should go either way, and a careful empirical sweep across architectures and models. That gap is what this paper fills.
If you give a single agent and a multi-agent system the exact same number of thinking tokens, which one is actually better at multi-hop reasoning, and why? And under what conditions does the multi-agent setup deserve the extra coordination overhead?
Same budget,
different orchestrations.
The setup is conceptually simple. Pick a thinking-token budget. Give it to a single agent. Give the same total budget to a multi-agent system, split among its workers. Ask both the same hard question. Compare answers.
The authors define a thinking-token budget as the total tokens a system can use for intermediate reasoning, not counting the prompt or the final answer. This is the part that scales with how much "thought" goes into a problem. Holding it constant is the experimental discipline that prior work skipped.
For benchmarks, they used FRAMES and the four-hop subset of MuSiQue. Both ask multi-hop questions where the model has to chain several facts together to reach the answer. They ran the comparison across three model families. Qwen3-30B and DeepSeek-R1-Distill-Llama-70B are open-source. Gemini 2.5 Flash and Pro are closed. They tested at six budget levels, from 100 thinking tokens (essentially nothing) up to 10,000.
The single-agent setup was a single call: one prompt, one continuous reasoning trace, one answer. The multi-agent side was richer, with five distinct designs.
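A minimal sketch of that discipline, assuming a generic call_model helper and an even split of the budget; the role names and the splitting rule are illustrative, not the authors' implementation.

```python
def call_model(prompt: str, thinking_budget: int) -> str:
    """Stand-in for one LLM call whose intermediate reasoning is capped
    at `thinking_budget` tokens. Replace with a real API call."""
    raise NotImplementedError


def run_sas(question: str, budget: int) -> str:
    # One prompt, one continuous reasoning trace, the full budget.
    return call_model(prompt=question, thinking_budget=budget)


def run_sequential_mas(question: str, budget: int) -> str:
    # Same total budget, split across the pipeline stages, so both
    # systems spend the same amount of intermediate reasoning.
    roles = ["decomposer", "solver", "aggregator"]
    per_agent = budget // len(roles)
    context = question
    answer = ""
    for role in roles:
        answer = call_model(prompt=f"[{role}]\n{context}", thinking_budget=per_agent)
        context = f"{context}\n\n[{role} output]\n{answer}"
    return answer
```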
Tran and Kiela also offer an information-theoretic argument grounded in the Data Processing Inequality. The intuition is short: a multi-agent system's intermediate messages are a function of the original context, and any function of the context can only lose information, never add to it. So a single agent with the full context is at least as well-positioned as any multi-agent system summarizing parts of that context to itself. Multi-agent designs become competitive only when the single agent cannot effectively use the full context. That prediction shows up later in the degradation experiments.
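In symbols, the argument runs roughly as follows (my paraphrase; the paper's notation may differ).

```latex
% Let Y be the correct answer, X the full context, and M = f(X) the
% intermediate messages a multi-agent system produces from that context.
% Since M depends on Y only through X, Y -> X -> M is a Markov chain,
% and the Data Processing Inequality gives
\[
  I(Y; M) \;\le\; I(Y; X).
\]
% Any agent that answers from the messages M has at most as much
% answer-relevant information as an agent that answers from X itself,
% so the single agent with the full context cannot lose on information
% grounds. The bound stops biting once the single agent can no longer
% exploit X effectively, which is exactly the degraded-context regime.
```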
One agent,
thinking longer.
Across model families, datasets, and budget levels, the same pattern shows up. Once compute is held constant, the multi-agent advantage does not survive.
Above the 100-token floor (a level at which neither approach really reasons), the single-agent system is either the best system or statistically indistinguishable from the best across every model family and dataset combination. It also wins at lower compute. Multi-agent variants spend their token budget on inter-agent messages, planner overhead, and aggregation passes. That overhead is real and it is not free.
The single-agent setup also consumes far fewer thinking tokens than any multi-agent variant while reaching the same or better accuracy. Same answer quality, less compute.
For most models and architectures, accuracy improves rapidly between 500 and 2000 thinking tokens, then plateaus. Beyond a few thousand thinking tokens, throwing more compute at the problem stops helping. The plateau is roughly the same height for SAS and the best MAS variants. So the question is not whether more compute helps. It is whether multi-agent structure spends compute more efficiently than a single chain of thought. The data says it does not.
Among the multi-agent variants, Debate is the most consistent performer, often the strongest MAS architecture and occasionally tying with SAS. Parallel-roles is competitive on FRAMES and on Gemini Pro. Ensemble is weaker at low and medium budgets and only becomes competitive at very high budgets on certain Gemini configurations.
The authors' theory predicts a specific failure mode for single agents: when the model's effective context utilization degrades, the single-agent advantage should shrink. They tested this directly. They corrupted the input context for Qwen3-30B in four ways: random deletion, random masking, substitution with noise, and addition of distractor sentences. Then they reran SAS versus Sequential MAS at a fixed 1000-token budget.
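A rough sketch of what those corruptions might look like in code. The corruption rate, mask token, and noise vocabulary here are guesses for illustration; the paper's exact procedure may differ.

```python
import random

def corrupt(context: str, mode: str, rate: float = 0.7,
            distractors: list[str] | None = None) -> str:
    """Apply one of four corruption modes to a whitespace-tokenized context."""
    tokens = context.split()

    def hit() -> bool:
        return random.random() < rate

    if mode == "deletion":          # drop tokens at random
        tokens = [t for t in tokens if not hit()]
    elif mode == "masking":         # replace tokens with a mask symbol
        tokens = ["[MASK]" if hit() else t for t in tokens]
    elif mode == "substitution":    # replace tokens with random noise
        noise_vocab = ["lorem", "ipsum", "qux", "zxcv"]
        tokens = [random.choice(noise_vocab) if hit() else t for t in tokens]
    elif mode == "distractor":      # splice in irrelevant sentences
        for sentence in (distractors or []):
            tokens.insert(random.randrange(len(tokens) + 1), sentence)
    return " ".join(tokens)
```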
Under heavy substitution (corrupting 70% of the context with random tokens), Sequential MAS pulled ahead of SAS. Under masking, the two systems converged at moderate noise levels and Sequential won at heavy noise. This is the regime where multi-agent structure earns its keep: the single agent gets confused by the noisy context, while the structured pipeline filters and decomposes its way to a more reliable answer. In clean-context settings, that regime is rare.
One of the more uncomfortable findings is buried in the appendix. The authors discovered that Gemini's thinkingBudget parameter, which is supposed to cap how many tokens the model uses for reasoning, does not behave like a hard cap. It is documented as a guide. Actual visible-thought output often falls well below the requested budget, and the API-reported token count does not always match the visible reasoning text. In practice, this means budget-controlled comparisons that rely on Gemini's reported budgets can quietly under-credit single-agent systems and over-credit multi-agent ones, since multi-agent systems make multiple calls and surface more reasoning text under the same nominal budget.
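If you want to probe this behavior yourself, a minimal check might look like the sketch below. It assumes the google-genai Python SDK surface at the time of writing (ThinkingConfig, usage_metadata.thoughts_token_count, the thought flag on response parts); those names may differ across SDK versions, and the sample question is just a stand-in.

```python
# Compare the requested thinking budget against what the API reports and
# what is actually visible in the returned thought text.
from google import genai
from google.genai import types

client = genai.Client()  # expects an API key in the environment
requested_budget = 1000

resp = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Which river flows through the capital of the country "
             "that hosted the 1998 World Cup?",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(
            thinking_budget=requested_budget,
            include_thoughts=True,
        )
    ),
)

reported = resp.usage_metadata.thoughts_token_count
visible = " ".join(
    part.text
    for part in resp.candidates[0].content.parts
    if getattr(part, "thought", False)
)

print(f"requested budget:              {requested_budget}")
print(f"API-reported thinking tokens:  {reported}")
print(f"visible thought text (words):  {len(visible.split())}")
# If the reported count and the visible text diverge, or both land far
# below the requested budget, the budget is acting as a guide, not a cap.
```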
The authors flag this as a methodological hazard for the field. Several published MAS comparisons are likely affected.
The study focuses on multi-hop world-knowledge reasoning, not tool use, code generation, or open-ended planning. It tests three model families and two benchmarks. The authors are explicit that their claim is bounded: SAS dominates under matched budgets and proper context utilization, and MAS becomes competitive when single-agent context use deteriorates enough. The general claim is not "multi-agent is always wrong." It is "the burden of proof has shifted." Anyone reporting MAS gains now needs to show those gains under matched compute.
What this means
for building agents.
If you are deciding whether to ship a multi-agent system, this paper changes the default. It does not say multi-agent is bad. It says the proof that you need it has to be more careful than it has been.
Where to go
from here.
If you want to dig into the details.