Agent Architecture · Reasoning Benchmarks

One agent thinking,
not five.

First surfaced in Tandemly Briefing — 2026-05-06.

Two researchers at Stanford ran a clean experiment. They gave a single language model and several multi-agent setups the exact same compute budget, then asked them to do hard reasoning. The crowd of agents did not pull ahead. In most settings the lone agent won, sometimes by a lot.

Core finding
When you measure thinking tokens, not wall-clock latency or invocation count, the multi-agent advantage on multi-hop reasoning largely vanishes. A single agent with the full context is the strongest default.

Compute, not coordination,
was doing the work.

A short version of the finding before we get into how they showed it.

Recent papers have reported that orchestrating several language models together (multi-agent systems, or MAS) outperforms putting one model on the job (single-agent systems, or SAS). Stanford researchers Dat Tran and Douwe Kiela noticed that those comparisons rarely controlled for one obvious thing. Multi-agent systems usually burn more thinking tokens. They run more model calls, generate more intermediate text, and consume more compute. Of course they do better. They are bigger.

So Tran and Kiela held the thinking-token budget constant and ran the experiment again. They tested three model families (Qwen3, DeepSeek, and Gemini 2.5), five multi-agent architectures (sequential, debate, ensemble, parallel-roles, subtask-parallel), and two multi-hop reasoning benchmarks (FRAMES and MuSiQue). At meaningful budgets, single-agent reasoning matched or beat every multi-agent variant. The advantage held across model sizes and benchmark choice.

This matters because the default architecture for serious agentic work has quietly drifted toward "use more agents." That drift was driven partly by real coordination gains, and partly by extra compute that no one was counting.

The benchmark didn't
pay for itself.

Multi-agent benchmarks have been winning. The catch is that a system with five agents in a debate loop is not running the same workload as a single agent answering once. Counting wins without counting compute makes the comparison meaningless.

The agentic-AI literature has spent the past two years describing how to chain language models together. There are planners that decompose tasks. Debaters that argue. Ensembles that vote. Critic models that catch their teammates' mistakes. Each paper reports that its multi-agent setup beats a single model on some task. The implicit framing has been clear: more agents, more reasoning, better answers.

What was rarely controlled for was the cost. A two-debater system with a critic and an aggregator runs four model calls before producing an answer. Each call generates its own chain of thought. The total number of thinking tokens spent is not comparable to what a single model spends on its one shot. So when MAS wins, you cannot tell whether the architecture is smarter or just more expensive.

Anthropic's own write-up on its multi-agent research system noted as much last year. So did several follow-up studies that began to control for tokens. But the field still lacked a clean theoretical case for why the comparison should go either way, and a careful empirical sweep across architectures and models. That gap is what this paper fills.

The question this paper asks

If you give a single agent and a multi-agent system the exact same number of thinking tokens, which one is actually better at multi-hop reasoning, and why? And under what conditions does the multi-agent setup deserve the extra coordination overhead?

Same budget,
different orchestrations.

The setup is conceptually simple. Pick a thinking-token budget. Give it to a single agent. Give the same total budget to a multi-agent system, split among its workers. Ask both the same hard question. Compare answers.

The authors define a thinking-token budget as the total tokens a system can use for intermediate reasoning, not counting the prompt or the final answer. This is the part that scales with how much "thought" goes into a problem. Holding it constant is the experimental discipline that prior work skipped.
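To make the accounting concrete, here is a minimal sketch of how a matched-budget comparison counts tokens. The `AgentResponse` record and the whitespace tokenizer are illustrative assumptions, not the paper's implementation; a real audit would use the model's own tokenizer.

```python
from dataclasses import dataclass

@dataclass
class AgentResponse:
    prompt: str       # input context and question (not counted)
    reasoning: str    # intermediate chain of thought (counted)
    answer: str       # final answer text (not counted)

def count_tokens(text: str) -> int:
    # Placeholder tokenizer: whitespace split. Swap in the model's
    # real tokenizer for an actual audit.
    return len(text.split())

def thinking_tokens(responses: list[AgentResponse]) -> int:
    # The budget covers only intermediate reasoning, not the prompt
    # or the final answer. For a multi-agent system, sum over every
    # call the system made, including planner and aggregator passes.
    return sum(count_tokens(r.reasoning) for r in responses)

sas = [AgentResponse("Q?", "step one ... step four", "Paris")]
mas = [
    AgentResponse("plan Q?", "split into two sub-questions", "subtasks"),
    AgentResponse("sub-q 1", "fact lookup reasoning", "fact A"),
    AgentResponse("aggregate", "combine fact A with ...", "Paris"),
]
print(thinking_tokens(sas), thinking_tokens(mas))
```

The point of the discipline: the MAS total sums across all of its calls, so it is that total, not the per-call figure, that must match the single agent's budget.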

For benchmarks, they used FRAMES and the four-hop subset of MuSiQue. Both ask multi-hop questions where the model has to chain several facts together to reach the answer. They ran the comparison across three model families. Qwen3-30B and DeepSeek-R1-Distill-Llama-70B are open-source. Gemini 2.5 Flash and Pro are closed. They tested at six budget levels, from 100 thinking tokens (essentially nothing) up to 10,000.

The single-agent setup was a single call: one prompt, one continuous reasoning trace, one answer. The multi-agent side was richer, with five distinct designs.

Sequential
Closest analogue to a single agent
A planner decomposes the question into ordered steps. Workers handle each step in turn, each with a slice of the budget. An aggregator synthesizes the final answer. Same shape as one chain of thought, but with the chain externalized as messages between agents.

Subtask-parallel
Divide and conquer
A planner identifies a few independent sub-questions. Workers answer them in parallel under equal budget splits. An aggregator combines outputs.

Parallel-roles
Specialized cognition
The full question goes to four role-specialized workers: a Solver, a Second Solver, a Fact Extractor, and a Skeptic. Each contributes a perspective. An aggregator synthesizes.

Debate
Adversarial reasoning
Two debaters answer independently, then critique each other. An aggregator returns a final answer informed by both attempts and both critiques. Often the strongest MAS variant in the results.

Ensemble
Sample and select
Multiple workers answer the same question independently with higher sampling temperature, generating diverse candidates. A judge selects the best one.

SAS-L (single-agent variant)
More structured thinking, same budget
A single agent with a longer pre-answer scaffold: identify ambiguities, propose interpretations, evaluate, and only then answer. Same budget, slightly more deliberate output. Helps mainly with Gemini.
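The structural difference between the two sides of the comparison can be sketched in a few lines. The `Model` interface and the stub below are hypothetical stand-ins for an LLM call; the point is only how a shared budget gets sliced once planner and aggregator calls are counted.

```python
from typing import Callable

# Hypothetical model interface: takes a prompt and a per-call thinking
# budget, returns an answer string. A real run would wrap an LLM API.
Model = Callable[[str, int], str]

def single_agent(model: Model, question: str, budget: int) -> str:
    # One call, one continuous reasoning trace, the whole budget.
    return model(question, budget)

def sequential_mas(model: Model, question: str, budget: int,
                   n_workers: int = 3) -> str:
    # Planner, workers, and aggregator share the same total budget,
    # so each call reasons with only a slice of it.
    slice_ = budget // (n_workers + 2)  # +2 for planner and aggregator
    plan = model(f"Decompose: {question}", slice_)
    steps = [model(f"Step {i} of plan '{plan}'", slice_)
             for i in range(n_workers)]
    return model(f"Synthesize an answer from: {steps}", slice_)

# Stub model for illustration: just reports the budget it was given.
stub: Model = lambda prompt, budget: f"answer(budget={budget})"
print(single_agent(stub, "Q?", 2000))    # whole 2000-token budget
print(sequential_mas(stub, "Q?", 2000))  # each call gets a 400-token slice
```

Under a matched total, every extra agent shrinks the slice available to each reasoning trace, which is where the coordination overhead comes from.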
The theory before the data

Tran and Kiela also offer an information-theoretic argument grounded in the Data Processing Inequality. The intuition is short: a multi-agent system's intermediate messages are a function of the original context, and any function of the context can only lose information, never add to it. So a single agent with the full context is at least as well-positioned as any multi-agent system summarizing parts of that context to itself. Multi-agent designs become competitive only when the single agent cannot effectively use the full context. That prediction shows up later in the degradation experiments.
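The argument can be stated in one line; the notation below is mine, not the paper's exact formulation.

```latex
% Context X, inter-agent messages M = f(X), final MAS answer A.
% Because A is produced from M alone, X -> M -> A is a Markov chain,
% and the Data Processing Inequality gives
\[
  I(X; A) \;\le\; I(X; M) \;\le\; H(X),
\]
% so the messages, and hence the answer built from them, carry no
% more information about the context than the context itself. A
% single agent conditioning on X directly never faces this bound.
```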

One agent,
thinking longer.

Across model families, datasets, and budget levels, the same pattern shows up. Once compute is held constant, the multi-agent advantage does not survive.

Prior framing
Multi-agent systems are smarter. Coordination, specialization, and debate produce better answers than one model alone. Use multi-agent designs by default for hard reasoning.
What the data shows
Multi-agent systems are bigger. When thinking tokens are held constant, a single agent matches or outperforms every multi-agent variant on multi-hop reasoning. The "smarter" part was largely the "bigger" part.
Average accuracy across all models & datasets
Higher is better. Each row holds the thinking-token budget constant across all systems.

Budget          SAS (single agent)   Best MAS variant
500 tokens      0.39                 0.38
1000 tokens     0.42                 0.38
2000 tokens     0.42                 0.40
5000 tokens     0.43                 0.41
10000 tokens    0.43                 0.40
Finding 1: Single agents win at every meaningful budget

Above 100 thinking tokens (where neither approach really reasons), the single-agent system is either the best system or statistically indistinguishable from the best across every model family and dataset combination. It also wins at lower compute. Multi-agent variants spend their token budget on inter-agent messages, planner overhead, and aggregation passes. That overhead is real and it is not free.

The single-agent setup also consumes far fewer thinking tokens than any multi-agent variant while reaching the same or better accuracy. Same answer quality, less compute.

Finding 2: Coordination has diminishing returns

For most models and architectures, accuracy improves rapidly between 500 and 2000 thinking tokens, then plateaus. Beyond a few thousand thinking tokens, throwing more compute at the problem stops helping. The plateau is roughly the same height for SAS and the best MAS variants. So the question is not whether more compute helps. It is whether multi-agent structure spends compute more efficiently than a single chain of thought. The data says it does not.

Among the multi-agent variants, Debate is the most consistent performer, often the strongest MAS architecture and occasionally tying with SAS. Parallel-roles is competitive on FRAMES and on Gemini Pro. Ensemble is weaker at low and medium budgets and only becomes competitive at very high budgets on certain Gemini configurations.

Finding 3: Multi-agent systems do help, but only when context breaks

The authors' theory predicts a specific failure mode for single agents: when the model's effective context utilization degrades, the single-agent advantage should shrink. They tested this directly. They corrupted the input context for Qwen3-30B in four ways: random deletion, random masking, substitution with noise, and addition of distractor sentences. Then they reran SAS versus Sequential MAS at a fixed 1000-token budget.

Under heavy substitution (corrupting 70% of the context with random tokens), Sequential MAS pulled ahead of SAS. Under masking, the two systems converged at moderate corruption and Sequential won once the masking became heavy. This is the regime where multi-agent structure earns its keep: the single agent gets confused by the noisy context, while the structured pipeline filters and decomposes its way to a more reliable answer. In clean-context settings, that regime is rare.
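The four degradation modes are easy to replicate on your own data. This sketch works at the word level for simplicity; the paper's exact corruption granularity and parameters may differ.

```python
import random

def corrupt(context: str, mode: str, rate: float, seed: int = 0) -> str:
    """Degrade a context in the spirit of the paper's four ablations
    (names and word-level granularity here are illustrative)."""
    rng = random.Random(seed)
    words = context.split()
    noise = lambda: "".join(rng.choice("abcdefgh") for _ in range(5))
    if mode == "deletion":          # drop a fraction of tokens
        words = [w for w in words if rng.random() >= rate]
    elif mode == "masking":         # replace tokens with a mask symbol
        words = [("[MASK]" if rng.random() < rate else w) for w in words]
    elif mode == "substitution":    # swap tokens for random strings
        words = [(noise() if rng.random() < rate else w) for w in words]
    elif mode == "distractors":     # append irrelevant noise tokens
        words = words + [noise() for _ in range(int(len(words) * rate))]
    return " ".join(words)

ctx = "Marie Curie won the Nobel Prize in Physics in 1903"
print(corrupt(ctx, "masking", 0.7, seed=1))
```

Running SAS and a Sequential pipeline over outputs like these, at a fixed budget and increasing `rate`, is enough to locate the crossover point on your own workload.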

Finding 4: API budget controls are not always honest

One of the more uncomfortable findings is buried in the appendix. The authors discovered that Gemini's thinkingBudget parameter, which is supposed to cap how many tokens the model uses for reasoning, does not behave like a hard cap. It is documented as a guide. Actual visible-thought output often falls well below the requested budget, and the API-reported token count does not always match the visible reasoning text. In practice, this means budget-controlled comparisons that rely on Gemini's reported budgets can quietly under-credit single-agent systems and over-credit multi-agent ones, since multi-agent systems make multiple calls and surface more reasoning text under the same nominal budget.

The authors flag this as a methodological hazard for the field. Several published MAS comparisons are likely affected.
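If you log raw API responses, the hazard is straightforward to screen for. The record fields below are an illustrative schema of my own, not any vendor's actual response format, and the whitespace count is only a crude proxy for real token counts.

```python
def audit_budget(records: list[dict], tolerance: float = 0.1) -> list[dict]:
    """Flag logged responses where the reported thinking-token count
    diverges from what the visible reasoning text suggests, or
    exceeds the requested cap."""
    flagged = []
    for r in records:
        visible = len(r["reasoning_text"].split())  # crude token proxy
        reported = r["reported_thinking_tokens"]
        requested = r["requested_budget"]
        # A soft-hint budget shows up as reported counts well away
        # from the visible trace, or spilling past the requested cap.
        if (reported > requested
                or abs(visible - reported) > tolerance * max(reported, 1)):
            flagged.append({**r, "visible_estimate": visible})
    return flagged

logs = [
    {"requested_budget": 1000, "reported_thinking_tokens": 240,
     "reasoning_text": "short trace " * 5},
]
print(audit_budget(logs))  # reported count far exceeds visible trace
```

Any non-empty result means the nominal budget cannot be trusted as the compute measure, and the comparison needs instrumented token counts instead.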

Scope and limitations

The study focuses on multi-hop world-knowledge reasoning, not tool use, code generation, or open-ended planning. It tests three model families and two benchmarks. The authors are explicit that their claim is bounded: SAS dominates under matched budgets and proper context utilization, and MAS becomes competitive when single-agent context use deteriorates enough. The general claim is not "multi-agent is always wrong." It is "the burden of proof has shifted." Anyone reporting MAS gains now needs to show those gains under matched compute.

What this means
for building agents.

If you are deciding whether to ship a multi-agent system, this paper changes the default. It does not say multi-agent is bad. It says the proof that you need it has to be more careful than it has been.

1
For builders shipping reasoning systems
Default to a single agent with a generous thinking budget for multi-hop reasoning over clean inputs. Reach for multi-agent orchestration when the input context is noisy, adversarial, or too long for the model to use effectively. The crossover happens at heavy degradation, not at moderate complexity. If your context is well-curated, the simpler architecture is likely the better one.
2
For researchers comparing architectures
Report thinking tokens, not just wall-clock time or invocation count. Treat the Gemini thinkingBudget parameter as a soft hint rather than a binding cap, and instrument actual token usage. If your MAS beats SAS only because it spent more compute, the headline is about compute, not architecture.
3
For engineering leaders evaluating agentic platforms
Many vendor pitches lean on "we use multi-agent" as a quality signal. Ask what they are spending in tokens to get there. If their multi-agent advantage disappears at matched compute, you may be paying for orchestration overhead that does not buy you anything. The interesting question is whether the system uses its compute well, not how many agents it has.
4
A note on what the paper does not claim
This is a study of multi-hop reasoning, not of every agentic task. Multi-agent systems still have plausible advantages in tool-use settings where parallel calls genuinely save wall-clock time, in tasks that benefit from explicit role separation, and in domains where the context is too messy for any single pass to handle. The finding sharpens, but does not erase, the case for multi-agent architectures. It just means the case has to be made on the merits, not on the benchmark.

Where to go
from here.

If you want to dig into the details.

1
Read the paper
Tran, D., & Kiela, D. (2026). Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets. Stanford University. arXiv:2604.02460.
2
Audit one of your own MAS comparisons
Pick a benchmark where your multi-agent setup beats a single-agent baseline. Re-run the comparison with thinking tokens held constant. If the win disappears, the architecture was not the differentiator. The exercise takes a few hours and tends to clarify roadmaps.
3
Try the paper's two benchmarks
FRAMES (Krishna et al., 2025) and MuSiQue 4-hop (Trivedi et al., 2022) are the multi-hop reasoning benchmarks used here. Both are openly available and good stress tests for any reasoning system you are building.
4
Read the related budget-aware studies
Wang et al. (2024), Han et al. (2025) on token-budget-aware reasoning, and Anthropic's "How we built our multi-agent research system" (2025) all touch the same question from different angles. The Cemri et al. (2025) paper on why multi-agent systems fail is a useful companion piece.
5
Test the context-degradation regime
If you suspect your real workload involves noisy or adversarial input, replicate the paper's degradation study on your own data. Run SAS and Sequential MAS with masked or substituted context. The crossover point tells you when the orchestration overhead actually starts paying for itself.