First surfaced in Tandemly Briefing — 2026-05-13.

Agent Evaluation · MCP Benchmarks

When the tool set scales,
retrieval breaks first.

Researchers built the first MCP-native benchmark, with 150+ interdependent stateful tools across seven domains. It surfaced three reproducible failure modes that help explain why agents that perform well on simple benchmarks collapse in production. Those failure modes now have names, which gives teams building production agents a shared vocabulary for diagnosis.

Core concept
Tool-retrieval saturation: when agents must select from 150+ tools, finding the right one from overlapping descriptions becomes the first failure point. Everything downstream of a wrong tool selection is then built on a broken foundation.

Benchmarks test
what production doesn't look like.

Most agent benchmarks offer agents a small set of independent, stateless tools in a static environment. Production systems do not work that way.

A typical agent benchmark provides ten to twenty tools. The tools don't affect each other. The environment doesn't change between calls. When something goes wrong, the cause is usually obvious. These conditions make benchmarks tractable, but they also mean that benchmark performance tells you very little about how an agent will behave once you wire it up to a real system.

In production, agents connect through MCP (Model Context Protocol, a widely adopted open protocol for linking AI assistants to external tools and data sources) to dozens or hundreds of tools. Those tools maintain state. A scheduling action changes what the calendar availability tool returns. An inventory update changes what the order fulfillment API reports. A failed search leaves context that affects how subsequent retrieval tools interpret the next query. Calling tools in the wrong order, or assuming a tool call succeeded without verifying it, can silently corrupt everything that follows.

There was no benchmark built to test agents under these conditions. Without one, you can't systematically measure how agents fail at scale. And without a shared vocabulary for the failure modes, teams observing breakdowns in deployment have no framework for describing what they're seeing, let alone fixing it.

The question this paper asks

What happens when we build a benchmark that actually matches production conditions: large tool sets, stateful tools, interdependent state, and environments that can fail in controlled and reproducible ways? What failure modes emerge, and do they generalize across models?

150+ tools, 7 domains,
seed-controlled failures.

ComplexMCP is the first benchmark built on MCP itself. The researchers constructed seven sandbox domains, each containing tools that maintain persistent state and affect the state of other tools in the same domain.

The seven domains cover the kinds of tool-heavy workflows where agents actually get deployed: project management, scheduling, e-commerce operations, file handling, communication, code execution, and search-driven knowledge retrieval. Within each domain, the tools are interdependent. Updating a task status changes what the project dashboard returns. Booking a calendar slot changes what availability queries surface. The tools are not isolated components that can be tested one at a time.

Crucially, the researchers introduced seed-controlled environmental perturbations, including simulated API failures. This matters more than it might seem. Most benchmarks either inject failures randomly (making results hard to compare across models) or not at all. Seed control means the same failure scenario can be reproduced across every model being tested, making comparative results meaningful. A model that struggles with a specific API failure pattern can be compared directly against another model in the exact same conditions.

1
Scale: 150+ interdependent stateful tools
At this count, selecting the right tool from overlapping descriptions is no longer trivial. The retrieval mechanism that works at 20 tools may saturate at 150+, where multiple tools appear relevant to any given query and subtle semantic differences separate the correct call from a plausible-looking wrong one.
2
Interdependence: shared stateful environments
Tool calls within each domain affect a shared environment. This mirrors real systems and creates a failure dynamic that isolated tool benchmarks cannot surface: a wrong call doesn't just fail locally, it corrupts the state that later correct calls rely on. Debugging a stateful failure chain is categorically harder than debugging a stateless one.
3
Perturbations: seed-controlled API failures
The benchmark injects specific failure scenarios (API errors, unexpected responses) with seed control so the same conditions can be replicated across models. This allows valid comparison of how different frontier systems respond to the same failure, instead of averaging over random noise.
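
To make the seed-control idea concrete, here is a minimal sketch of a failure injector whose schedule is fully determined by its seed. It is not the ComplexMCP harness: `FailureInjector`, `SimulatedAPIError`, `failure_schedule`, and the `failure_rate` parameter are hypothetical names chosen for illustration.

```python
import random


class SimulatedAPIError(Exception):
    """Hypothetical error raised in place of a real tool response."""


class FailureInjector:
    """Wraps tool calls and fails a seed-determined subset of them."""

    def __init__(self, seed: int, failure_rate: float = 0.3):
        # A private Random instance keeps the schedule independent of
        # any other randomness in the process.
        self._rng = random.Random(seed)
        self._failure_rate = failure_rate

    def call(self, tool, *args, **kwargs):
        # One draw per call, so the Nth call fails (or not) identically
        # on every run that uses the same seed.
        if self._rng.random() < self._failure_rate:
            raise SimulatedAPIError(f"injected failure in {tool.__name__}")
        return tool(*args, **kwargs)


def failure_schedule(seed: int, n_calls: int) -> list[bool]:
    """Return, for each of n_calls calls, whether it fails under this seed."""
    injector = FailureInjector(seed)
    schedule = []
    for _ in range(n_calls):
        try:
            injector.call(lambda: "ok")
            schedule.append(False)
        except SimulatedAPIError:
            schedule.append(True)
    return schedule
```

Because the schedule depends only on the seed and the call order, two models run against seed 7 face the same failures at the same steps, which is what makes their results comparable.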
Why MCP-native matters

Building the benchmark on MCP rather than a custom tool interface means the results are directly relevant to teams using MCP in production. The failure modes surfaced are not artifacts of a proprietary test harness. They emerge from the same protocol layer that production agents use.

Three failure modes,
each with a name.

Three patterns emerged consistently enough across frontier models to name. Each describes a distinct mechanism, and knowing which mode a failure belongs to points toward a different fix.

What benchmarks assumed
Tool selection is straightforward. Give the agent a small set of clearly differentiated tools and it will pick the right one. Failures are mostly downstream: wrong reasoning, wrong plan, wrong output format.
What the benchmark found
Tool selection saturates at scale. At 150+ tools, retrieval becomes the first failure point. Downstream reasoning failures compound from there. The source of the break shifts from execution to selection.
Failure mode 1: Tool-retrieval saturation

When the tool count is high, agents fail at the first step: identifying the correct tool to call. The retrieval mechanism that finds candidate tools from the task description hits a point where the density of similar-sounding options causes the agent to select the wrong tool or to hedge across multiple tools without committing. This failure happens before any planning or execution begins. It is architectural, not reasoning-based. The fix is not to prompt the agent better; it's to improve how tools are surfaced and disambiguated at retrieval time.

This failure mode is invisible in benchmarks with small tool sets, because ten distinct tools don't compete with each other the way 150 overlapping ones do.
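
A toy illustration of why density matters, assuming a deliberately simple word-overlap (Jaccard) similarity in place of whatever retrieval a real agent uses. The tool names and descriptions are invented for this sketch. It measures the gap between the best-matching tool and the runner-up, which shrinks as near-duplicate descriptions are added.

```python
def tokens(text: str) -> set[str]:
    """Lowercased whitespace tokens; a stand-in for a real embedding."""
    return set(text.lower().split())


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity between two strings, in [0, 1]."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)


def top1_margin(query: str, descriptions: dict[str, str]) -> float:
    """Score gap between the best tool and the runner-up.

    Assumes at least two descriptions. A small margin means the
    retriever has little signal to separate the top candidates.
    """
    scores = sorted(
        (jaccard(query, text) for text in descriptions.values()),
        reverse=True,
    )
    return scores[0] - scores[1]


# Hypothetical tool sets: a small, clearly differentiated one ...
small_set = {
    "create_event": "create a calendar event",
    "send_email": "send an email message",
    "search_files": "search files by name",
}

# ... and the same set plus overlapping neighbours.
crowded_set = {
    **small_set,
    "create_recurring_event": "create a recurring calendar event",
    "create_event_reminder": "create a calendar event reminder",
    "update_event": "update a calendar event",
}

query = "create a calendar event"
margin_small = top1_margin(query, small_set)
margin_crowded = top1_margin(query, crowded_set)
```

In this toy setup the correct tool still ranks first in the crowded set, but its lead over the runner-up is far smaller. With noisier queries or less precise descriptions, that thin margin is where wrong selections come from.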

Failure mode 2: Over-confidence skipping verification

Agents assume a tool call succeeded without checking its output or the downstream state it should have changed. In a stateless environment with independent tools, this is rarely catastrophic: the next tool call either works or fails visibly. In a stateful interdependent system, assuming success and moving forward corrupts subsequent steps in ways that can be traced back to the original unverified call only with careful inspection. The failure looks like a reasoning error or a wrong final answer. The actual cause is a skipped check several steps earlier.

The name is precise: the agent is not unsure whether the call worked. It is confident the call succeeded, and that confidence is what leads it to skip the verification step entirely.
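
One way to counter this at the framework level is to pair every state-changing call with an explicit check of the state it should have changed. The sketch below assumes a hypothetical in-memory `Calendar` tool whose `book` method can report success without actually booking; `call_and_verify` and `VerificationError` are illustrative names, not part of MCP or the benchmark.

```python
from typing import Callable


class VerificationError(Exception):
    """Raised when a call returned but the expected state change is missing."""


def call_and_verify(action: Callable[[], object],
                    check: Callable[[], bool],
                    description: str) -> object:
    """Run a state-changing action, then confirm its effect before continuing."""
    result = action()
    if not check():
        raise VerificationError(
            f"{description}: state did not change as expected"
        )
    return result


class Calendar:
    """Hypothetical stateful tool that can fail without reporting it."""

    def __init__(self) -> None:
        self.booked: set[str] = set()
        self.fail_silently = False

    def book(self, slot: str) -> dict:
        # When fail_silently is set, the response still looks successful
        # even though nothing was booked.
        if not self.fail_silently:
            self.booked.add(slot)
        return {"status": "accepted"}

    def is_booked(self, slot: str) -> bool:
        return slot in self.booked
```

An agent loop would call `call_and_verify(lambda: cal.book("fri-10:00"), lambda: cal.is_booked("fri-10:00"), "book fri-10:00")` instead of trusting the `"accepted"` response, so a silent failure surfaces at the step where it happened.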

Failure mode 3: Strategic defeatism

When agents encounter failures, they sometimes abandon the goal rather than trying an alternative path. This is distinct from rational early exit (recognizing that a goal is not achievable and stopping). Strategic defeatism happens when alternative paths exist and the agent doesn't explore them. The failure triggers resignation, not rerouting. The pattern is most visible when API failures are injected: an agent that could recover by trying a fallback tool or adjusting its approach instead stops and reports failure.

Frontier models showed this failure across the benchmark's perturbation scenarios, suggesting it is a property of how these models respond to error signals rather than a specific capability gap.
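
The rerouting behavior that strategic defeatism lacks can be sketched as a loop that only reports failure once every alternative has been tried. This is a minimal illustration with hypothetical tool functions (`primary_search`, `cached_search`), not a description of how any particular agent framework handles errors.

```python
from typing import Callable, Sequence


class AllToolsFailed(Exception):
    """Raised only after every candidate tool has been tried."""


def run_with_fallbacks(
    candidates: Sequence[tuple[str, Callable[[], object]]],
) -> tuple[str, object]:
    """Try each (name, tool) pair in order; return the first success."""
    errors = []
    for name, tool in candidates:
        try:
            return name, tool()
        except Exception as exc:
            # Record the failure and move on to the next alternative
            # instead of abandoning the goal.
            errors.append(f"{name}: {exc}")
    raise AllToolsFailed("; ".join(errors))


def primary_search() -> list[str]:
    """Hypothetical tool that hits an injected API failure."""
    raise TimeoutError("simulated API timeout")


def cached_search() -> list[str]:
    """Hypothetical fallback tool that still works."""
    return ["doc-17"]
```

Calling `run_with_fallbacks([("primary_search", primary_search), ("cached_search", cached_search)])` returns the cached result. The collected error messages are kept, so a genuine dead end is still reported with the full list of what was attempted.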

Scope and limitations

The benchmark tests specific MCP-connected tool environments. The findings are most likely to carry over to production agent architectures built on MCP-connected tool sets of comparable scale, but the exact failure rates depend on the domain, the tool descriptions, and the model. The three failure modes were reproducible across the frontier models tested, so they are not observations about a single model. Performance numbers vary by model and domain, and the paper should be consulted for specific figures.

What this means
for building production agents.

The most direct contribution of this paper is vocabulary. Teams observing failures in production agent deployments now have three named patterns to check against, each pointing to a different layer of the architecture.

1
For teams running agents with large tool sets
Map your production failures against the three modes before assuming the issue is model capability. If the agent is consistently calling the wrong tool, that's tool-retrieval saturation. If it's proceeding after tool failures and producing cascading errors, that's over-confidence skipping verification. If it stops trying when it hits an error instead of exploring alternatives, that's strategic defeatism. Each mode points to a different fix.
2
For teams doing model selection for tool-heavy deployments
Standard reasoning benchmarks don't predict performance on large-scale tool environments. A model that performs well on tasks with 10-15 tools may rank very differently at 150+. The benchmark methodology here (MCP-native, stateful, interdependent, seed-controlled failures) is designed to sit closer to production conditions than small-tool-set evals do, so it is likely to be a better guide to production behavior. Consider building internal variants for your specific tool set before standardizing on a model.
3
For teams building agent evaluation infrastructure
The seed-controlled perturbation approach is the methodological contribution most worth adopting. The ability to reproduce specific failure scenarios means you can build regression tests around known failure modes rather than hoping they don't recur. Randomized failure injection produces noise; seeded failure injection produces comparable, debuggable results across model versions and configuration changes.
4
For architects designing tool retrieval systems
Tool-retrieval saturation suggests that the retrieval layer, not just the model, needs explicit design attention at production scale. Strategies worth investigating include hierarchical tool indexing (coarse category filter before fine-grained selection), clearer semantic differentiation in tool descriptions, and query-time disambiguation prompts that force the agent to articulate which tool it selected and why before calling it.
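
The hierarchical indexing idea in the last item can be sketched as a two-stage selector: a coarse category filter first, then fine-grained matching within the shortlist. Everything here is an assumption made for illustration: the `Tool` record, the keyword-based category filter, and the word-overlap score all stand in for whatever index and similarity measure a real retrieval layer would use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    category: str
    description: str


def words(text: str) -> set[str]:
    return set(text.lower().split())


def select_tool(query: str,
                tools: list[Tool],
                category_keywords: dict[str, set[str]]) -> Tool:
    """Two-stage selection: pick a category, then a tool within it."""
    q = words(query)
    # Stage 1: coarse filter. Choose the category whose keywords
    # overlap the query the most.
    best_category = max(
        category_keywords,
        key=lambda c: len(q & category_keywords[c]),
    )
    shortlist = [t for t in tools if t.category == best_category]
    if not shortlist:
        # No tools registered under that category: fall back to all tools.
        shortlist = tools
    # Stage 2: fine-grained match against the much smaller shortlist.
    return max(shortlist, key=lambda t: len(q & words(t.description)))


# Hypothetical tool catalogue spanning two domains.
TOOLS = [
    Tool("book_slot", "scheduling", "book a meeting slot on the calendar"),
    Tool("list_availability", "scheduling", "list free calendar slots"),
    Tool("book_shipment", "ecommerce", "book a shipment for an order"),
    Tool("list_orders", "ecommerce", "list customer orders"),
]
CATEGORY_KEYWORDS = {
    "scheduling": {"calendar", "meeting", "slot"},
    "ecommerce": {"order", "shipment", "customer"},
}

chosen = select_tool("book a meeting slot for Friday", TOOLS, CATEGORY_KEYWORDS)
```

The design choice is that similar-sounding tools from other domains (here `book_shipment`) never compete in the second stage, so the fine-grained comparison runs over a handful of candidates instead of the full catalogue.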

Where to go
from here.

Concrete next steps if you want to apply these findings or go deeper into the research.

1
Read the paper
Li, Yang, Wang et al. (2026). ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox. arXiv:2605.10787. The benchmark methodology, domain descriptions, and per-model performance breakdowns are the primary source for numbers and configuration details.
2
Audit your production tool set for retrieval saturation
Count the tools your agents have access to via MCP. If you're above 50, map how tool selection is currently handled. Ask whether your agent's tool retrieval is doing vector similarity search over descriptions, rule-based routing, or something else. Each approach has a different saturation profile and needs different mitigation at scale.
3
Build verification checkpoints into agent loops
Over-confidence skipping verification is partly an architectural choice. After each tool call in a stateful environment, require the agent to observe and confirm the result before proceeding. This can be a structured prompt step or a validation layer in the agent framework. The cost is a small number of additional tokens per step. The benefit is catching failed calls before they corrupt downstream state.
4
Add fallback routing to your agent's failure handling
Strategic defeatism is addressable at the framework level. When a tool call fails, the agent's error handling should include a step that explicitly asks: "Is there an alternative tool or approach that could accomplish this?" Building that question into the failure path, rather than leaving it to the model to generate spontaneously, reduces the rate at which agents give up when they shouldn't.
5
Explore the MCP ecosystem and protocol documentation
Understanding the protocol layer that this benchmark tests is useful context for interpreting the results. The Model Context Protocol specification and ecosystem are documented at modelcontextprotocol.io. The spec clarifies how tool descriptions, capabilities, and state are communicated between agent and server, which maps directly onto where the three failure modes originate.