First surfaced in Tandemly Briefing — 2026-05-13.
When the tool set scales, retrieval breaks first.
Researchers built the first MCP-native benchmark with 150+ interdependent stateful tools across seven domains. The result: three reproducible failure modes that explain why agents performing well on simple benchmarks collapse in production. The failure modes have names now, which gives teams building production agents a vocabulary for diagnosis.
Benchmarks test what production doesn't look like.
Most agent benchmarks offer agents a small set of independent, stateless tools in a static environment. Production systems do not work that way.
A typical agent benchmark provides ten to twenty tools. The tools don't affect each other. The environment doesn't change between calls. When something goes wrong, the cause is usually obvious. These conditions make benchmarks tractable, but they also mean that benchmark performance tells you very little about how an agent will behave once you wire it up to a real system.
In production, agents connect through MCP (Model Context Protocol, the now-standard protocol for linking AI assistants to external tools and data sources) to dozens or hundreds of tools. Those tools maintain state. A scheduling action changes what the calendar availability tool returns. An inventory update changes what the order fulfillment API reports. A failed search leaves context that affects how subsequent retrieval tools interpret the next query. Calling tools in the wrong order, or assuming a tool call succeeded without verifying, can silently corrupt everything that follows.
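To make the statefulness concrete, here is a minimal sketch of two tools that share one piece of state. It is plain Python, not MCP SDK code, and the class and method names (`CalendarSandbox`, `book_slot`, `list_availability`) are hypothetical, chosen only to mirror the scheduling example above.

```python
from dataclasses import dataclass, field


@dataclass
class CalendarSandbox:
    """Hypothetical in-memory calendar whose two tools share one state."""

    free_slots: set = field(default_factory=lambda: {"09:00", "10:00", "11:00"})

    def book_slot(self, slot: str) -> dict:
        # Write tool: mutates the shared state, and can fail.
        if slot not in self.free_slots:
            return {"ok": False, "error": f"{slot} is not available"}
        self.free_slots.remove(slot)
        return {"ok": True, "booked": slot}

    def list_availability(self) -> list:
        # Read tool: its answer depends on every earlier book_slot call.
        return sorted(self.free_slots)


calendar = CalendarSandbox()
before = calendar.list_availability()
first = calendar.book_slot("10:00")
second = calendar.book_slot("10:00")  # identical call, different outcome
after = calendar.list_availability()
```

The second `book_slot` call is identical to the first and still fails, because the first one changed the state. An agent that ignores the `ok` field carries a wrong picture of the calendar into every later step.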
There was no benchmark built to test agents under these conditions. Without one, you can't systematically measure how agents fail at scale. And without a shared vocabulary for the failure modes, teams observing breakdowns in deployment have no framework for describing what they're seeing, let alone fixing it.
What happens when we build a benchmark that actually matches production conditions: large tool sets, stateful tools, interdependent state, and environments that can fail in controlled and reproducible ways? What failure modes emerge, and do they generalize across models?
150 tools, 7 domains, seed-controlled failures.
ComplexMCP is the first benchmark built on MCP itself. The researchers constructed seven sandbox domains, each containing tools that maintain persistent state and affect the state of other tools in the same domain.
The seven domains cover the kinds of tool-heavy workflows where agents actually get deployed: project management, scheduling, e-commerce operations, file handling, communication, code execution, and search-driven knowledge retrieval. Within each domain, the tools are interdependent. Updating a task status changes what the project dashboard returns. Booking a calendar slot changes what availability queries surface. The tools are not isolated components that can be tested one at a time.
Crucially, the researchers introduced seed-controlled environmental perturbations, including simulated API failures. This matters more than it might seem. Most benchmarks either inject failures randomly (making results hard to compare across models) or not at all. Seed control means the same failure scenario can be reproduced across every model being tested, making comparative results meaningful. A model that struggles with a specific API failure pattern can be compared directly against another model under exactly the same conditions.
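The idea behind seed control can be sketched in a few lines. This is an illustration of the general technique, not the benchmark's actual harness, whose implementation is not described here. The names `failure_schedule`, `flaky_tool`, `run_scenario`, and `ToolUnavailable` are assumptions made for the example.

```python
import random


class ToolUnavailable(Exception):
    """Simulated API failure (hypothetical name)."""


def failure_schedule(seed: int, n_calls: int, failure_rate: float) -> list:
    # A private RNG seeded per scenario: the same seed always yields
    # the same sequence of pass/fail decisions.
    rng = random.Random(seed)
    return [rng.random() < failure_rate for _ in range(n_calls)]


def flaky_tool(should_fail: bool) -> str:
    # Stand-in for a real tool call that the harness may perturb.
    if should_fail:
        raise ToolUnavailable("simulated API failure")
    return "ok"


def run_scenario(seed: int, n_calls: int = 10, failure_rate: float = 0.3) -> list:
    outcomes = []
    for should_fail in failure_schedule(seed, n_calls, failure_rate):
        try:
            outcomes.append(flaky_tool(should_fail))
        except ToolUnavailable:
            outcomes.append("failed")
    return outcomes
```

Calling `run_scenario(7)` twice returns the same list both times, so two models evaluated on seed 7 face failures at exactly the same call positions. Using a dedicated `random.Random` instance rather than the module-level generator keeps the schedule independent of any other randomness in the process.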
Building the benchmark on MCP rather than a custom tool interface means the results are directly relevant to teams using MCP in production. The failure modes surfaced are not artifacts of a proprietary test harness. They emerge from the same protocol layer that production agents use.
Three failure modes, each with a name.
Three patterns emerged consistently enough across frontier models to name. Each describes a distinct mechanism, and knowing which mode a failure belongs to points toward a different fix.
When the tool count is high, agents fail at the first step: identifying the correct tool to call. The retrieval mechanism that finds candidate tools from the task description hits a point where the density of similar-sounding options causes the agent to select the wrong tool or to hedge across multiple tools without committing. This failure happens before any planning or execution begins. It is architectural, not reasoning-based. The fix is not to prompt the agent better; it's to improve how tools are surfaced and disambiguated at retrieval time.
This failure mode is invisible in benchmarks with small tool sets, because ten distinct tools don't compete with each other the way 150 overlapping ones do.
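A toy example shows why density matters. The sketch below scores tools by word overlap between the query and each tool description, then measures the gap between the best and second-best candidate. The lexical scorer is a deliberately simple stand-in for whatever retriever an agent actually uses, and every tool name and description is invented for illustration.

```python
def tokens(text: str) -> set:
    return set(text.lower().split())


def score(query: str, description: str) -> float:
    # Jaccard overlap between query words and description words.
    q, d = tokens(query), tokens(description)
    return len(q & d) / len(q | d)


def top_margin(query: str, catalog: dict) -> tuple:
    # Rank tools by score and report how far ahead the winner is.
    ranked = sorted(catalog, key=lambda name: score(query, catalog[name]), reverse=True)
    best, runner_up = ranked[0], ranked[1]
    margin = score(query, catalog[best]) - score(query, catalog[runner_up])
    return best, margin


small_catalog = {
    "create_task": "create a new task in a project",
    "send_email": "send an email message to a recipient",
    "run_code": "execute a python code snippet",
}

# Same catalog plus three near-neighbours of create_task.
large_catalog = dict(small_catalog)
large_catalog.update({
    "create_subtask": "create a new subtask in a project task",
    "create_task_template": "create a new task template in a project",
    "clone_task": "create a copy of a task in a project",
})

query = "create a new task in the project"
small_best, small_margin = top_margin(query, small_catalog)
large_best, large_margin = top_margin(query, large_catalog)
```

In both catalogs the right tool still ranks first, but its lead over the runner-up shrinks sharply once similar-sounding tools are added. A thin margin is the condition under which a small change in query wording flips the selection, so one plausible mitigation is to treat low-margin retrievals as a signal to disambiguate before calling anything.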
Agents assume a tool call succeeded without checking its output or the downstream state it should have changed. In a stateless environment with independent tools, this is rarely catastrophic: the next tool call either works or fails visibly. In a stateful interdependent system, assuming success and moving forward corrupts subsequent steps in ways that can be traced back to the original unverified call only with careful inspection. The failure looks like a reasoning error or a wrong final answer. The actual cause is a skipped check several steps earlier.
The name is precise: the agent isn't uncertain about whether it checked. It is confident it succeeded, and that confidence is what causes it to skip the verification step entirely.
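One way to guard against this is to pair every state-changing call with an explicit postcondition check. The wrapper below is a minimal sketch under that assumption. `call_and_verify`, `UnverifiedToolCall`, and the `reserve_stock` tool are hypothetical names, and the `drop_write` flag exists only to simulate an API that acknowledges a request without applying it.

```python
from typing import Callable


class UnverifiedToolCall(Exception):
    """Raised when a call cannot be confirmed to have taken effect."""


def call_and_verify(action: Callable[[], dict], postcondition: Callable[[], bool]) -> dict:
    result = action()
    # Check 1: did the tool itself report success?
    if not result.get("ok", False):
        raise UnverifiedToolCall(f"tool reported failure: {result}")
    # Check 2: did the downstream state actually change as expected?
    if not postcondition():
        raise UnverifiedToolCall("tool reported success but postcondition does not hold")
    return result


inventory = {"widget": 5}


def reserve_stock(item: str, qty: int, drop_write: bool = False) -> dict:
    # Hypothetical tool. With drop_write=True it returns ok without
    # touching the inventory, imitating a silently lost write.
    if not drop_write:
        inventory[item] -= qty
    return {"ok": True}
```

With this in place, `call_and_verify(lambda: reserve_stock("widget", 2), lambda: inventory["widget"] == 3)` returns normally, while the same call with `drop_write=True` raises at the step where the write was lost instead of surfacing several steps later as a wrong final answer.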
When agents encounter failures, they sometimes abandon the goal rather than trying an alternative path. This is distinct from rational early exit (recognizing that a goal is not achievable and stopping). Strategic defeatism happens when alternative paths exist and the agent doesn't explore them. The failure triggers resignation, not rerouting. The pattern is most visible when API failures are injected: an agent that could recover by trying a fallback tool or adjusting its approach instead stops and reports failure.
Frontier models showed this failure across the benchmark's perturbation scenarios, suggesting it is a property of how these models respond to error signals rather than a specific capability gap.
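The difference between rerouting and resignation can be expressed as control flow. In the sketch below, a goal is reported as unachievable only after every known path has been tried, which corresponds to the rational early exit described above. The function and tool names are hypothetical, and the orchestration is an assumption about how one might structure recovery, not something taken from the paper.

```python
from typing import Callable, Sequence, Tuple


class ToolError(Exception):
    """Hypothetical error raised by a failing tool."""


def try_alternatives(paths: Sequence[Tuple[str, Callable[[], str]]]) -> dict:
    # paths: ordered (name, tool) pairs, each of which can achieve the goal.
    errors = []
    for name, tool in paths:
        try:
            result = tool()
        except ToolError as exc:
            # Record the failure and move on to the next path.
            errors.append(f"{name}: {exc}")
            continue
        return {"status": "done", "via": name, "result": result, "errors": errors}
    # Only reached when every alternative has failed.
    return {"status": "unachievable", "errors": errors}


def primary_search() -> str:
    raise ToolError("simulated API failure")


def fallback_search() -> str:
    return "3 documents found"


outcome = try_alternatives([
    ("primary_search", primary_search),
    ("fallback_search", fallback_search),
])
```

Here the primary tool fails, the fallback succeeds, and the failure is kept in `errors` for later inspection. An agent exhibiting strategic defeatism behaves as if the list contained only the first entry.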
The benchmark tests specific MCP-connected tool environments. Findings generalize to production agent architectures built on MCP-connected tool sets of comparable scale, but the exact failure rates depend on the domain, the tool descriptions, and the model. The three failure modes are reproducible across the frontier models tested; they are not model-specific observations. Performance numbers vary by model and domain, so consult the paper for specific figures.
What this means for building production agents.
The most direct contribution of this paper is vocabulary. Teams observing failures in production agent deployments now have three named patterns to check against, each pointing to a different layer of the architecture.
Where to go from here.
Concrete next steps if you want to apply these findings or go deeper into the research.