Not every query
needs an agent.
Running a full autonomous agent on every request is expensive and often unnecessary. Researchers at Princeton, Michigan, and Tsinghua built a system that learns which queries deserve full agent execution and which the model can handle alone, starting from just a small seed set of examples.
First surfaced in Tandemly Briefing — 2026-05-15.
Agents are powerful.
They are also slow.
Full agent execution chains together tool calls, search steps, and multi-turn reasoning loops. For the queries that actually need that, it's worth it. For the ones that don't, it's pure overhead.
LLM agents have gotten remarkably capable. Give an agent a hard multi-step task and it will search the web, run code, call APIs, and iterate until it reaches an answer. That capability comes with a real cost: latency, compute spend, and all the things that can go wrong across a longer chain of execution.
The trouble is that many queries don't need any of that. A direct language model call, with no tool use and no planning loops, handles a substantial portion of what production systems receive. The hard part is knowing which queries are which before you've spent the compute to find out.
Existing routing approaches require labeled training data or rely on simple prompt heuristics. Neither handles the cold-start case well: what do you do when you're deploying a new agent and have no historical routing data yet? Most teams default to always running the agent, which is safe but wasteful, or always running the base model, which is cheap but misses queries that genuinely need the agent.
Can a routing system learn to distinguish agent-worthy queries from model-sufficient ones using only a small set of early examples, without any training or labeled data? And how well does that routing hold up when query phrasings shift, or when the topic domain changes entirely?
An experience memory
built from a seed set.
BoundaryRouter solves the cold-start problem in three steps: run both systems on a small set of representative queries, store what you learn, then retrieve the most relevant experiences when a new query arrives.
The core idea is to treat early query experience as a resource. Before deployment, you pick a modest seed set of queries that cover the range of difficulty you expect to see. You run both the plain language model and the full agent on every query in the seed set and record the outcome: which system did better, and by how much. This becomes the experience memory.
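The seed-set step described here can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Experience` record, `build_memory`, the scoring function, and the toy stand-ins for the model and the agent are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Experience:
    """One seed-set record: how each system scored on a query (hypothetical schema)."""
    query: str
    llm_score: float
    agent_score: float

    @property
    def winner(self) -> str:
        # Assumption: ties go to the cheaper system, the plain LLM call.
        return "agent" if self.agent_score > self.llm_score else "llm"

    @property
    def margin(self) -> float:
        # How much better the winning system did.
        return abs(self.agent_score - self.llm_score)


def build_memory(
    seed_queries: List[str],
    run_llm: Callable[[str], str],
    run_agent: Callable[[str], str],
    score: Callable[[str, str], float],
) -> List[Experience]:
    """Run both systems on every seed query and record the outcome."""
    memory = []
    for query in seed_queries:
        llm_score = score(query, run_llm(query))
        agent_score = score(query, run_agent(query))
        memory.append(Experience(query, llm_score, agent_score))
    return memory


# Toy stand-ins so the sketch runs without any model or agent (assumptions).
def fake_llm(query: str) -> str:
    return "direct answer"


def fake_agent(query: str) -> str:
    return "answer with tool evidence"


def fake_score(query: str, answer: str) -> float:
    # Pretend that only "latest ..." queries need tool evidence to be correct.
    if "latest" in query:
        return 1.0 if "tool" in answer else 0.2
    return 1.0


memory = build_memory(
    ["what is 2 + 2", "find the latest numpy release"],
    fake_llm,
    fake_agent,
    fake_score,
)
for exp in memory:
    print(exp.query, "->", exp.winner, round(exp.margin, 2))
```

Storing the margin alongside the winner keeps the "by how much" signal available at routing time, so a narrow win can count for less than a decisive one.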
At inference time, when a new query arrives, BoundaryRouter retrieves the most similar past experiences from that memory. It then applies what the paper calls rubric-guided reasoning: a structured scoring process that evaluates the new query against the retrieved examples to decide which system is more likely to handle it well. The rubric is not just similarity matching. It factors in the type of reasoning the query requires and whether the agent's additional capabilities are actually relevant to that type.
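A rough sketch of the retrieve-then-reason flow follows. Everything in it is an assumption made for illustration: the memory entries, the `task` field, the token-overlap similarity (a stand-in for whatever retriever the paper uses), and the numeric rubric, which only mimics the idea of weighing task type and capability relevance and is not the paper's rubric.

```python
import re
from typing import Dict, List

# Hypothetical experience memory. The "task" labels are assumed, not from the paper.
MEMORY: List[Dict] = [
    {"query": "what is the capital of france",
     "winner": "llm", "margin": 0.0, "task": "recall"},
    {"query": "define gradient descent in one sentence",
     "winner": "llm", "margin": 0.1, "task": "recall"},
    {"query": "find the latest stable release of numpy and summarize changes",
     "winner": "agent", "margin": 0.6, "task": "lookup"},
    {"query": "run this script and report the failing test",
     "winner": "agent", "margin": 0.8, "task": "execution"},
]

# Assumed rubric input: task types where the agent's tools are actually relevant.
AGENT_RELEVANT_TASKS = {"lookup", "execution"}


def tokens(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def similarity(a: str, b: str) -> float:
    """Jaccard overlap of word sets, a simple stand-in for embedding similarity."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve(query: str, memory: List[Dict], k: int = 2) -> List[Dict]:
    """Return the k most similar past experiences."""
    ranked = sorted(memory, key=lambda e: similarity(query, e["query"]), reverse=True)
    return ranked[:k]


def route(query: str, memory: List[Dict], k: int = 2) -> str:
    """Score retrieved experiences with a toy rubric and pick a system."""
    score = 0.0
    for exp in retrieve(query, memory, k):
        vote = 1.0 if exp["winner"] == "agent" else -1.0
        # Rubric check: an agent win only counts when the task type is one
        # where the agent's extra capabilities matter.
        if exp["winner"] == "agent" and exp["task"] not in AGENT_RELEVANT_TASKS:
            vote = 0.0
        # Weight by similarity and by how decisive the past outcome was.
        score += similarity(query, exp["query"]) * (0.5 + exp["margin"]) * vote
    # Default to the cheaper system unless the evidence favors the agent.
    return "agent" if score > 0 else "llm"


print(route("what is the capital of germany", MEMORY))
print(route("run the script and report which test is failing", MEMORY))
```

The point of the second stage is that a nearby example does not decide the route by itself: its vote is filtered by whether the agent's capabilities apply to that kind of task.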
To evaluate routing robustness, the authors built RouteBench with three splits: in-domain queries similar to the seed set, paraphrased queries with different wording but the same underlying task, and out-of-domain queries from topic areas not in the seed set. The three-split structure tests whether routing generalizes beyond surface-level similarity.
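Reporting accuracy per split, rather than as one pooled number, is what makes the robustness question visible. The sketch below assumes a hypothetical example format (`query`, `label`, `split`) and a deliberately naive keyword router; neither the data nor the split identifiers come from RouteBench itself.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def accuracy_by_split(examples: List[Dict], router: Callable[[str], str]) -> Dict[str, float]:
    """Fraction of queries routed to the correct system, computed per split."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        total[ex["split"]] += 1
        if router(ex["query"]) == ex["label"]:
            correct[ex["split"]] += 1
    return {split: correct[split] / total[split] for split in total}


# Hypothetical examples; "label" is the system that should handle the query.
EXAMPLES = [
    {"split": "in_domain", "query": "find the latest numpy release", "label": "agent"},
    {"split": "in_domain", "query": "what is 2 + 2", "label": "llm"},
    {"split": "paraphrased", "query": "look up the newest numpy version", "label": "agent"},
    {"split": "paraphrased", "query": "what does two plus two equal", "label": "llm"},
    {"split": "out_of_domain", "query": "book the cheapest flight to Oslo", "label": "agent"},
]


def keyword_router(query: str) -> str:
    # A surface-level heuristic: it only recognizes the exact word "latest".
    return "agent" if "latest" in query else "llm"


results = accuracy_by_split(EXAMPLES, keyword_router)
for split, acc in results.items():
    print(f"{split}: {acc:.2f}")
```

A router that keys on surface wording looks fine in-domain and falls apart once the phrasing or the topic changes, which is exactly the failure the paraphrased and out-of-domain splits are meant to expose.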
60% less time.
Better accuracy too.
BoundaryRouter doesn't just cut cost. It improves accuracy over always using the base model and outperforms simpler routing approaches by a meaningful margin.
Retrieval-only routing, which matches new queries to past experiences purely by similarity, outperforms simple prompt heuristics. But BoundaryRouter's rubric-guided reasoning step adds another 8.2 percentage points on top of that. The rubric isn't decorative: it captures something about query type and capability requirements that raw similarity misses.
The gap between prompt-based routing and BoundaryRouter is larger, at 37.9 percentage points. Pure prompt heuristics, without the experience memory, leave a lot of routing quality on the table.
Routing quality does degrade on out-of-domain queries compared to in-domain performance. The method generalizes better than retrieval-only approaches, which the paper attributes to the rubric's focus on task type and capability requirements rather than surface similarity. But a seed set that doesn't represent the full difficulty range of your deployment will produce routing errors. The authors are upfront about this limitation.
The experiments test routing between a plain LLM call and a single agent pipeline. Production systems often involve more options: multiple agents with different capabilities, tiered compute plans, or hybrid retrieval paths. How BoundaryRouter scales to more complex routing topologies is not directly addressed. The authors also note that seed set quality is the binding constraint on routing performance.
Who benefits, and
what they should do.
The cost gap between LLM inference and full agent execution is large and real. If you're not routing between them, the question is whether you're over-spending or under-serving, not whether you could improve.
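To judge whether you are over-spending, a back-of-the-envelope model helps. The sketch below is an assumption-laden illustration: the function name, the latencies, the router overhead, and the share of queries sent to the agent are all made-up numbers, not figures from the paper.

```python
def expected_latency(
    agent_fraction: float,
    llm_latency_s: float,
    agent_latency_s: float,
    router_overhead_s: float = 0.0,
) -> float:
    """Average per-query latency when a router sends a fraction of traffic to the agent."""
    if not 0.0 <= agent_fraction <= 1.0:
        raise ValueError("agent_fraction must be between 0 and 1")
    return (
        router_overhead_s
        + agent_fraction * agent_latency_s
        + (1.0 - agent_fraction) * llm_latency_s
    )


# Hypothetical deployment numbers, chosen only to show the calculation.
always_agent = expected_latency(1.0, llm_latency_s=2.0, agent_latency_s=30.0)
routed = expected_latency(0.3, llm_latency_s=2.0, agent_latency_s=30.0, router_overhead_s=0.5)
savings = 1.0 - routed / always_agent

print(f"always agent: {always_agent:.1f}s, routed: {routed:.1f}s, saved: {savings:.0%}")
```

Plugging in your own measured latencies and the share of traffic that truly needs the agent shows how much headroom routing has. It says nothing about accuracy, which has to be measured separately.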
Where to go
from here.
Concrete next steps for applying or extending this work.