Agent Architecture · Inference Optimization

Not every query
needs an agent.

Running a full autonomous agent on every request is expensive and often unnecessary. Researchers at Princeton, Michigan, and Tsinghua built a system that learns which queries deserve full agent execution and which the model can handle alone, starting from just a small seed set of examples.

Core concept

Cold-start routing: a training-free framework that builds an experience memory from early query examples to decide, per query, whether to route to a plain language model call or escalate to a full agent pipeline.


First surfaced in Tandemly Briefing — 2026-05-15.

01 The problem

Agents are powerful.
They are also slow.

Full agent execution chains together tool calls, search steps, and multi-turn reasoning loops. For the queries that actually need that, it's worth it. For the ones that don't, it's pure overhead.

LLM agents have gotten remarkably capable. Give an agent a hard multi-step task and it will search the web, run code, call APIs, and iterate until it reaches an answer. That capability comes with a real cost: latency, compute spend, and all the things that can go wrong across a longer chain of execution.

The trouble is that many queries don't need any of that. A direct language model call, with no tool use and no planning loops, handles a substantial portion of what production systems receive. The trick is knowing which queries are which before you've spent the compute to find out.

Existing routing approaches require labeled training data or rely on simple prompt heuristics. Neither handles the cold-start case well: what do you do when you're deploying a new agent and have no historical routing data yet? Most teams default to always running the agent, which is safe but wasteful, or always running the base model, which is cheap but misses queries that genuinely need the agent.

The question this paper asks

Can a routing system learn to distinguish agent-worthy queries from model-sufficient ones using only a small set of early examples, without any training or labeled data? And how well does that routing hold up when query phrasings shift, or when the topic domain changes entirely?

02 The experiment

An experience memory
built from a seed set.

BoundaryRouter solves the cold-start problem in three steps: run both systems on a small set of representative queries, store what you learn, then retrieve the most relevant experiences when a new query arrives.

The core idea is to treat early query experience as a resource. Before deployment, you pick a modest seed set of queries that cover the range of difficulty you expect to see. You run both the plain language model and the full agent on every query in the seed set and record the outcome: which system did better, and by how much. This becomes the experience memory.

At inference time, when a new query arrives, BoundaryRouter retrieves the most similar past experiences from that memory. It then applies what the paper calls rubric-guided reasoning: a structured scoring process that evaluates the new query against the retrieved examples to decide which system is more likely to handle it well. The rubric is not just similarity matching. It factors in the type of reasoning the query requires and whether the agent's additional capabilities are actually relevant to that type.

Seed set execution

A representative set of queries is run through both systems before deployment. The experience memory records which system performed better on each query and why. No training is required. The seed set is the only up-front investment.
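The seed-set step can be sketched as a small data structure plus one loop. This is a minimal illustration, not the paper's implementation: `Experience`, `build_memory`, and the `run_llm`, `run_agent`, and `score` callables are hypothetical names, and the sketch stores only scores, leaving out the "why" that the experience memory also records.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Experience:
    """One seed query with the score each system earned on it (hypothetical schema)."""
    query: str
    llm_score: float
    agent_score: float

    @property
    def winner(self) -> str:
        # Assumption: ties go to the cheaper system, the plain model.
        return "agent" if self.agent_score > self.llm_score else "llm"

    @property
    def margin(self) -> float:
        # How much better the winning system did.
        return abs(self.agent_score - self.llm_score)


def build_memory(
    seed_queries: List[str],
    run_llm: Callable[[str], str],
    run_agent: Callable[[str], str],
    score: Callable[[str, str], float],
) -> List[Experience]:
    """Run both systems on every seed query and record the two scores."""
    return [
        Experience(q, score(q, run_llm(q)), score(q, run_agent(q)))
        for q in seed_queries
    ]
```

Nothing here is trained. The only cost is running both systems once per seed query.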

Retrieval at inference

When a new query arrives, BoundaryRouter retrieves the most similar past experiences from memory. Similarity is computed against the seed queries, surfacing cases that most closely match the type of task the new query represents.
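A toy version of the retrieval step, for illustration only. The article does not say which similarity measure BoundaryRouter uses, so this sketch assumes a bag-of-words cosine similarity as a stand-in for whatever embedding model a real deployment would use. The function names and the dictionary layout of the memory entries are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in embedding: lowercase word counts (an assumption, not the paper's method)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, memory, k=3):
    """Return the k most similar memory entries as (similarity, entry) pairs."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(entry["query"])), entry) for entry in memory]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

Returning the similarity alongside each entry lets a later step weight close matches more heavily than distant ones.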

Rubric-guided routing decision

The retrieved examples feed a structured scoring step that evaluates whether the new query's requirements match the situations where the agent's extra capabilities actually helped. The router then sends the query to the appropriate system.
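The article does not reproduce the rubric itself, so the following is a deterministic stand-in that only illustrates the idea: a retrieved agent win counts as evidence only when the capability that made the agent win is one the new query also needs. The `route` function, the capability tags, the threshold, and the fallback are all assumptions.

```python
def route(required_caps, neighbors, threshold=0.5, fallback="agent"):
    """Decide 'llm' or 'agent' for a new query (illustrative rule, not the paper's rubric).

    required_caps: set of capabilities the new query seems to need,
                   e.g. {"web_search"}; empty if a direct answer should do.
    neighbors:     list of (similarity, winner, capability) tuples for the
                   retrieved experiences; capability is None for llm wins.
    """
    total = sum(sim for sim, _, _ in neighbors)
    if total == 0:
        # No usable evidence: fall back to the safe but expensive default.
        return fallback
    # Similarity-weighted share of neighbors where the agent won
    # because of a capability this query also requires.
    agent_evidence = sum(
        sim
        for sim, winner, cap in neighbors
        if winner == "agent" and cap in required_caps
    )
    return "agent" if agent_evidence / total > threshold else "llm"
```

With the same neighbors, a query that needs web search escalates while one that needs no tools stays on the plain model. That is the difference between this kind of check and pure similarity voting.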

RouteBench: three levels of difficulty

To evaluate routing robustness, the authors built RouteBench with three splits: in-domain queries similar to the seed set, paraphrased queries with different wording but the same underlying task, and out-of-domain queries from topic areas not in the seed set. The three-split structure tests whether routing generalizes beyond surface-level similarity.
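One way to apply the three-split idea to your own router is to report routing accuracy per split rather than as a single number. The sketch below assumes each evaluation record carries a split name, the router's decision, and the route that turned out best. The record layout and the `accuracy_by_split` name are hypothetical, and RouteBench's own metrics may differ.

```python
from collections import defaultdict


def accuracy_by_split(records):
    """Fraction of correct routing decisions per split.

    records: iterable of (split, predicted_route, best_route) tuples.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for split, predicted, best in records:
        totals[split] += 1
        hits[split] += int(predicted == best)
    return {split: hits[split] / totals[split] for split in totals}
```

A large gap between the in-domain number and the paraphrase or out-of-domain numbers suggests the router is leaning on surface similarity.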

03 Findings

60% less time.
Better accuracy too.

BoundaryRouter doesn't just cut cost. It improves accuracy over always using the base model and outperforms simpler routing approaches by a meaningful margin.

vs always-agent: 60.6% inference time reduction

vs always-LLM: 28.6% accuracy improvement

vs prompt-based routing: 37.9% performance improvement

Default approach

Always run the agent. Safe but expensive. Every query incurs the full cost of tool calls, planning loops, and multi-turn execution, regardless of whether the task needed any of it.

BoundaryRouter

Route per query. Queries the base model can handle go to the base model. Queries that need agent capabilities go to the agent. 60.6% of agent time saved. Accuracy improves over the always-LLM baseline.
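To see where a time saving of this size can come from, a back-of-envelope model helps. It assumes a fixed average latency per system and a fixed router overhead. The function names and all numbers in the usage note are illustrative, not taken from the paper.

```python
def expected_time(p_agent, t_llm, t_agent, t_router=0.0):
    """Average time per query when a fraction p_agent is escalated to the agent."""
    if not 0.0 <= p_agent <= 1.0:
        raise ValueError("p_agent must be between 0 and 1")
    return t_router + p_agent * t_agent + (1.0 - p_agent) * t_llm


def time_reduction(p_agent, t_llm, t_agent, t_router=0.0):
    """Relative saving compared with sending every query to the agent."""
    if t_agent <= 0:
        raise ValueError("t_agent must be positive")
    return 1.0 - expected_time(p_agent, t_llm, t_agent, t_router) / t_agent
```

For example, `time_reduction(p_agent=0.3, t_llm=2.0, t_agent=20.0, t_router=0.5)` gives a saving of roughly 60%. The saving shrinks as more queries need the agent or as the router itself gets slower.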

The rubric adds real signal beyond retrieval alone

Retrieval-only routing, which matches new queries to past experiences purely by similarity, outperforms simple prompt heuristics. But BoundaryRouter's rubric-guided reasoning step adds another 8.2 percentage points on top of that. The rubric isn't decorative: it captures something about query type and capability requirements that raw similarity misses.

The gap between prompt-based routing and BoundaryRouter is larger, at 37.9 percentage points. Pure prompt heuristics, without the experience memory, leave a lot of routing quality on the table.

OOD generalization: robust but not complete

Routing quality does degrade on out-of-domain queries compared to in-domain performance. The method generalizes better than retrieval-only approaches, which the paper attributes to the rubric's focus on task type and capability requirements rather than surface similarity. But a seed set that doesn't represent the full difficulty range of your deployment will produce routing errors. The authors acknowledge this limitation.

Scope and assumptions

The experiments test routing between a plain LLM call and a single agent pipeline. Production systems often involve more options: multiple agents with different capabilities, tiered compute plans, or hybrid retrieval paths. How BoundaryRouter scales to more complex routing topologies is not directly addressed. The authors also note that seed set quality is the binding constraint on routing performance.

04 Practical takeaways

Who benefits, and
what they should do.

The cost gap between LLM inference and full agent execution is large and real. If you're not routing between them, the question is whether you're over-spending or under-serving, not whether you could improve.

For developers building agent pipelines

Profile your current agent usage before adding a router. Audit what fraction of queries actually needed agent-specific capabilities (tool calls, multi-step reasoning, API access) versus what a direct model call could have handled. If fewer than 30% genuinely required the agent, routing is likely to pay off substantially. BoundaryRouter's seed-set approach gives you a path to deploy routing on day one, without waiting to accumulate historical labels.

For teams in cold-start conditions

The cold-start framing is the most directly applicable contribution here. If you've been deferring routing because you don't have enough labeled routing history, the seed-set approach offers a lower-cost alternative. A small, representative seed set run through both systems before deployment is enough to start. Build the seed set intentionally: include queries that span the difficulty range you expect in production, not just hard ones.

For evaluators and benchmark designers

RouteBench's three-split structure (in-domain, paraphrased, OOD) is worth borrowing as an evaluation pattern. Testing routing only on in-domain queries flatters approaches that don't generalize. If your routing evaluation doesn't include paraphrase and OOD splits, you may be overestimating how well your router will hold up in a changing production environment.

A note on scope

BoundaryRouter routes between two options: plain LLM and one agent. Most production systems are more complex. The paper establishes a clean baseline for the two-option case; applying the same principles to multi-agent routing topologies or tiered compute plans would require additional work beyond what's presented here.

05 Further exploration

Where to go
from here.

Concrete next steps for applying or extending this work.

Read the paper

Wang, Y., Qiu, J., Qi, X., Juan, X., Shi, J., Zhao, Z., Wang, H., Liu, S., & Wang, M. (2026). Learning Agent Routing From Early Experience. Princeton University, University of Michigan, Tsinghua University (IIIS), Shanghai Jiao Tong University, University of Edinburgh, King's College London. arXiv:2605.07180.

Profile your agent usage

Before building a router, log a sample of real production queries and manually classify which ones actually used agent-specific capabilities versus which could have been answered by a direct model call. This audit tells you the upper bound on routing gains and whether the investment is worth making.
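Once the sample has been labeled by hand, the audit itself is a few lines. This sketch assumes each record has a manually assigned `needed_agent` flag and, for queries that needed the agent, a `capability` tag. Both field names and the `audit` function are hypothetical.

```python
from collections import Counter


def audit(records):
    """Summarize a manually labeled sample of production queries.

    records: list of dicts with keys 'needed_agent' (bool) and
             'capability' (str, or None when the agent was not needed).
    """
    total = len(records)
    if total == 0:
        raise ValueError("no records to audit")
    needed = [r for r in records if r["needed_agent"]]
    fraction = len(needed) / total
    return {
        "agent_needed_fraction": fraction,
        # Upper bound on the share of traffic a perfect router could keep off the agent.
        "llm_sufficient_fraction": 1.0 - fraction,
        "by_capability": dict(Counter(r["capability"] for r in needed)),
    }
```

The capability breakdown is a side benefit: it shows which agent features actually earn their cost.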

Build a seed set that spans difficulty

The quality ceiling for BoundaryRouter is the representativeness of the seed set. Select seed queries that include a genuine mix of easy tasks (direct model call sufficient), medium tasks (borderline), and hard tasks (agent required). Seed sets made only of hard queries will produce a router that over-escalates.
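A simple way to enforce that mix is to sample the same number of queries from each difficulty bucket. The sketch assumes candidate queries have already been tagged as easy, medium, or hard. The tagging scheme, the equal allocation, and the `stratified_seed_set` name are assumptions rather than the paper's procedure.

```python
import random
from collections import defaultdict

DIFFICULTIES = ("easy", "medium", "hard")


def stratified_seed_set(candidates, per_bucket, seed=0):
    """Pick per_bucket queries from each difficulty level.

    candidates: list of (query, difficulty) pairs, difficulty in DIFFICULTIES.
    """
    buckets = defaultdict(list)
    for query, difficulty in candidates:
        buckets[difficulty].append(query)

    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    chosen = []
    for difficulty in DIFFICULTIES:
        pool = buckets[difficulty]
        if len(pool) < per_bucket:
            raise ValueError(f"not enough {difficulty} candidates")
        chosen.extend((q, difficulty) for q in rng.sample(pool, per_bucket))
    return chosen
```

Raising an error on a thin bucket is deliberate. A silently lopsided seed set is the failure mode to avoid.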

Test routing on OOD splits

Evaluate your router against queries from different topic areas or phrasings than your seed set. RouteBench's OOD split is a reusable template for this. A router that doesn't generalize will degrade silently in production as your query distribution drifts. Catching this before deployment is worth the evaluation time.

Compare with the related work in the queue

BoundaryRouter pairs naturally with LaTER (latent-space reasoning reduction) and the Dual-Dimensional Consistency paper on self-consistency efficiency. Together they form a cluster on the same theme: spending inference compute where it's needed rather than uniformly across all queries. Reading them together clarifies where each approach applies.
